fix testing if node has gpu support #1604

Merged

Conversation

sanderegg
Member

@sanderegg sanderegg commented Jul 5, 2020

What do these changes do?

Executing docker node inspect self is not allowed on non-manager nodes in a swarm.
The alternative proposed here is to try running an nvidia-smi container, which will fail if the nvidia runtime is not set as the default on the node.
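
A minimal sketch of the probe described above, not the merged code. It assumes an aiodocker-style client whose `containers.run(config=...)` coroutine raises `DockerError` when the container cannot be created or started. The function names, the image tag and the `HostConfig` entry are illustrative assumptions; only the `Tty` and `OpenStdin` keys appear in the reviewed diff.

```python
import asyncio
from typing import Any, Dict

try:
    from aiodocker.exceptions import DockerError
except ImportError:  # fallback so the sketch still imports without aiodocker
    class DockerError(Exception):
        """Stand-in for aiodocker.exceptions.DockerError."""


def gpu_probe_config(image: str = "nvidia/cuda:10.0-base") -> Dict[str, Any]:
    """Build a container config that only runs nvidia-smi.

    The image tag is an assumption; any image that relies on the nvidia
    runtime to provide nvidia-smi would serve the same purpose.
    """
    return {
        "Cmd": ["nvidia-smi"],
        "Image": image,
        "AttachStdin": False,
        "AttachStdout": False,
        "AttachStderr": False,
        "Tty": False,
        "OpenStdin": False,
        # Assumption: let the daemon remove the probe container once it exits.
        "HostConfig": {"AutoRemove": True},
    }


async def node_has_gpu_support(docker: Any) -> bool:
    """Return True if the probe container could be started on this node."""
    try:
        await docker.containers.run(config=gpu_probe_config())
    except DockerError:
        # Starting fails when nvidia-smi is unavailable, for example when
        # the nvidia runtime is not the node's default runtime.
        return False
    return True
```

With a real client this would be awaited as `await node_has_gpu_support(aiodocker.Docker())`. Unlike docker node inspect self, it only needs permission to run containers on the local node.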

@GitHK: please test on your GPU-enabled machine.

Fixes #1603 (after being tested by @GitHK).

Related issue number

How to test

Checklist

  • Did you change any service's API? Then make sure to bundle document and upgrade version (make openapi-specs, git commit ... and then make version-*)
  • Unit tests for the changes exist
  • Runs in the swarm
  • Documentation reflects the changes
  • New module? Add your github username to .github/CODEOWNERS

@sanderegg sanderegg added this to the Huo Guo milestone Jul 5, 2020
@sanderegg sanderegg requested a review from GitHK July 5, 2020 19:10
@sanderegg sanderegg self-assigned this Jul 5, 2020

codecov bot commented Jul 5, 2020

Codecov Report

Merging #1604 into master will decrease coverage by 0.0%.
The diff coverage is 100.0%.

@@           Coverage Diff            @@
##           master   #1604     +/-   ##
========================================
- Coverage    73.7%   73.7%   -0.1%     
========================================
  Files         278     278             
  Lines       10874   10854     -20     
  Branches     1181    1175      -6     
========================================
- Hits         8015    8000     -15     
+ Misses       2516    2514      -2     
+ Partials      343     340      -3     

Flag                Coverage Δ
#integrationtests   56.8% <46.1%> (+0.1%) ⬆️
#unittests          67.3% <100.0%> (-0.1%) ⬇️

Impacted Files                                           Coverage Δ
...vices/sidecar/src/simcore_service_sidecar/utils.py    93.2% <100.0%> (+3.3%) ⬆️
.../director/src/simcore_service_director/producer.py    63.2% <0.0%> (+0.2%) ⬆️

Contributor

@GitHK GitHK left a comment

OK, it looks good on my machine.

@sanderegg sanderegg requested a review from pcrespov July 6, 2020 07:47

logger.info("Node GPU support: %s", has_gpu_support)
return has_gpu_support
config = {
Member

So I guess this image will never block when it boots?

"Tty": False,
"OpenStdin": False,
}
try:
Member

TIP: suppressing exceptions is sometimes handy and more readable

from contextlib import suppress
with suppress(aiodocker.exceptions.DockerError):
    await ...
    return True
return False
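
A self-contained sketch of the same `contextlib.suppress` pattern, using only the standard library. `ProbeError` and `run_probe` are hypothetical stand-ins for `aiodocker.exceptions.DockerError` and the container run call; they are not part of the PR.

```python
import asyncio
from contextlib import suppress


class ProbeError(Exception):
    """Hypothetical stand-in for aiodocker.exceptions.DockerError."""


async def run_probe(should_fail: bool) -> None:
    """Hypothetical probe that raises ProbeError when the check fails."""
    await asyncio.sleep(0)
    if should_fail:
        raise ProbeError("probe failed")


async def probe_succeeds(should_fail: bool) -> bool:
    with suppress(ProbeError):
        await run_probe(should_fail)
        return True  # reached only if run_probe did not raise
    return False  # reached after a ProbeError was suppressed
```

`suppress` is a regular (synchronous) context manager, so it can wrap an `await` inside a coroutine. Exceptions other than the listed types still propagate.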

Member Author

Did not realize this existed. Cool thing! But in this pre-new-sidecar era I will keep it for next time.

@sanderegg sanderegg merged commit 2594bfe into ITISFoundation:master Jul 7, 2020
@sanderegg sanderegg deleted the bugfix/detecting_gpu_on_node branch July 7, 2020 06:53
@odeimaiz odeimaiz mentioned this pull request Aug 4, 2020
odeimaiz added a commit that referenced this pull request Aug 4, 2020
- UI/UX improvements (#1657)
- Bump yarl from 1.4.2 to 1.5.1 in /packages/postgres-database (#1665)
- Bump ujson from 3.0.0 to 3.1.0 in /packages/service-library (#1664)
- Bump pytest-docker from 0.7.2 to 0.8.0 in /packages/service-library (#1647)
- Improving storage performance (#1659)
- Bump aiozipkin from 0.6.0 to 0.7.0 in /packages/service-library (#1642)
- Theming (#1656)
- Platform stability:  (#1645)
- is1594 fix and re-activate e2e testing (#1620)
- 2 bugs fixed + Some improvements (#1634)
- Fixes default (#1640)
- Bump lodash from 4.17.15 to 4.17.19 (#1639)
- Is1585/cleanup storage (#1586)
- Fixes on publish studies handling (#1632)
- Some enhancements and bug fixes (#1608)
- Improve e2e  (#1631)
- filter studies by name before deleting them (#1629)
- Maintenance/upgrades test tools (#1628)
- Bugfix/concurent opening projects (#1598)
- Bugfix/allow reading groups anonymous user (#1615)
- Bump docker from 4.2.1 to 4.2.2 in /packages/postgres-database (#1605)
- fix testing if node has gpu support (#1604)
- [bugfix] Invalidate cache before starting a study (#1602)
- Feature/fix e2e 2 (#1600)
- fix deploy not needing e2e testing since it is disabled
- reduce cardinality of metrics (#1593)
- Excudes e2e stage from include until fixed (#1595)
- Shared project concurrency (frontend) (#1591)
- Homogenize studies and services (#1569)
- [feature] UI Fine grained access - project locking and notification
- Bugfix/apiserver does not need sslheaders (#1564)
- Cleanup catalog service (#1582)
- Maintenance/cleanup api server (#1578)
- Adds support for GPU scheduling of computational services (#1553)
- Maintenance/upgrades and tooling (#1546)
- Is1570/study fails 500 (#1572)
- Bump faker from 4.1.0 to 4.1.1 in /packages/postgres-database (#1573)
- maintenance fix codecov reports (#1568)
- Manage groups, Share studies (#1512)
- Is/add notebook migration script (#1565)
- Is1269/api-server upgrade (#1475)
- added simcore_webserver_service in pytest simcore package (#1563)
- add traefik endpoint to api-gateway (#1555)
@sanderegg sanderegg mentioned this pull request Aug 21, 2020
Successfully merging this pull request may close these issues.

check for GPU in sidecar breaks on non-manager nodes
3 participants