
Fix git pull error in core contrib test #3357

Open
jomcgi wants to merge 3 commits into main from bug-3352/fix-transient-ci-git-errors

Conversation

@jomcgi (Contributor) commented Mar 12, 2025

Description

The core contrib test is intermittently failing when tox pulls the core repo.
Add a GitHub actions/checkout step for the core repo before running tox to mitigate this.

As the error is transient I'm not 100% sure this resolves it, but I have not encountered it in any of the GitHub Actions runs that have been triggered since.

Fixes #3352
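
A minimal sketch of the kind of step described above, placed before the Run tests step; the step name matches the diff hunk shown later in the review, while the repository and path values here are illustrative assumptions rather than the PR's literal diff:

      # Hypothetical extra step: check out the core repo locally so tox can
      # install packages from that checkout instead of pulling the remote
      # over the network (actions/checkout has retries built in).
      - name: Checkout core repo @ SHA - ${{ env.CORE_REPO_SHA }}
        uses: actions/checkout@v4
        with:
          repository: open-telemetry/opentelemetry-python
          ref: ${{ env.CORE_REPO_SHA }}
          path: opentelemetry-python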

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • Tested with CI on this PR
  • Tested CI from core repo with latest commit (successful workflow)
  • Verified logs in workflow above include git+file://.. references instead of the remote repo

Does This PR Require a Core Repo Change?

  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch 2 times, most recently from ff2f9e6 to 021c31b on March 12, 2025 16:22
@jomcgi jomcgi changed the title from "[WIP] Fix transient CI errors with tox/git retry" to "Fix transient CI errors with tox/git retry" on Mar 12, 2025
@jomcgi jomcgi marked this pull request as ready for review March 12, 2025 17:29
@jomcgi jomcgi requested a review from a team as a code owner March 12, 2025 17:29
@emdneto (Member) left a comment

My idea for this one would be to just do an actions/checkout first. If the checkout works, it means we will be able to install things from that revision.

@jomcgi (Contributor, Author) commented Mar 12, 2025

My idea for this one would be to just do an actions/checkout first. If the checkout works, it means we will be able to install things from that revision.

We already do a checkout before hitting this error; the actions/checkout action has a retry built in, which is why we do not see this error there.

I couldn't see a flag for tox that would let us use the output of that checkout in tox -e ...

    runs-on: ubuntu-latest
    steps:
      - name: Checkout contrib repo @ SHA - ${{ env.CONTRIB_REPO_SHA }}
        uses: actions/checkout@v4
        with:
          repository: open-telemetry/opentelemetry-python-contrib
          ref: ${{ env.CONTRIB_REPO_SHA }}

      - name: Set up Python 3.8
        uses: actions/setup-python@v5
        with:
          python-version: "3.8"
          architecture: "x64"

      - name: Install tox
        run: pip install tox-uv

      - name: Run tests
        run: tox -e {{ job_data.tox_env }} -- -ra

The revision always exists (it's used by all of the other actions that run successfully, and by the previous checkout step), but we fail without a retry when tox pulls it in the Run tests step.

@emdneto (Member) commented Mar 12, 2025


I mean, check out the core repo at the core repo revision we want, before the contrib one. We only have this problem for core_contrib_tests, where we test instrumentation libraries against a revision of core. So, check out core at that revision and use it.

@jomcgi (Contributor, Author) commented Mar 13, 2025

I mean, check out the core repo at the core repo revision we want, before the contrib one. We only have this problem for core_contrib_tests, where we test instrumentation libraries against a revision of core. So, check out core at that revision and use it.

Understood - I'll update my PR so that it only adds that checkout for core_contrib tests.

@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch from 021c31b to 677d607 on March 13, 2025 07:54
@jomcgi jomcgi changed the title from "Fix transient CI errors with tox/git retry" to "Fix git pull error in core contrib test" on Mar 13, 2025
@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch 2 times, most recently from 9932189 to 5764d8b on March 13, 2025 08:04
@emdneto (Member) left a comment

Thanks @jomcgi, I think this can work. But how can we make tox use the local repo to install packages instead of pointing to the remote repository? I'm wondering about configuring CORE_REPO to point to something like git+file:///absolute/path/opentelemetry-python
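
For illustration, one way to wire that up in the Run tests step, assuming the core checkout lands in a subdirectory of the workspace and that tox.ini is taught to read a CORE_REPO environment variable (both are assumptions here, not necessarily the PR's exact change):

      - name: Run tests
        env:
          # Assumed: tox.ini resolves the opentelemetry-api/sdk requirements
          # from the CORE_REPO environment variable instead of the hardcoded
          # remote URL.
          CORE_REPO: git+file://${{ github.workspace }}/opentelemetry-python
        run: tox -e {{ job_data.tox_env }} -- -ra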

@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch 2 times, most recently from 29d8889 to 47a03b9 on March 26, 2025 08:52
Core contrib is intermittently failing when
pulling the core repo in tox.
Add gh actions/checkout for the core repo
before running tox to mitigate this.
@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch from 47a03b9 to 01d7ae6 on March 26, 2025 09:14
@jomcgi (Contributor, Author) commented Mar 26, 2025

Good callout - I've updated tox.ini to allow passing the core repo as an environment variable.
Successful run from opentelemetry-python using my latest commit 01d7ae6:
https://github.com/jomcgi/opentelemetry-python/actions/runs/14079667070/job/39429499877

Logs show the expected git+file://.. reference instead of the remote repo:

py38-test-instrumentation-openai-v2-latest: 
uv pip install 
'opentelemetry-api@ git+file:///home/runner/work/opentelemetry-python/opentelemetry-python/opentelemetry-python@8cb080cf886b1bb897c2baed6cb76d826532cba4#egg=opentelemetry-api&subdirectory=opentelemetry-api' 
'opentelemetry-sdk@ git+file:///home/runner/work/opentelemetry-python/opentelemetry-python/opentelemetry-python@8cb080cf886b1bb897c2baed6cb76d826532cba4#egg=opentelemetry-sdk&subdirectory=opentelemetry-sdk' 

@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch 2 times, most recently from 9bac288 to 9cd105c on April 9, 2025 16:54
Declare the env var for the workflow instead of each individual job to reduce unnecessary repetition.
@jomcgi jomcgi force-pushed the bug-3352/fix-transient-ci-git-errors branch from 9cd105c to 3f08866 on April 9, 2025 17:06
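
The commit above hoists the variable to the workflow level; a sketch of that shape, with the value assumed to be the same git+file:// path sketched earlier rather than taken from the actual diff:

    # Hypothetical top-level block in the workflow file: declare the variable
    # once so every job's "Run tests" step can read env.CORE_REPO without
    # repeating it per job. Whether ${{ github.workspace }} resolves in
    # workflow-level env is an assumption; if not, the same line can sit at
    # the job level instead.
    env:
      CORE_REPO: git+file://${{ github.workspace }}/opentelemetry-python
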
@jomcgi (Contributor, Author) commented Apr 9, 2025

Passing test from the core repo with the latest commit:
https://github.com/jomcgi/opentelemetry-python/actions/runs/14362978923

@@ -30,6 +31,13 @@ jobs:
repository: open-telemetry/opentelemetry-python-contrib
ref: ${{ env.CONTRIB_REPO_SHA }}

- name: Checkout core repo @ SHA - ${{ env.CORE_REPO_SHA }}
A Contributor commented:

Are we performing a checkout action for each job? Wouldn't this add quite a bit of overhead to the build run time?

According to @emdneto, each checkout action takes ~1s. This will also grow linearly with the number of instrumentations we have.

[screenshot]

Seeing as we can't easily prove that this fixes the transient issue, and it only affects a very small percentage of runs (where the workaround is just to run the test again), I'm not sure this change is worth the increased run time for ALL PR runs.

@jomcgi (Contributor, Author) replied:

This specific error has happened in 7 out of the last 50 runs, which indicates it could account for a significant portion of that test's 34% failure rate:
[screenshot]

The average runner time consumed by each execution of this test over the last 30 days was 85 minutes, so the retries will be consuming a significant amount of time as well.

Maybe I'm missing permissions, but when I run into this error I can't retry failed steps; I have to run everything again.


I can investigate this further - it should be possible to retrieve the data required to quantify the impact (retries) and the frequency of this error.

A Member commented:

@lzchen, as we discussed in Slack: before, we were using two checkouts -- I removed it in #3035 because at the time I thought there was no usage 😅 So we did an "optimization" by removing the second checkout, but on the other hand, it probably led to this issue.

A Member commented:

We can test whether this PR solves the issue by opening a PR in core pointing to this branch, and I can trigger the CI a couple of times. From what I observed, every PR has at least ~1 occurrence of the error.

@emdneto (Member) commented Apr 10, 2025

@jomcgi https://github.com/open-telemetry/opentelemetry-python/actions/runs/14369769807

I left the workflow pointing to main on purpose to see if it would fail. It is still using the git+https:// reference, and there have been no failures for ~5 runs -- probably all we need is to do the checkout as was done before.


Successfully merging this pull request may close these issues.

infra: core-contrib-tests randomly fails