Skip to content

test: reduce TAV test matrix for slowest jobs #3321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 5, 2023

Conversation

trentm
Copy link
Member

@trentm trentm commented May 4, 2023

This tweaks the .tav.yml config for most TAV test matrix jobs that were
observed to be taking >20 minutes in CI recently. One job was taking
longer than 40 minutes and being aborted:
test-tav (14, bluebird,got,mimic-response)

  • mimic-response: The agent instruments and will only ever instrument a
    single version: v1.0.0. Therefore there is no need for to "test all
    versions" as long as the regular non-TAV tests test that version. Lock
    package.json to that version.

  • bluebird, got, pg, mongodb-core, knex, tedious: Massively reduce the
    number of versions tested when the latest release for a given major
    version was more than 2-3 years ago.

  • fastify, aws-sdk: Update the tested ranges to test the first, last,
    and approximately 5 versions in between. (Added
    ./dev-utils/tav-versions-fastify.sh for this.)

Also:

  • Fix a test in bluebird.test.js to be less flaky. In a local run it was
    observed to fail the 'new Promise -> timeout (timed out)' test case
    with:
    Error: start + 49 should be <= 1683227232779 - was 1683227232782.

Details

Here is a table of the slowest TAV jobs on a recent run in CI. These are all
those that took >20 minutes on that run. The latest column is a count of the
number of module versions that were tested for each job.

TIME        TEST JOB                                    NUM MODULE VERSIONS TESTED
----------- ------------------------------------------  --------------------------
1217 20m17s test-tav (10, pg)                            46 versions
1217 20m17s test-tav (14, next)                          29 versions *not changed*
1255 20m55s test-tav (18, bluebird,got,mimic-response)  223 versions
1268 21m08s test-tav (14, express)                       83 versions
1290 21m30s test-tav (8, pg)                             46 versions
1306 21m46s test-tav (10, tedious)                      116 versions
1321 22m01s test-tav (10, mongodb-core)                 102 versions
1339 22m19s test-tav (12, pg)                            46 versions
1352 22m32s test-tav (14, tedious)                      143 versions -> 21
1370 22m50s test-tav (8, knex)                           68 versions
1390 23m10s test-tav (12, mongodb-core)                 102 versions
1410 23m30s test-tav (14, knex)                         106 versions
1454 24m14s test-tav (8, mongodb-core)                  102 versions -> 8
1539 25m39s test-tav (14, aws-sdk)                       77 versions
1543 25m43s test-tav (12, knex)                         106 versions
1657 27m37s test-tav (8, aws-sdk)                        77 versions
1666 27m46s test-tav (10, aws-sdk)                       77 versions
1671 27m51s test-tav (10, knex)                         106 versions -> 6
1707 28m27s test-tav (8, bluebird,got,mimic-response)   223 versions
1820 30m20s test-tav (12, aws-sdk)                       77 versions
1900 31m40s test-tav (12, bluebird,got,mimic-response)  223 versions
2001 33m21s test-tav (14, fastify)                      125 versions -> 16
2042 34m02s test-tav (10, bluebird,got,mimic-response)  223 versions
2414 40m14s test-tav (14, bluebird,got,mimic-response)  223 versions -> 5

For example for the test-tav (14, tedious) job there were 143 versions of
tedious that were installed and tested. That's a waste of resources and time.
The -> 21 added for some of the rows shows the reduced number of versions after
the changes in this PR. I only added that for some.

(Note to self: I calculated these number of module versions tests with a command like:
TAV=tedious node14 ~/tm/test-all-versions/index.js -n --verbose | \rg '^\{' | json -ga numVersions | paste -s -d+ - | bc.)

Detail on the bluebird.test.js change

In local testing I noticed this flaky test failure:

node_tests_1  | # new Promise -> timeout (reject in time)
node_tests_1  | ok 531 should be strictly equal
node_tests_1  | ok 532 should be strictly equal
node_tests_1  | ok 533 should be strictly equal
node_tests_1  | ok 534 should be strictly equal
node_tests_1  | # new Promise -> timeout (timed out)
node_tests_1  | not ok 535 start + 49 should be <= 1683227232779 - was 1683227232782
node_tests_1  |   ---
node_tests_1  |     operator: ok
node_tests_1  |     expected: true
node_tests_1  |     actual:   false
node_tests_1  |     at: <anonymous> (/app/test/instrumentation/modules/bluebird/bluebird.test.js:1156:9)
node_tests_1  |     stack: |-
node_tests_1  |       Error: start + 49 should be <= 1683227232779 - was 1683227232782
node_tests_1  |           at Test.assert [as _assert] (/app/node_modules/tape/lib/test.js:312:48)
node_tests_1  |           at Test.bound [as _assert] (/app/node_modules/tape/lib/test.js:95:17)
node_tests_1  |           at Test.assert (/app/node_modules/tape/lib/test.js:431:7)
node_tests_1  |           at Test.bound [as ok] (/app/node_modules/tape/lib/test.js:95:17)
node_tests_1  |           at /app/test/instrumentation/modules/bluebird/bluebird.test.js:1156:9
node_tests_1  |           at /app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:49
node_tests_1  |           at AsyncLocalStorage.run (async_hooks.js:303:14)
node_tests_1  |           at AsyncLocalStorageRunContextManager.with (/app/lib/instrumentation/run-context/AsyncLocalStorageRunContextManager.js:57:36)
node_tests_1  |           at wrapper (/app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:23)
node_tests_1  |           at /app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:49
node_tests_1  |           at AsyncLocalStorage.run (async_hooks.js:303:14)
node_tests_1  |           at AsyncLocalStorageRunContextManager.with (/app/lib/instrumentation/run-context/AsyncLocalStorageRunContextManager.js:57:36)
node_tests_1  |           at wrapper (/app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:23)
node_tests_1  |           at tryCatch1 (/app/node_modules/bluebird/js/main/util.js:29:21)
node_tests_1  |           at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/main/promise.js:588:13)
node_tests_1  |           at Promise._settlePromiseAt (/app/node_modules/bluebird/js/main/promise.js:752:18)
node_tests_1  |           at Promise._settlePromises (/app/node_modules/bluebird/js/main/promise.js:869:14)
node_tests_1  |           at Async._drainQueue (/app/node_modules/bluebird/js/main/async.js:78:16)
node_tests_1  |           at Async._drainQueues (/app/node_modules/bluebird/js/main/async.js:88:10)
node_tests_1  |           at Immediate.Async.drainQueues [as _onImmediate] (/app/node_modules/bluebird/js/main/async.js:13:14)
node_tests_1  |           at processImmediate (internal/timers.js:464:21)
node_tests_1  |           at process.callbackTrampoline (internal/async_hooks.js:130:17)
node_tests_1  |   ...
node_tests_1  | ok 536 should be truthy
node_tests_1  | ok 537 should be strictly equal
node_tests_1  | not ok 538 start + 49 should be <= 1683227232782 - was 1683227232783
node_tests_1  |   ---
node_tests_1  |     operator: ok
node_tests_1  |     expected: true
node_tests_1  |     actual:   false
node_tests_1  |     at: <anonymous> (/app/test/instrumentation/modules/bluebird/bluebird.test.js:1156:9)
node_tests_1  |     stack: |-
node_tests_1  |       Error: start + 49 should be <= 1683227232782 - was 1683227232783
node_tests_1  |           at Test.assert [as _assert] (/app/node_modules/tape/lib/test.js:312:48)
node_tests_1  |           at Test.bound [as _assert] (/app/node_modules/tape/lib/test.js:95:17)
node_tests_1  |           at Test.assert (/app/node_modules/tape/lib/test.js:431:7)
node_tests_1  |           at Test.bound [as ok] (/app/node_modules/tape/lib/test.js:95:17)
node_tests_1  |           at /app/test/instrumentation/modules/bluebird/bluebird.test.js:1156:9
node_tests_1  |           at /app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:49
node_tests_1  |           at AsyncLocalStorage.run (async_hooks.js:303:14)
node_tests_1  |           at AsyncLocalStorageRunContextManager.with (/app/lib/instrumentation/run-context/AsyncLocalStorageRunContextManager.js:57:36)
node_tests_1  |           at wrapper (/app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:23)
node_tests_1  |           at /app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:49
node_tests_1  |           at AsyncLocalStorage.run (async_hooks.js:314:14)
node_tests_1  |           at AsyncLocalStorageRunContextManager.with (/app/lib/instrumentation/run-context/AsyncLocalStorageRunContextManager.js:57:36)
node_tests_1  |           at wrapper (/app/lib/instrumentation/run-context/AbstractRunContextManager.js:95:23)
node_tests_1  |           at tryCatch1 (/app/node_modules/bluebird/js/main/util.js:29:21)
node_tests_1  |           at Promise._settlePromiseFromHandler (/app/node_modules/bluebird/js/main/promise.js:588:13)
node_tests_1  |           at Promise._settlePromiseAt (/app/node_modules/bluebird/js/main/promise.js:752:18)
node_tests_1  |           at Promise._settlePromises (/app/node_modules/bluebird/js/main/promise.js:869:14)
node_tests_1  |           at Async._drainQueue (/app/node_modules/bluebird/js/main/async.js:78:16)
node_tests_1  |           at Async._drainQueues (/app/node_modules/bluebird/js/main/async.js:88:10)
node_tests_1  |           at Immediate.Async.drainQueues [as _onImmediate] (/app/node_modules/bluebird/js/main/async.js:13:14)
node_tests_1  |           at processImmediate (internal/timers.js:464:21)
node_tests_1  |           at process.callbackTrampoline (internal/async_hooks.js:130:17)
node_tests_1  |   ...
node_tests_1  | ok 539 should be truthy
node_tests_1  | ok 540 should be strictly equal
node_tests_1  | # new Promise -> reject -> tap -> catch
node_tests_1  | ok 541 should be strictly equal

The bluebird.test.js change is to hopefully avoid this possible flaky failure in the future.

This tweaks the .tav.yml config for most TAV test matrix jobs that were
observed to be taking >20 minutes in CI recently. One job was taking
longer than 40 minutes and being aborted:
    test-tav (14, bluebird,got,mimic-response)

- mimic-response: The agent instruments and will only ever instrument a
    single version: v1.0.0. Therefore there is no need for to "test all
    versions" as long as the regular non-TAV tests test that version. Lock
    package.json to that version.

- bluebird, got, pg, mongodb-core, knex, tedious: Massively reduce the
    number of versions tested when the latest release for a given major
    version was more than 2-3 years ago.
- fastify, aws-sdk: Update the tested ranges to test the first, last,
    and approximately 5 versions in between. (Added
    ./dev-utils/tav-versions-fastify.sh for this.)

Also:
- Fix a test in bluebird.test.js to be less flaky. In a local run it was
    observed to fail the 'new Promise -> timeout (timed out)' test case
    with:
    `Error: start + 49 should be <= 1683227232779 - was 1683227232782`.
@trentm trentm requested a review from david-luna May 4, 2023 23:47
@trentm trentm self-assigned this May 4, 2023
@github-actions github-actions bot added the agent-nodejs Make available for APM Agents project planning. label May 4, 2023
@@ -0,0 +1,42 @@
#!/bin/sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have already a couple of files with the same logic. What do you think about a more generic script in a follow up issue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine either way. I am kind of hoping to spend some time on the side improving tav to support functionality like this natively. That's vapourware right now, however.

@trentm trentm merged commit e36f003 into main May 5, 2023
@trentm trentm deleted the trentm/tav-slow-jobs-reduce-matrix branch May 5, 2023 16:08
@trentm
Copy link
Member Author

trentm commented May 5, 2023

Timings from the TAV run after this merge (https://github.com/elastic/apm-agent-nodejs/actions/runs/4895356707):

...
934 15m34s test-tav (12, graphql)
935 15m35s test-tav (14, express)
950 15m50s test-tav (14, handlebars,pug)
954 15m54s test-tav (12, ws)
979 16m19s test-tav (10, graphql)
988 16m28s test-tav (14, redis)
999 16m39s test-tav (8, ws)
1014 16m54s test-tav (8, graphql)
1032 17m12s test-tav (14, graphql)
1046 17m26s test-tav (16, next)
1142 19m02s test-tav (18, next)
1176 19m36s test-tav (14, next)
1267 21m07s test-tav (8, express)
2413 40m13s test-tav (16, knex)
2413 40m13s test-tav (20, knex)
2414 40m14s test-tav (10, knex)
2414 40m14s test-tav (12, knex)
2414 40m14s test-tav (14, knex)
2414 40m14s test-tav (18, knex)
2416 40m16s test-tav (8, knex)

So we had some success.

  • something going on with knex that is weird
  • express, next, etc. are likely next targets for improving

v1v added a commit to v1v/apm-agent-nodejs that referenced this pull request May 8, 2023
…re/support-specific-modules

* 'main' of github.com:elastic/apm-agent-nodejs: (54 commits)
  chore: fix dev-utils/ci-tav-slow-jobs.sh (elastic#3319)
  test: reduce TAV test matrix for slowest jobs (elastic#3321)
  chore: sync package-lock so 'npm ci' can work (elastic#3318)
  docs: document `useElasticTraceparentHeader` config var (elastic#3316)
  chore, test: test driver improvements (elastic#3293)
  test: drop node 14 from RC tests now that it is EOL (elastic#3315)
  test: fix running fastify.test.js with node v8 (elastic#3317)
  feat: add @apollo/server@4 support (elastic#3203)
  chore: update nvm (elastic#3309)
  tests: stop testing 'express-graphql' instrumentation (elastic#3304)
  chore: fix bitrot.js dev util for recent changes (elastic#3308)
  test: restore testing of Azure Functions on node >=18.x (elastic#3307)
  fix: support Lambda instrumentation for `contextManager: 'patch'`; refactor Lambda tests (elastic#3305)
  test: fix fastify TAV test failures (elastic#3314)
  test: fix @aws-sdk/client-s3 TAV test failures (elastic#3312)
  feat: add instrumentation for aws-sdk S3 client (elastic#3287)
  feat(fastify): add captureBody support (elastic#2681)
  feat: mysql2@3 support (elastic#3301)
  chore(deps): bump @opentelemetry/exporter-prometheus from 0.37.0 to 0.38.0 in /test/opentelemetry-metrics/fixtures (elastic#3295)
  chore(deps-dev): bump fastify from 4.16.3 to 4.17.0 (elastic#3296)
  ...
trentm added a commit that referenced this pull request May 16, 2023
This tweaks the .tav.yml config for most TAV test matrix jobs that were
observed to be taking >20 minutes in CI recently. One job was taking
longer than 40 minutes and being aborted:
    test-tav (14, bluebird,got,mimic-response)

- mimic-response: The agent instruments and will only ever instrument a
    single version: v1.0.0. Therefore there is no need for to "test all
    versions" as long as the regular non-TAV tests test that version. Lock
    package.json to that version.

- bluebird, got, pg, mongodb-core, knex, tedious: Massively reduce the
    number of versions tested when the latest release for a given major
    version was more than 2-3 years ago.
- fastify, aws-sdk: Update the tested ranges to test the first, last,
    and approximately 5 versions in between. (Added
    ./dev-utils/tav-versions-fastify.sh for this.)

Also:
- Fix a test in bluebird.test.js to be less flaky. In a local run it was
    observed to fail the 'new Promise -> timeout (timed out)' test case
    with:
    `Error: start + 49 should be <= 1683227232779 - was 1683227232782`.
PeterEinberger pushed a commit to fpm-git/apm-agent-nodejs that referenced this pull request Aug 20, 2024
This tweaks the .tav.yml config for most TAV test matrix jobs that were
observed to be taking >20 minutes in CI recently. One job was taking
longer than 40 minutes and being aborted:
    test-tav (14, bluebird,got,mimic-response)

- mimic-response: The agent instruments and will only ever instrument a
    single version: v1.0.0. Therefore there is no need for to "test all
    versions" as long as the regular non-TAV tests test that version. Lock
    package.json to that version.

- bluebird, got, pg, mongodb-core, knex, tedious: Massively reduce the
    number of versions tested when the latest release for a given major
    version was more than 2-3 years ago.
- fastify, aws-sdk: Update the tested ranges to test the first, last,
    and approximately 5 versions in between. (Added
    ./dev-utils/tav-versions-fastify.sh for this.)

Also:
- Fix a test in bluebird.test.js to be less flaky. In a local run it was
    observed to fail the 'new Promise -> timeout (timed out)' test case
    with:
    `Error: start + 49 should be <= 1683227232779 - was 1683227232782`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
agent-nodejs Make available for APM Agents project planning.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants