Refactor BatchLogRecordProcessor and associated tests #4535

Merged: 7 commits into open-telemetry:main on Apr 24, 2025

Conversation

@DylanRussell (Contributor) commented on Apr 9, 2025:

Description

Refactor BatchLogRecordProcessor, keeping the existing behavior mostly the same. This PR cleans up the code, including the tests, and also adds some new tests.

One exception is forceFlush, which now calls export synchronously from the main thread and waits for it to finish.

Previously, forceFlush would wait up to timeout_millis for the worker thread to make and finish an export call, and if an export call was already in progress it would wait for the subsequent export call to finish. It would return true if this export call completed in time and false otherwise. It didn't cancel the request after the timeout; it just stopped waiting for it to finish.

I think ideally forceFlush.timeout_millis (and also shutdown.timeout_millis) should be used as the time after which the export call(s) get cancelled. But for that to work we need to be able to pass a timeout to export, as proposed in #4183. Until then I think we should ignore it and document that it doesn't work.

I'm not sure what forceFlush should return; currently I have it return nothing (same as JavaScript). It could always return True, to signify that export was called until the queue was empty. Or it could return True if all export calls succeeded and False otherwise, stopping the flush after the first failed export, like Go does.

I think my proposed behavior is more in line with the spec too.
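To make the proposed behavior concrete, here is a minimal sketch (not the SDK's actual code; the class and attribute names are assumptions for illustration) of force_flush exporting synchronously from the calling thread until the queue is empty:

```python
import threading


# Minimal sketch, assuming a processor that keeps pending records in
# self._queue guarded by self._queue_lock and exports via self._exporter.
# All names are illustrative, not the SDK's actual internals.
class SketchBatchLogRecordProcessor:
    def __init__(self, exporter, max_export_batch_size: int = 512):
        self._exporter = exporter
        self._max_export_batch_size = max_export_batch_size
        self._queue = []
        self._queue_lock = threading.Lock()

    def force_flush(self, timeout_millis: int = 30000) -> None:
        # timeout_millis is accepted but ignored for now (see above); honoring
        # it would require export() to take a timeout, as proposed in #4183.
        while True:
            with self._queue_lock:
                if not self._queue:
                    return
                batch = self._queue[: self._max_export_batch_size]
                del self._queue[: self._max_export_batch_size]
            self._exporter.export(batch)
```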

Note that the default for forceFlush.timeout_millis came from the OTEL_BLRP_EXPORT_TIMEOUT environment variable, which is supposed to configure "the maximum allowed time to export data from the BatchLogRecordProcessor". I propose we leave this env var unused for now and document that it doesn't do anything. This flag seems redundant with the OTLP exporter timeout env vars anyway. Maybe in other languages the BatchLogRecordProcessor isn't the default one used for auto instrumentation, so it makes more sense for it to be configurable?

Type of change

Please delete options that are not relevant.

  • [x] Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added lots of unit tests.

Does This PR Require a Contrib Repo Change?

  • [ ] Yes. Link to PR:
  • [x] No.

Checklist:

  • [x] Followed the style guidelines of this project
  • [ ] Changelogs have been updated
  • [x] Unit tests have been added
  • [ ] Documentation has been updated

@DylanRussell requested a review from a team as a code owner on April 9, 2025 at 20:06
@aabmass (Member) commented on Apr 16, 2025:

This flag seems redundant with the OTLP Exporter timeout env vars anyway. Maybe in other languages the BatchLogRecordProcessor isn't the default one used for auto instrumentation, so it makes more sense for it to be configurable?

The OTEL_BLRP_EXPORT_TIMEOUT should work with all exporters, not just OTLP. I think the intention of having a separate one for OTLP is to specifically target OTLP exporters in case there are multiple BLRP instances. It's definitely a bit clunky though.
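For reference, the two knobs apply at different layers; a rough sketch of where each environment variable would be read (the plumbing, default, and units here are assumptions for illustration, not the SDK's actual code):

```python
import os

# OTEL_BLRP_EXPORT_TIMEOUT configures the BatchLogRecordProcessor itself, so it
# applies to whichever exporter the processor wraps; OTEL_EXPORTER_OTLP_TIMEOUT
# applies only to OTLP exporters. Default and units below are assumptions.
processor_export_timeout_millis = int(
    os.environ.get("OTEL_BLRP_EXPORT_TIMEOUT", "30000")
)
otlp_exporter_timeout = os.environ.get("OTEL_EXPORTER_OTLP_TIMEOUT")  # exporter-level knob
```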

@aabmass (Member) left a comment:

This looks like a huge improvement to the complexity of the threading code 😃

I'd like to get some more eyes on this since concurrency bugs can be really subtle.

@aabmass (Member) commented on Apr 16, 2025:

I think the failing Windows run is pretty typical of what we see with sleep() in tests: https://github.com/open-telemetry/opentelemetry-python/actions/runs/14366019090/job/40279304137?pr=4535. It might pass on a future run, but please try to improve the flakiness if you can.

@pmcollins (Member) left a comment:

LGTM, thanks for these important changes!

My only feedback here is maybe it would also be helpful to try to eventually factor some classes out. For example, self._queue and self._queue_lock are often used together and perhaps could be in their own class. Also, more generally, we're doing batching for spans and logs -- could we use one generic batcher that could handle both signals?
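A very rough sketch of that direction (purely illustrative, not code from this PR): a signal-agnostic batching queue that owns the list and its lock, which both processors could reuse:

```python
import threading
from typing import Generic, List, TypeVar

T = TypeVar("T")


# Illustrative only: a generic batcher that owns the queue and its lock, so
# BatchSpanProcessor and BatchLogRecordProcessor could share the batching
# logic and differ only in what they enqueue and how they export it.
class BatchQueue(Generic[T]):
    def __init__(self, max_queue_size: int = 2048) -> None:
        self._items: List[T] = []
        self._lock = threading.Lock()
        self._max_queue_size = max_queue_size
        self.dropped = 0

    def put(self, item: T) -> bool:
        # Enqueue an item; drop it (and count the drop) if the queue is full.
        with self._lock:
            if len(self._items) >= self._max_queue_size:
                self.dropped += 1
                return False
            self._items.append(item)
            return True

    def take_batch(self, max_batch_size: int) -> List[T]:
        # Remove and return up to max_batch_size items from the front.
        with self._lock:
            batch = self._items[:max_batch_size]
            del self._items[:max_batch_size]
            return batch
```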

@DylanRussell (Contributor, Author) commented:

Added a buffer to that test that flaked, thanks for pointing that out. Hopefully it passes this time.
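For what it's worth, one common way to reduce this kind of flakiness (a general pattern, not necessarily what the test here does) is to wait on an event with a generous timeout instead of sleeping for a fixed interval:

```python
import threading

# General pattern, not the PR's actual test code: instead of time.sleep() and
# hoping the worker thread has exported by then, have a fake exporter signal
# an event and wait on it with a generous upper bound, so slow CI machines
# (e.g. the Windows runners) don't cause spurious failures.
export_called = threading.Event()


class FakeExporter:
    def export(self, batch):
        export_called.set()


# ... hand FakeExporter() to the processor and emit enough records to trigger
# an export ...

assert export_called.wait(timeout=5.0), "export was not called within 5 seconds"
```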

@DylanRussell (Contributor, Author) commented:

My only feedback here is maybe it would also be helpful to try to eventually factor some classes out. For example, self._queue and self._queue_lock are often used together and perhaps could be in their own class. Also, more generally, we're doing batching for spans and logs -- could we use one generic batcher that could handle both signals?

Sounds good! I will look into this. I was planning to fix the BatchSpanProcessor code, which works the exact same way, so some generic batch class makes a lot of sense. I think I'll do that in a separate PR though, since this one is already getting big.

@DylanRussell (Contributor, Author) commented:

Can someone add the Skip Changelog tag? I don't think this needs a changelog, since it's basically just a refactor and not changing behavior.

@DylanRussell (Contributor, Author) commented:

Alright, I think this is good to merge; it just needs the Skip Changelog tag and then for someone to push it.

@lzchen (Contributor) left a comment:

Much cleaner than before, thanks!

@lzchen added the Skip Changelog label (PRs that do not require a CHANGELOG.md entry) on Apr 23, 2025
@lzchen merged commit 00329e0 into open-telemetry:main on Apr 24, 2025
477 of 481 checks passed
DylanRussell added a commit to DylanRussell/opentelemetry-python that referenced this pull request Apr 29, 2025
DylanRussell added a commit to DylanRussell/opentelemetry-python that referenced this pull request Apr 30, 2025