Spark integration causes significant slowdowns or even the entire job to run out of memory and fail #1245
Comments
Yeah, I have a feeling this is due to the listener we add into the gateway to add breadcrumbs: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L50. I wonder if disabling it will alleviate these problems. Are you on Spark 3? Unfortunately, we don't have the bandwidth to actively investigate this, but we can help debug in this GH issue.
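For context, the mechanism being discussed is roughly the pattern sketched below: a Python object is registered as a Spark listener through the Py4J gateway, so every scheduler event triggers a Java-to-Python callback. This is a simplified illustration, not the SDK's actual code; a complete listener would need to stub out every method of `SparkListenerInterface`, and only two callbacks are shown here.

```python
# Simplified illustration of registering a Python-side Spark listener through
# the Py4J gateway (roughly the pattern used to record breadcrumbs). Not the
# SDK's actual code: a real listener must implement every callback of
# SparkListenerInterface; only two are shown here for brevity.
from pyspark import SparkContext
from pyspark.java_gateway import ensure_callback_server_started


class BreadcrumbListener:
    def onJobStart(self, jobStart):
        # In the real integration this would add a Sentry breadcrumb.
        print("job started:", jobStart.jobId())

    def onJobEnd(self, jobEnd):
        print("job ended:", jobEnd.jobId())

    class Java:
        # Tells Py4J which JVM interface this Python object implements.
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]


sc = SparkContext.getOrCreate()
ensure_callback_server_started(sc._gateway)
# From here on, every scheduler event crosses the Java <-> Python bridge,
# which is the suspected source of the slowdown.
sc._jsc.sc().addSparkListener(BreadcrumbListener())
```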
Thanks for the reply @AbhiPrasad! We're using Spark 3. A workaround for now is perfectly fine for me — I just don't know what I need to work around lol
Can you try commenting this out: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L66? We can hide it behind a flag if it's problematic.
Ok, I monkey-patched it out as suggested.
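The exact patch isn't preserved above, but a workaround in that spirit might look like the sketch below. It assumes the driver integration exposes a private module-level helper named `_start_sentry_listener` that attaches the listener; that name is an internal detail and may change between releases, so treat this as a sketch rather than a supported API.

```python
# Hypothetical workaround: replace the driver-side listener setup with a no-op
# so no breadcrumb callbacks cross the Py4J gateway. `_start_sentry_listener`
# is assumed to be the SDK-internal helper that registers the listener; being
# private, it is not a stable interface.
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration, spark_driver

# Must run before the SparkContext is created.
spark_driver._start_sentry_listener = lambda sc: None

sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project>",  # placeholder DSN
    integrations=[SparkIntegration()],
)
```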
Now that I give this a second thought, I have a suspicion that the issue might be related to something on the worker side.
@AbhiPrasad Update: our job failed at the same point due to an OOM error again. I suspect the issue might be with the Spark daemon instead (or possibly some settings that we should adjust).
Interesting. Yeah, the solution might just be to disable the worker integration and find another way to capture errors in workers.
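As a rough illustration of that idea: initialize Sentry on the driver only (no `SparkWorkerIntegration` inside worker code) and surface worker failures by letting them propagate to the driver, where they are captured explicitly. The DSN, app name, and workload below are placeholders; this is a sketch of one possible approach, not an endorsed fix.

```python
# Driver-only Sentry setup: the worker-side integration is simply not enabled.
# Worker failures fail the task, resurface on the driver as a Spark/Py4J
# exception, and are reported explicitly from the driver process.
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration
from pyspark.sql import SparkSession

sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project>",  # placeholder DSN
    integrations=[SparkIntegration()],  # driver only; no SparkWorkerIntegration
)

spark = SparkSession.builder.appName("driver-only-sentry").getOrCreate()


def risky_transform(value):
    # A worker-side failure raises here and fails the task.
    if value == 13:
        raise ValueError(f"bad value: {value}")
    return value * 2


try:
    spark.sparkContext.parallelize(range(100)).map(risky_transform).collect()
except Exception:
    # Capture the propagated worker error from the driver.
    sentry_sdk.capture_exception()
    raise
```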
Any idea what the fix might look like?
Not sure @martimlobao - I don't have the bandwidth to explore further, unfortunately. Contributions and investigations are welcome though, happy to review or bounce around ideas!
Experiencing the same issues. Could it be related to one of the sentry-python dependencies being time-consuming to serialize/deserialize between Java and Python?
Hey @martimlobao and @zafercavdar, it has been a while, but I finally had some time to look at this. I have now changed the implementation so that when an error occurs in a job, only the breadcrumbs recorded since that job started are attached. Previously, breadcrumbs from jobs processed before the current job were also added to an eventual error. I am not an expert in Apache Spark, so could you confirm that breadcrumbs from only the currently running job are enough data for debugging?
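A conceptual sketch of that behavior follows. It is not the SDK's actual implementation, and it assumes a recent sentry-sdk 2.x API (`get_isolation_scope()` and `Scope.clear_breadcrumbs()`): whenever a new job starts, previously accumulated breadcrumbs are dropped, so an error only carries breadcrumbs from the job it occurred in.

```python
# Conceptual sketch (not the SDK's implementation): clear accumulated
# breadcrumbs at job start so errors only carry breadcrumbs from the
# currently running job. Assumes sentry-sdk 2.x.
import sentry_sdk


def on_job_start(job_id):
    # Discard breadcrumbs left over from previously completed jobs.
    sentry_sdk.get_isolation_scope().clear_breadcrumbs()
    sentry_sdk.add_breadcrumb(
        category="spark", message=f"Job {job_id} started", level="info"
    )
```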
Hey @antonpirker, thanks for following up and fixing this! I still remember this bug, crazy that it's been close to 4 years already 😅 Unfortunately I'm no longer at the company where this was an issue, so I can't test whether the bug is finally solved, but from your description it sounds exactly like what was causing the behavior I was experiencing.
Environment
How do you use Sentry?
Sentry SaaS (sentry.io)
Which SDK and version?
[email protected] with the Spark integration
Steps to Reproduce
This is essentially an MWE of what our setup looks like:
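The original snippet was not preserved here, so the following is a hypothetical reconstruction of the kind of setup described: `sentry_sdk.init` with the Spark integration on the driver, followed by an ordinary PySpark job. The DSN, app name, and workload are placeholders, not the reporter's actual code.

```python
# Hypothetical reconstruction of the reported setup (the original MWE was not
# preserved). Values are illustrative placeholders.
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration
from pyspark.sql import SparkSession

# Sentry must be initialized before the SparkContext is created so the
# integration can attach its listener.
sentry_sdk.init(
    dsn="https://<key>@<org>.ingest.sentry.io/<project>",  # placeholder DSN
    integrations=[SparkIntegration()],
)

spark = SparkSession.builder.appName("sentry-spark-repro").getOrCreate()

# A long-running job with many stages/tasks; each scheduler event fires a
# listener callback on the driver, which is where the slowdown was observed.
df = spark.range(0, 100_000_000)
result = df.selectExpr("id % 1000 AS bucket").groupBy("bucket").count().collect()

spark.stop()
```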
Expected Result
The job would run normally with the Sentry Spark integration enabled.
Actual Result
The job either takes progressively longer to finish or eventually runs out of memory and fails.
This is the stdout of the EMR cluster:

My understanding is that the Spark integration is not being actively maintained and is considered somewhat experimental. Any help here would be greatly appreciated, even if just a potential workaround and not an actual fix.