
Spark integration causes significant slowdowns or even the entire job to run out of memory and fail #1245


Closed
martimlobao opened this issue Nov 9, 2021 · 12 comments · Fixed by #4167
Assignees: antonpirker
Labels: Help wanted, Integration: Apache Spark, Triaged

Comments

@martimlobao

martimlobao commented Nov 9, 2021

Environment

How do you use Sentry?
Sentry SaaS (sentry.io)

Which SDK and version?
[email protected] using the Spark integration

Steps to Reproduce

This is essentially an MWE (minimal working example) of our setup:

from pyspark import SparkConf
from pyspark.sql import SparkSession

import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

# SENTRY_DSN is defined elsewhere in our code.
sentry_sdk.init(SENTRY_DSN, integrations=[SparkIntegration()])

def get_spark_context(job_name):
    conf = SparkConf().setAppName(job_name)
    # Route Python workers through our custom daemon module so the
    # worker-side Sentry SDK is initialized in the worker processes.
    conf = conf.set("spark.python.use.daemon", True)
    conf = conf.set("spark.python.daemon.module", "sentry_daemon")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    session.sparkContext.addPyFile(".../sentry_daemon.py")  # path elided
    return session.sparkContext

sc = get_spark_context("my_job")

# `batches` and `some_function` are defined elsewhere in our code.
for batch in batches:
    sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)

sc.stop()
  1. I'm able to get Sentry to log exceptions properly using the Sentry daemon and the above configuration.
  2. However, each batch takes progressively longer: without the Spark integration, each batch takes ~3 hours to run, but with the integration enabled, the first batch took 3 hours, the second took 6, the third 9, and so on.
  3. I was able to work around the slowdown by creating and stopping the Spark context within each batch instead of using a single context for the entire loop (see the sketch after this list).
  4. However, with that workaround the job eventually fails due to an out-of-memory error after a few batches, even though we have plenty of resources and have never hit this issue at this stage of our pipeline before.
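
For reference, here is a rough sketch of the workaround described in step 3 (it reuses the get_spark_context helper from the example above and is not an exact copy of our production code):

# Workaround sketch: one short-lived Spark context per batch instead of a
# single context reused for the whole loop.
for batch in batches:
    sc = get_spark_context("my_job")
    try:
        sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)
    finally:
        sc.stop()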

Expected Result

The job runs normally with the Sentry Spark integration enabled.

Actual Result

The job either takes progressively longer to finish or will eventually run out of memory and fail.

This is the stdout from the EMR cluster:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 13796"...

My understanding is that the Spark integration is not actively maintained and is considered somewhat experimental. Any help here would be greatly appreciated, even if it's just a potential workaround and not an actual fix.

@AbhiPrasad
Member

Yeah, I have a feeling this is due to the listener we add into the gateway to record breadcrumbs: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L50. I wonder if disabling it would alleviate these problems. Are you on Spark 3?

Unfortunately, we don't have the bandwidth to actively investigate this, but we can help debug in this GH issue.
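
One quick way to test that theory, as a sketch only: keep the worker-side sentry_daemon as-is, but initialize the driver SDK without the Spark integration so the driver-side listener is never registered:

# Diagnostic sketch: leave SparkIntegration() out of the driver-side init so
# no Java-gateway listener is added; the worker daemon still reports errors.
# SENTRY_DSN is assumed to be defined as in the example above.
import sentry_sdk

sentry_sdk.init(SENTRY_DSN)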

@martimlobao
Author

> Yeah, I have a feeling this is due to the listener we add into the gateway to record breadcrumbs: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L50. I wonder if disabling it would alleviate these problems. Are you on Spark 3?
>
> Unfortunately, we don't have the bandwidth to actively investigate this, but we can help debug in this GH issue.

Thanks for the reply @AbhiPrasad! We're using 2.4.6, but we've been recently looking into upgrading to Spark 3.

A workaround for now is perfectly fine for me — I just don't know what I need to work around, lol

@AbhiPrasad
Member

Can you try commenting this out: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L66? We can hide it behind a flag if it's problematic.

@martimlobao
Author

> Can you try commenting this out: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L66? We can hide it behind a flag if it's problematic.

OK, I monkey-patched the SparkIntegration class to import from a locally modified copy (it would have been harder to modify the installed package itself). The job usually takes around 24 hours to finish, so I'll update you then :)
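
In case it helps anyone else, the monkey-patch looks roughly like this (my_spark_driver is a hypothetical name for my local copy of spark_driver.py with the listener registration commented out):

# Hypothetical sketch of the monkey-patch described above.
import sentry_sdk
import sentry_sdk.integrations.spark as sentry_spark

from my_spark_driver import SparkIntegration as PatchedSparkIntegration  # local, modified copy

# Point the SDK's spark module at the patched class before initializing.
sentry_spark.SparkIntegration = PatchedSparkIntegration

sentry_sdk.init(SENTRY_DSN, integrations=[PatchedSparkIntegration()])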

@martimlobao
Author

Now that I give this a second thought, I suspect the issue might be related to the SparkWorkerIntegration on the workers instead — in a previous attempt I had enabled the Sentry daemon but forgot to add SparkIntegration itself to the sentry_sdk.init integrations, and the job still failed with an OOM error.
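
For context, our sentry_daemon.py follows the documented pattern for the worker integration fairly closely; roughly (the DSN is passed in a bit differently in our real setup):

# sentry_daemon.py (sketch): initialize the SDK in the Python worker daemon,
# then hand control back to PySpark's regular daemon entry point.
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration

import pyspark.daemon as original_daemon

sentry_sdk.init(SENTRY_DSN, integrations=[SparkWorkerIntegration()])

if __name__ == "__main__":
    original_daemon.manager()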

@martimlobao
Author

@AbhiPrasad Update: our job failed at the same point due to an OOM error again. I suspect the issue might be with the Spark daemon instead (or possibly some settings that we should adjust).

@AbhiPrasad
Member

Interesting. Yeah, the solution might just be to disable the worker integration and find another way to capture errors in workers.
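
One possible shape for that, as a sketch only: skip SparkWorkerIntegration and capture exceptions explicitly inside the functions shipped to the executors. This assumes sentry_sdk is importable and has been initialized in the worker processes (e.g. via your existing daemon module).

# Sketch: report worker errors without the worker integration by capturing
# exceptions inside the mapped function itself. capture_exception() is a
# no-op unless the SDK has been initialized in that worker process.
import sentry_sdk

def safe_some_function(record):
    try:
        return some_function(record)  # some_function from your own code
    except Exception:
        sentry_sdk.capture_exception()
        raise

# Used in place of some_function:
# sc.textFile(batch.input_path).map(safe_some_function).saveAsTextFile(batch.output_path)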

@martimlobao
Author

> Interesting. Yeah, the solution might just be to disable the worker integration and find another way to capture errors in workers.

Any idea what the fix might look like?

@AbhiPrasad
Member

Not sure @martimlobao - don't have the bandwidth to explore further unfortunately. Contributions and investigations welcome though, happy to review or bounce around ideas!

@zafercavdar

Experiencing the same issues. Could it be related to one of the sentry-python dependencies that is time-consuming to serialize/deserialize between Java and Python?

@sl0thentr0py added the Help wanted label on Jan 24, 2022
@sentrivana added the Integration: Apache Spark and Triaged labels on Oct 23, 2023
@s1gr1d removed the Type: Bug label on Dec 10, 2024
@antonpirker self-assigned this on Mar 19, 2025
@antonpirker
Member

Hey @martimlobao and @zafercavdar,

It has been a while, but I finally had some time to look at this.

I have now changed the implementation so that when an error occurs in a job, only the breadcrumbs recorded since that job started are attached.

Before, breadcrumbs from jobs processed earlier were also added to an eventual error.

I am not an expert in Apache Spark, so could you confirm that breadcrumbs from only the currently running job are enough data for debugging?
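
Conceptually (just a sketch of the idea, not the actual code in the PR), the change amounts to clearing accumulated breadcrumbs whenever the driver-side listener sees a new job start:

# Sketch of the idea only: drop breadcrumbs left over from earlier jobs so
# that an error only carries breadcrumbs from the job it occurred in.
# The callback name here is illustrative, not the real listener hook.
import sentry_sdk

def _on_job_start(job_description):
    with sentry_sdk.configure_scope() as scope:
        scope.clear_breadcrumbs()
    sentry_sdk.add_breadcrumb(level="info", message="Job {} started".format(job_description))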

@martimlobao
Author

Hey @antonpirker, thanks for following up and fixing this! I still remember this bug, crazy that it's been close to 4 years already 😅

Unfortunately I'm no longer at the company where this was an issue, so I can't test whether the bug is finally solved, but from your description it sounds exactly like what was causing the behavior I was experiencing.
