
Spark integration causes significant slowdowns or even the entire job to run out of memory and fail #1245


Closed
martimlobao opened this issue Nov 9, 2021 · 12 comments · Fixed by #4167
Assignees: antonpirker
Labels: Help wanted, Integration: Apache Spark, Triaged

Comments

@martimlobao

martimlobao commented Nov 9, 2021

Environment

How do you use Sentry?
Sentry SaaS (sentry.io)

Which SDK and version?
[email protected] using the Spark integration

Steps to Reproduce

This is essentially an MWE (minimal working example) of our setup:

from pyspark import SparkConf
from pyspark.sql import SparkSession

import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

# SENTRY_DSN is defined elsewhere in our code.
sentry_sdk.init(SENTRY_DSN, integrations=[SparkIntegration()])

def get_spark_context(job_name):
    conf = SparkConf().setAppName(job_name)
    # Route Python workers through our custom daemon module so the
    # worker-side Sentry SDK is initialized in the worker processes.
    conf = conf.set("spark.python.use.daemon", True)
    conf = conf.set("spark.python.daemon.module", "sentry_daemon")
    session = SparkSession.builder.config(conf=conf).getOrCreate()
    session.sparkContext.addPyFile(".../sentry_daemon.py")  # path elided
    return session.sparkContext

sc = get_spark_context("my_job")

# `batches` and `some_function` are defined elsewhere in our code.
for batch in batches:
    sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)

sc.stop()
  1. I'm able to get Sentry to log exceptions properly using the Sentry daemon and the above configuration.
  2. However, each batch takes progressively longer: without the Spark integration, each batch takes ~3 hours to run, but with the integration enabled, the first batch took 3 hours, the second took 6, the third 9, and so on.
  3. I was able to work around the slowdown by creating and stopping the Spark context within each batch instead of using a single context for the entire loop (see the sketch after this list).
  4. However, with that workaround the job eventually fails due to an out-of-memory error after a few batches, even though we have plenty of resources and have never hit this issue at this stage of our pipeline before.
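
For reference, here is a rough sketch of the workaround described in step 3 (it reuses the get_spark_context helper from the example above and is not an exact copy of our production code):

# Workaround sketch: one short-lived Spark context per batch instead of a
# single context reused for the whole loop.
for batch in batches:
    sc = get_spark_context("my_job")
    try:
        sc.textFile(batch.input_path).map(some_function).saveAsTextFile(batch.output_path)
    finally:
        sc.stop()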

Expected Result

The job runs normally with the Sentry Spark integration enabled.

Actual Result

The job either takes progressively longer to finish or will eventually run out of memory and fail.

This is the stdout from the EMR cluster:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 13796"...

My understanding is that the Spark integration is not actively maintained and is considered somewhat experimental. Any help here would be greatly appreciated, even if it's just a potential workaround and not an actual fix.

@AbhiPrasad
Member

Yeah, I have a feeling this is due to the listener we add into the gateway to record breadcrumbs: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L50. I wonder if disabling it would alleviate these problems. Are you on Spark 3?

Unfortunately, we don't have the bandwidth to actively investigate this, but we can help debug in this GH issue.
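
One quick way to test that theory, as a sketch only: keep the worker-side sentry_daemon as-is, but initialize the driver SDK without the Spark integration so the driver-side listener is never registered:

# Diagnostic sketch: leave SparkIntegration() out of the driver-side init so
# no Java-gateway listener is added; the worker daemon still reports errors.
# SENTRY_DSN is assumed to be defined as in the example above.
import sentry_sdk

sentry_sdk.init(SENTRY_DSN)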

@martimlobao
Author

> Yeah, I have a feeling this is due to the listener we add into the gateway to record breadcrumbs: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L50. I wonder if disabling it would alleviate these problems. Are you on Spark 3?
>
> Unfortunately, we don't have the bandwidth to actively investigate this, but we can help debug in this GH issue.

Thanks for the reply @AbhiPrasad! We're using 2.4.6, but we've been recently looking into upgrading to Spark 3.

A workaround for now is perfectly fine for me — I just don't know what I need to work around, lol

@AbhiPrasad
Member

Can you try commenting this out: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L66? We can hide it behind a flag if it's problematic.

@martimlobao
Author

> Can you try commenting this out: https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_driver.py#L66? We can hide it behind a flag if it's problematic.

OK, I monkey-patched the SparkIntegration class to import from a locally modified copy (it would have been harder to modify the installed package itself). The job usually takes around 24 hours to finish, so I'll update you then :)
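
In case it helps anyone else, the monkey-patch looks roughly like this (my_spark_driver is a hypothetical name for my local copy of spark_driver.py with the listener registration commented out):

# Hypothetical sketch of the monkey-patch described above.
import sentry_sdk
import sentry_sdk.integrations.spark as sentry_spark

from my_spark_driver import SparkIntegration as PatchedSparkIntegration  # local, modified copy

# Point the SDK's spark module at the patched class before initializing.
sentry_spark.SparkIntegration = PatchedSparkIntegration

sentry_sdk.init(SENTRY_DSN, integrations=[PatchedSparkIntegration()])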

@martimlobao
Author

Now that I give this a second thought, I suspect the issue might be related to the SparkWorkerIntegration on the workers instead — in a previous attempt I had enabled the Sentry daemon but forgot to add SparkIntegration itself to the sentry_sdk.init integrations, and the job still failed with an OOM error.
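
For context, our sentry_daemon.py follows the documented pattern for the worker integration fairly closely; roughly (the DSN is passed in a bit differently in our real setup):

# sentry_daemon.py (sketch): initialize the SDK in the Python worker daemon,
# then hand control back to PySpark's regular daemon entry point.
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration

import pyspark.daemon as original_daemon

sentry_sdk.init(SENTRY_DSN, integrations=[SparkWorkerIntegration()])

if __name__ == "__main__":
    original_daemon.manager()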

@martimlobao
Author

@AbhiPrasad Update: our job failed at the same point due to an OOM error again. I suspect the issue might be with the Spark daemon instead (or possibly some settings that we should adjust).

@AbhiPrasad
Member

Interesting. Yeah, the solution might just be to disable the worker integration and find another way to capture errors in workers.
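
One possible shape for that, as a sketch only: skip SparkWorkerIntegration and capture exceptions explicitly inside the functions shipped to the executors. This assumes sentry_sdk is importable and has been initialized in the worker processes (e.g. via your existing daemon module).

# Sketch: report worker errors without the worker integration by capturing
# exceptions inside the mapped function itself. capture_exception() is a
# no-op unless the SDK has been initialized in that worker process.
import sentry_sdk

def safe_some_function(record):
    try:
        return some_function(record)  # some_function from your own code
    except Exception:
        sentry_sdk.capture_exception()
        raise

# Used in place of some_function:
# sc.textFile(batch.input_path).map(safe_some_function).saveAsTextFile(batch.output_path)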

@martimlobao
Author

> Interesting. Yeah, the solution might just be to disable the worker integration and find another way to capture errors in workers.

Any idea what the fix might look like?

@AbhiPrasad
Member

Not sure @martimlobao - don't have the bandwidth to explore further unfortunately. Contributions and investigations welcome though, happy to review or bounce around ideas!

@zafercavdar

Experiencing the same issues. Could it be related to one of the sentry-python dependencies that is time-consuming to serialize/deserialize between Java and Python?

@sl0thentr0py added the Help wanted label on Jan 24, 2022
@sentrivana added the Integration: Apache Spark and Triaged labels on Oct 23, 2023
@s1gr1d removed the Type: Bug label on Dec 10, 2024
@antonpirker self-assigned this on Mar 19, 2025
@antonpirker
Member

Hey @martimlobao and @zafercavdar,

It has been a while, but I finally had some time to look at this.

I have now changed the implementation so that when an error occurs in a job, only the breadcrumbs recorded since that job started are attached.

Before, breadcrumbs from jobs processed earlier were also added to an eventual error.

I am not an expert in Apache Spark, so could you confirm that breadcrumbs from only the currently running job are enough data for debugging?
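
Conceptually (just a sketch of the idea, not the actual code in the PR), the change amounts to clearing accumulated breadcrumbs whenever the driver-side listener sees a new job start:

# Sketch of the idea only: drop breadcrumbs left over from earlier jobs so
# that an error only carries breadcrumbs from the job it occurred in.
# The callback name here is illustrative, not the real listener hook.
import sentry_sdk

def _on_job_start(job_description):
    with sentry_sdk.configure_scope() as scope:
        scope.clear_breadcrumbs()
    sentry_sdk.add_breadcrumb(level="info", message="Job {} started".format(job_description))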

@martimlobao
Author

Hey @antonpirker, thanks for following up and fixing this! I still remember this bug, crazy that it's been close to 4 years already 😅

Unfortunately I'm no longer at the company where this was an issue, so I can't test whether the bug is finally solved, but from your description it sounds exactly like what was causing the behavior I was experiencing.
