[SPARK-29341][Python] Upgrade cloudpickle to 1.0.0 #26009


Closed
wants to merge 1 commit into from

Conversation

@viirya (Member) commented Oct 3, 2019

What changes were proposed in this pull request?

This patch upgrades cloudpickle to version 1.0.0.

Main changes:

  1. Clean up unused functions: cloudpipe/cloudpickle@936f16f
  2. Fix relative imports inside function bodies: cloudpipe/cloudpickle@31ecdd6
  3. Write keyword-only arguments to the pickle stream: cloudpipe/cloudpickle@6cb4718

Why are the changes needed?

We should pick up new bug fixes such as cloudpipe/cloudpickle@6cb4718, because users might use such Python functions in PySpark. Before:

>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
19/10/03 00:42:24 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 598, in main
    process()
  File "/spark/python/lib/pyspark.zip/pyspark/worker.py", line 590, in process
    serializer.dump_stream(out_iter, outfile)
  File "/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 513, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/spark/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
TypeError: f() missing 1 required keyword-only argument: 'b'

After:

>>> def f(a, *, b=1):
...   return a + b
...
>>> rdd = sc.parallelize([1, 2, 3])
>>> rdd.map(f).collect()
[2, 3, 4]
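The same fix can be seen without a Spark cluster. A minimal sketch (assuming cloudpickle >= 1.0.0 is installed) that round-trips a function with a keyword-only default through cloudpickle directly:

```python
# Minimal reproduction outside Spark: with cloudpickle >= 1.0.0,
# keyword-only defaults survive a pickle round trip.
import pickle

import cloudpickle  # assumed installed, version >= 1.0.0

def f(a, *, b=1):
    return a + b

# Serialize the function with cloudpickle (as PySpark does for closures),
# then restore it with the standard pickle module.
g = pickle.loads(cloudpickle.dumps(f))
print([g(x) for x in [1, 2, 3]])  # [2, 3, 4]
```

With cloudpickle older than 1.0.0, the restored function would lose the default for `b` and calling `g(1)` would raise the `TypeError` shown above.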

Does this PR introduce any user-facing change?

Yes. This fixes two bugs when pickling Python functions: relative imports inside function bodies and keyword-only arguments.

How was this patch tested?

Existing tests.

@SparkQA commented Oct 3, 2019

Test build #111726 has finished for PR 26009 at commit 496660b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented:

cc @HyukjinKwon

@HyukjinKwon (Member) commented:

Merged to master.
