
Commit 4c5d502 — Merge remote-tracking branch 'upstream/master'

2 parents: e4b81af + 61cfeb5

133 files changed: +5520 −1254 lines


.gitignore (+3)

```diff
@@ -367,3 +367,6 @@ hs_err_pid*
 
 # The target folder contains the output of building
 **/target/**
+
+# F# vs code
+.ionide/
```

NuGet.config (+2)

```diff
@@ -6,5 +6,7 @@
     <add key="dotnet-core" value="https://dotnetfeed.blob.core.windows.net/dotnet-core/index.json" />
     <add key="dotnet-tools" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-tools/nuget/v3/index.json" />
     <add key="dotnet-eng" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-eng/nuget/v3/index.json" />
+    <add key="dotnet5" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" />
+    <add key="dotnet-try" value="https://dotnet.myget.org/F/dotnet-try/api/v3/index.json" />
   </packageSources>
 </configuration>
```
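For context, each `<add>` entry under `<packageSources>` maps a feed name to a NuGet v3 index URL. A quick, illustrative way to list the sources a `NuGet.config` excerpt declares — using Python's standard-library XML parser, not part of this repo's tooling:

```python
import xml.etree.ElementTree as ET

# Excerpt of the NuGet.config above (three of the feeds, for brevity).
config = """<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="dotnet-eng" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-eng/nuget/v3/index.json" />
    <add key="dotnet5" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet5/nuget/v3/index.json" />
    <add key="dotnet-try" value="https://dotnet.myget.org/F/dotnet-try/api/v3/index.json" />
  </packageSources>
</configuration>"""

# Collect feed name -> URL from every <add> element.
feeds = {a.get("key"): a.get("value")
         for a in ET.fromstring(config).iter("add")}
print(sorted(feeds))  # ['dotnet-eng', 'dotnet-try', 'dotnet5']
```

The same parse works on the full file; `dotnet nuget list source` gives an equivalent view from the CLI.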

README.md (+1 −1)

```diff
@@ -39,7 +39,7 @@
     <tbody align="center">
         <tr>
             <td>2.3.*</td>
-            <td rowspan=6><a href="https://github.com/dotnet/spark/releases/tag/v0.11.0">v0.11.0</a></td>
+            <td rowspan=6><a href="https://github.com/dotnet/spark/releases/tag/v0.12.1">v0.12.1</a></td>
         </tr>
         <tr>
             <td>2.4.0</td>
```

azure-pipelines.yml (+485 −410)

Large diff not rendered.

benchmark/scala/pom.xml (+1 −1)

```diff
@@ -3,7 +3,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>com.microsoft.spark</groupId>
   <artifactId>microsoft-spark-benchmark</artifactId>
-  <version>0.11.0</version>
+  <version>0.12.1</version>
   <inceptionYear>2019</inceptionYear>
   <properties>
     <encoding>UTF-8</encoding>
```

docs/broadcast-guide.md (new file, +92)

# Guide to using Broadcast Variables

This is a guide on how to use broadcast variables in .NET for Apache Spark.

## What are Broadcast Variables

[Broadcast variables in Apache Spark](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) are a mechanism for sharing read-only variables across executors. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.

### How to use broadcast variables in .NET for Apache Spark

Broadcast variables are created from a variable `v` by calling `SparkContext.Broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `Value()` method.

Example:

```csharp
string v = "Variable to be broadcast";
Broadcast<string> bv = SparkContext.Broadcast(v);

// Using the broadcast variable in a UDF:
Func<Column, Column> udf = Udf<string, string>(
    str => $"{str}: {bv.Value()}");
```

The type parameter of `Broadcast` should be the type of the variable being broadcast.

### Deleting broadcast variables

A broadcast variable can be deleted from all executors by calling the `Destroy()` method on it.

```csharp
// Destroying the broadcast variable bv:
bv.Destroy();
```

> Note: `Destroy()` deletes all data and metadata related to the broadcast variable. Use it with caution: once a broadcast variable has been destroyed, it cannot be used again.

#### Caveat of using Destroy

One important thing to keep in mind when using broadcast variables in UDFs is to limit the scope of each variable to the UDF that references it. The [guide to using UDFs](udf-guide.md) describes this behavior in detail, and it is especially crucial when calling `Destroy`. If a destroyed broadcast variable is visible to or accessible from other UDFs, it gets picked up for serialization by all of those UDFs, even if they never reference it. This throws an error, because .NET for Apache Spark cannot serialize a destroyed broadcast variable.

Example:

```csharp
string v = "Variable to be broadcast";
Broadcast<string> bv = SparkContext.Broadcast(v);

// Using the broadcast variable in a UDF:
Func<Column, Column> udf1 = Udf<string, string>(
    str => $"{str}: {bv.Value()}");

// Destroying bv
bv.Destroy();

// Calling udf1 after destroying bv throws the following expected exception:
// org.apache.spark.SparkException: Attempted to use Broadcast(0) after it was destroyed
df.Select(udf1(df["_1"])).Show();

// A different UDF, udf2, that does not reference bv
Func<Column, Column> udf2 = Udf<string, string>(
    str => $"{str}: not referencing broadcast variable");

// Calling udf2 throws the following (unexpected) exception:
// [Error] [JvmBridge] org.apache.spark.SparkException: Task not serializable
df.Select(udf2(df["_1"])).Show();
```

The recommended way to get the desired behavior:

```csharp
string v = "Variable to be broadcast";

// Restricting the visibility of bv to only the UDF that references it
{
    Broadcast<string> bv = SparkContext.Broadcast(v);

    // Using the broadcast variable in a UDF:
    Func<Column, Column> udf1 = Udf<string, string>(
        str => $"{str}: {bv.Value()}");

    // Destroying bv
    bv.Destroy();
}

// A different UDF, udf2, that does not reference bv
Func<Column, Column> udf2 = Udf<string, string>(
    str => $"{str}: not referencing broadcast variable");

// Calling udf2 works as expected
df.Select(udf2(df["_1"])).Show();
```

This ensures that destroying `bv` cannot affect calls to `udf2` through unintended serialization.
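The scoping fix works because UDF serialization has to ship everything captured by the delegate's enclosing closure, not just what the UDF body mentions — and the C# compiler hoists all captured locals of a scope into one shared closure object. As a language-neutral illustration of closure capture (Python is used purely for the sketch; this is not the .NET for Apache Spark API), note that a function which never references `bv` carries no reference to it once `bv` lives in its own scope:

```python
# Illustrative sketch of closure capture (not the .NET for Apache Spark API).

def make_udfs():
    bv = {"data": "broadcast payload"}  # stand-in for a Broadcast variable

    udf1 = lambda s: f"{s}: {bv['data']}"                        # references bv
    udf2 = lambda s: f"{s}: not referencing broadcast variable"  # does not

    return udf1, udf2

udf1, udf2 = make_udfs()

# udf1 holds a closure cell keeping bv alive; udf2 holds none,
# so serializing udf2 would never touch bv.
print(udf1.__closure__ is not None)  # True
print(udf2.__closure__)              # None
```

Unlike Python, which captures free variables per function, C# can capture `bv` into `udf2`'s shared closure state merely because both lambdas see the same scope — which is exactly why the block-scoping pattern above is needed.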
Broadcast variables are useful for sending read-only data to all executors: the data is shipped only once, rather than with every task as a captured local variable would be. See the [official documentation](https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html#broadcast-variables) for a deeper discussion of broadcast variables and when to use them.

docs/building/ubuntu-instructions.md (+13 −13)

````diff
@@ -35,14 +35,14 @@ If you already have all the pre-requisites, skip to the [build](ubuntu-instructi
    ```bash
    sudo update-alternatives --config java
    ```
-3. Install **[Apache Maven 3.6.0+](https://maven.apache.org/download.cgi)**
+3. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)**
    - Run the following command:
    ```bash
    mkdir -p ~/bin/maven
    cd ~/bin/maven
-   wget https://www-us.apache.org/dist/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.tar.gz
-   tar -xvzf apache-maven-3.6.0-bin.tar.gz
-   ln -s apache-maven-3.6.0 current
+   wget https://www-us.apache.org/dist/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.tar.gz
+   tar -xvzf apache-maven-3.6.3-bin.tar.gz
+   ln -s apache-maven-3.6.3 current
    export M2_HOME=~/bin/maven/current
    export PATH=${M2_HOME}/bin:${PATH}
    source ~/.bashrc
@@ -54,11 +54,11 @@ If you already have all the pre-requisites, skip to the [build](ubuntu-instructi
    <summary>&#x1F4D9; Click to see sample mvn -version output</summary>

    ```
-   Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 2018-10-24T18:41:47Z)
-   Maven home: ~/bin/apache-maven-3.6.0
-   Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
-   Default locale: en, platform encoding: UTF-8
-   OS name: "linux", version: "4.4.0-17763-microsoft", arch: "amd64", family: "unix"
+   Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
+   Maven home: ~/bin/apache-maven-3.6.3
+   Java version: 1.8.0_242, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
+   Default locale: en_US, platform encoding: ANSI_X3.4-1968
+   OS name: "linux", version: "4.4.0-142-generic", arch: "amd64", family: "unix"
    ```
 4. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)**
    - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `~/bin/spark-2.3.2-bin-hadoop2.7`)
@@ -185,15 +185,15 @@ Once you build the samples, you can use `spark-submit` to submit your .NET Core
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
-   Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
+   ./Microsoft.Spark.CSharp.Examples Sql.Batch.Basic $SPARK_HOME/examples/src/main/resources/people.json
    ```
 - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredNetworkWordCount](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredNetworkWordCount.cs)**
    ```bash
    spark-submit \
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
-   Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
+   ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredNetworkWordCount localhost 9999
    ```
 - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (maven accessible)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)**
    ```bash
@@ -202,7 +202,7 @@ Once you build the samples, you can use `spark-submit` to submit your .NET Core
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
-   Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
+   ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
    ```
 - **[Microsoft.Spark.Examples.Sql.Streaming.StructuredKafkaWordCount (jars provided)](../../examples/Microsoft.Spark.CSharp.Examples/Sql/Streaming/StructuredKafkaWordCount.cs)**
    ```bash
@@ -211,7 +211,7 @@ Once you build the samples, you can use `spark-submit` to submit your .NET Core
    --class org.apache.spark.deploy.dotnet.DotnetRunner \
    --master local \
    ~/dotnet.spark/src/scala/microsoft-spark-<version>/target/microsoft-spark-<version>.jar \
-   Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
+   ./Microsoft.Spark.CSharp.Examples Sql.Streaming.StructuredKafkaWordCount localhost:9092 subscribe test
    ```

 Feel this experience is complicated? Help us by taking up [Simplify User Experience for Running an App](https://github.com/dotnet/spark/issues/6)
````

docs/building/windows-instructions.md (+4 −4)

```diff
@@ -30,10 +30,10 @@ If you already have all the pre-requisites, skip to the [build](windows-instruct
 3. Install **[Java 1.8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)**
    - Select the appropriate version for your operating system e.g., jdk-8u201-windows-x64.exe for Win x64 machine.
    - Install using the installer and verify you are able to run `java` from your command-line
-4. Install **[Apache Maven 3.6.0+](https://maven.apache.org/download.cgi)**
-   - Download [Apache Maven 3.6.0](http://mirror.metrocast.net/apache/maven/maven-3/3.6.0/binaries/apache-maven-3.6.0-bin.zip)
-   - Extract to a local directory e.g., `c:\bin\apache-maven-3.6.0\`
-   - Add Apache Maven to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\apache-maven-3.6.0\bin`
+4. Install **[Apache Maven 3.6.3+](https://maven.apache.org/download.cgi)**
+   - Download [Apache Maven 3.6.3](http://mirror.metrocast.net/apache/maven/maven-3/3.6.3/binaries/apache-maven-3.6.3-bin.zip)
+   - Extract to a local directory e.g., `c:\bin\apache-maven-3.6.3\`
+   - Add Apache Maven to your [PATH environment variable](https://www.java.com/en/download/help/path.xml) e.g., `c:\bin\apache-maven-3.6.3\bin`
    - Verify you are able to run `mvn` from your command-line
 5. Install **[Apache Spark 2.3+](https://spark.apache.org/downloads.html)**
    - Download [Apache Spark 2.3+](https://spark.apache.org/downloads.html) and extract it into a local folder (e.g., `c:\bin\spark-2.3.2-bin-hadoop2.7\`) using [7-zip](https://www.7-zip.org/).
```
(new file, +110)

# .NET for Apache Spark 0.12.1 Release Notes

### New Features/Improvements

* Expose `JvmException` to capture JVM error messages separately ([#566](https://github.com/dotnet/spark/pull/566))

### Bug Fixes

* AssemblyLoader should use absolute assembly path when loading assemblies ([#570](https://github.com/dotnet/spark/pull/570))

### Infrastructure / Documentation / Etc.

* None

### Breaking Changes

* None

### Known Issues

* Broadcast variables do not work with [dotnet-interactive](https://github.com/dotnet/interactive) ([#561](https://github.com/dotnet/spark/pull/561))

### Compatibility

#### Backward compatibility

The following table describes the oldest version of the worker that the current version is compatible with, along with new features that are incompatible with older workers.

<table>
    <thead>
        <tr>
            <th>Oldest compatible Microsoft.Spark.Worker version</th>
            <th>Incompatible features</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td rowspan=4>v0.9.0</td>
            <td>DataFrame with Grouped Map UDF <a href="https://github.com/dotnet/spark/pull/277">(#277)</a></td>
        </tr>
        <tr>
            <td>DataFrame with Vector UDF <a href="https://github.com/dotnet/spark/pull/277">(#277)</a></td>
        </tr>
        <tr>
            <td>Support for Broadcast Variables <a href="https://github.com/dotnet/spark/pull/414">(#414)</a></td>
        </tr>
        <tr>
            <td>Support for TimestampType <a href="https://github.com/dotnet/spark/pull/428">(#428)</a></td>
        </tr>
    </tbody>
</table>

#### Forward compatibility

The following table describes the oldest .NET for Apache Spark release that the current worker is compatible with.

<table>
    <thead>
        <tr>
            <th>Oldest compatible .NET for Apache Spark release version</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td>v0.9.0</td>
        </tr>
    </tbody>
</table>

### Supported Spark Versions

The following table outlines the supported Spark versions along with the microsoft-spark JAR to use:

<table>
    <thead>
        <tr>
            <th>Spark Version</th>
            <th>microsoft-spark JAR</th>
        </tr>
    </thead>
    <tbody align="center">
        <tr>
            <td>2.3.*</td>
            <td>microsoft-spark-2.3.x-0.12.1.jar</td>
        </tr>
        <tr>
            <td>2.4.0</td>
            <td rowspan=6>microsoft-spark-2.4.x-0.12.1.jar</td>
        </tr>
        <tr>
            <td>2.4.1</td>
        </tr>
        <tr>
            <td>2.4.3</td>
        </tr>
        <tr>
            <td>2.4.4</td>
        </tr>
        <tr>
            <td>2.4.5</td>
        </tr>
        <tr>
            <td>2.4.6</td>
        </tr>
        <tr>
            <td>2.4.2</td>
            <td><a href="https://github.com/dotnet/spark/issues/60">Not supported</a></td>
        </tr>
    </tbody>
</table>
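The compatibility tables above reduce to a minimum-version rule: releases and workers from v0.9.0 upward are mutually compatible. A hypothetical helper sketching that check — the cutoff comes from the tables, but the function itself is illustrative, not a real API:

```python
def is_compatible(version: str, oldest_supported: str = "0.9.0") -> bool:
    """True when a dotted version (optionally 'v'-prefixed) is at or above the cutoff."""
    # Compare numerically so that 0.12.x correctly sorts above 0.9.x.
    parse = lambda v: tuple(int(part) for part in v.lstrip("v").split("."))
    return parse(version) >= parse(oldest_supported)

print(is_compatible("v0.12.1"))  # True  (this release)
print(is_compatible("v0.8.0"))   # False (older than the v0.9.0 cutoff)
```

Numeric tuple comparison matters here: a plain string comparison would rank "0.9.0" above "0.12.1".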
