This is an experimental Swift library that shows how to connect to a remote Apache Spark Connect Server and run SQL statements to manipulate remote data.
So far, this project tracks upstream changes such as the Apache Spark 4.0.0 RC4 release and the Apache Arrow project's Swift support:
- Apache Spark 4.0.0 RC4 (April 2025)
- Swift 6.0 (2024) or 6.1 (2025)
- gRPC Swift 2.1 (March 2025)
- gRPC Swift Protobuf 1.2 (April 2025)
- gRPC Swift NIO Transport 1.0 (March 2025)
- FlatBuffers v25.2.10 (February 2025)
- Apache Arrow Swift
Create a Swift project.
mkdir SparkConnectSwiftApp
cd SparkConnectSwiftApp
swift package init --name SparkConnectSwiftApp --type executable
Add the SparkConnect package to the dependencies like the following.
$ cat Package.swift
import PackageDescription

let package = Package(
  name: "SparkConnectSwiftApp",
  platforms: [
    .macOS(.v15)
  ],
  dependencies: [
    .package(url: "https://github.com/apache/spark-connect-swift.git", branch: "main")
  ],
  targets: [
    .executableTarget(
      name: "SparkConnectSwiftApp",
      dependencies: [.product(name: "SparkConnect", package: "spark-connect-swift")]
    )
  ]
)
Use the SparkSession of the SparkConnect module in Swift.
$ cat Sources/main.swift
import SparkConnect

let spark = try await SparkSession.builder.getOrCreate()
print("Connected to Apache Spark \(await spark.version) Server")

let statements = [
  "DROP TABLE IF EXISTS t",
  "CREATE TABLE IF NOT EXISTS t(a INT) USING ORC",
  "INSERT INTO t VALUES (1), (2), (3)",
]

for s in statements {
  print("EXECUTE: \(s)")
  _ = try await spark.sql(s).count()
}

print("SELECT * FROM t")
try await spark.sql("SELECT * FROM t").cache().show()

try await spark.range(10).filter("id % 2 == 0").write.mode("overwrite").orc("/tmp/orc")
try await spark.read.orc("/tmp/orc").show()

await spark.stop()
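Because every Spark Connect call is asynchronous and can throw, it is worth wrapping the session lifecycle in do-catch so that connection or SQL failures are reported instead of terminating the process with an unhandled error. A minimal sketch, using only the APIs shown above:

```swift
import SparkConnect

// A sketch, not the library's prescribed pattern: wrap the whole
// session lifecycle so any thrown error is caught and printed.
do {
  let spark = try await SparkSession.builder.getOrCreate()
  print("Connected to Apache Spark \(await spark.version) Server")
  try await spark.sql("SELECT 1").show()
  await spark.stop()
} catch {
  // Reached when the server is unreachable or a statement fails.
  print("Spark Connect failed: \(error)")
}
```

Top-level `await` like this works in an executable target's main.swift with Swift 5.5 or later, so it drops into the example above unchanged.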
Run your Swift application.
$ swift run
...
Connected to Apache Spark 4.0.0 Server
EXECUTE: DROP TABLE IF EXISTS t
EXECUTE: CREATE TABLE IF NOT EXISTS t(a INT) USING ORC
EXECUTE: INSERT INTO t VALUES (1), (2), (3)
SELECT * FROM t
+---+
| a |
+---+
| 2 |
| 1 |
| 3 |
+---+
+----+
| id |
+----+
| 2 |
| 6 |
| 0 |
| 8 |
| 4 |
+----+
You can find this example in the following repository.
This project also provides a Spark SQL REPL. You can run it directly from this repository.
$ swift run
...
Build of product 'SparkSQLRepl' complete! (2.33s)
Connected to Apache Spark 4.0.0 Server
spark-sql (default)> SHOW DATABASES;
+---------+
|namespace|
+---------+
| default|
+---------+
Time taken: 30 ms
spark-sql (default)> CREATE DATABASE db1;
++
||
++
++
Time taken: 31 ms
spark-sql (default)> USE db1;
++
||
++
++
Time taken: 27 ms
spark-sql (db1)> CREATE TABLE t1 AS SELECT * FROM RANGE(10);
++
||
++
++
Time taken: 99 ms
spark-sql (db1)> SELECT * FROM t1;
+---+
| id|
+---+
| 1|
| 5|
| 3|
| 0|
| 6|
| 9|
| 4|
| 8|
| 7|
| 2|
+---+
Time taken: 80 ms
spark-sql (db1)> USE default;
++
||
++
++
Time taken: 26 ms
spark-sql (default)> DROP DATABASE db1 CASCADE;
++
||
++
++
spark-sql (default)> exit;
You can use the SPARK_REMOTE environment variable to specify a Spark Connect connection string with more options.
SPARK_REMOTE=sc://localhost swift run
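When the port is omitted, as above, the Spark Connect default of 15002 is used. The Spark Connect client connection string also accepts parameters appended after a trailing slash, separated by semicolons. A sketch, assuming a server at a hypothetical host `myhost.example.com` that requires a bearer token:

```shell
# Explicit port (15002 is the Spark Connect default).
SPARK_REMOTE=sc://localhost:15002 swift run

# Connection parameters follow a trailing slash, separated by
# semicolons; 'token' is one parameter defined by the Spark Connect
# client connection-string format. The host and token value here
# are placeholders.
SPARK_REMOTE='sc://myhost.example.com:15002/;token=ABCDEFG' swift run
```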