A PySpark DataSource for reading data from the Microsoft Graph API directly into Spark DataFrames, giving access to resources such as SharePoint List Items.
- Entra ID Authentication: Securely authenticate with Microsoft Graph using DefaultAzureCredential, supporting local development and production seamlessly.
- Automatic Pagination Handling: Fetches all paginated data from Microsoft Graph without manual intervention.
- Dynamic Schema Inference: Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.
- Simple Configuration with .option(): Easily configure resources and query parameters directly in your Spark read options, making it flexible and intuitive.
- Zero External Ingestion Services: No additional services like Azure Data Factory or Logic Apps are needed; data is ingested directly into Spark from Microsoft Graph.
- Extensible Resource Providers: Add custom resource providers to support more Microsoft Graph endpoints as needed.
- Pluggable Architecture: Dynamically load resource providers without modifying core logic.
- Optimized for PySpark: Designed to work natively with Spark's DataFrame API for big data processing.
- Secure by Design: Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoding sensitive data.
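
To picture the automatic pagination listed above: Microsoft Graph returns each page's items under `value` and, while more pages remain, the next page's URL under `@odata.nextLink`. The following is a minimal sketch of that loop, not the package's actual implementation. `iter_graph_items` and `fetch_page` are hypothetical names (the latter stands in for an authenticated HTTP GET), and the `pages` dict is made-up data.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_graph_items(first_url: str, fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every item from a paginated Microsoft Graph collection.

    Graph returns items under "value" and, when more pages exist,
    the URL of the next page under "@odata.nextLink".
    """
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")  # absent on the last page


# Hypothetical in-memory "server" used only to exercise the loop.
pages = {
    "page1": {"value": [{"id": "1"}, {"id": "2"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": "3"}]},
}

items = list(iter_graph_items("page1", pages.__getitem__))
print([item["id"] for item in items])
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client or credential.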
pip install pyspark-msgraph-source
This package uses DefaultAzureCredential.
Ensure you're authenticated:
az login
Or set environment variables:
export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
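
If you use the environment-variable route, it can help to confirm the variables are visible to the Python process before starting Spark. The helper below is a small sketch using only the standard library; `missing_azure_vars` is a hypothetical name and not part of this package. It checks only the three variables shown above and does not validate the credentials themselves.

```python
import os
from typing import List, Mapping

# Variables read by azure-identity's EnvironmentCredential for a
# service principal that authenticates with a client secret.
REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_azure_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_azure_vars()
if missing:
    print("Not set (fine if you rely on `az login`):", ", ".join(missing))
else:
    print("Service principal variables are set.")
```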
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("MSGraphExample") \
.getOrCreate()
from pyspark_msgraph_source.core.source import MSGraphDataSource
spark.dataSource.register(MSGraphDataSource)
df = spark.read.format("msgraph") \
.option("resource", "list_items") \
.option("site-id", "<YOUR_SITE_ID>") \
.option("list-id", "<YOUR_LIST_ID>") \
.option("top", 100) \
.option("expand", "fields") \
.load()
df.show()
# with schema
df = spark.read.format("msgraph") \
.option("resource", "list_items") \
.option("site-id", "<YOUR_SITE_ID>") \
.option("list-id", "<YOUR_LIST_ID>") \
.option("top", 100) \
.option("expand", "fields") \
.schema("id string, Title string") \
.load()
df.show()
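
When `.schema(...)` is omitted, the source infers the schema by sampling data. The sketch below illustrates that idea in plain Python and is not the package's actual logic: `infer_ddl`, the type mapping, and the sample records are all assumptions, and this flat version ignores nested fields that real list items can carry.

```python
from typing import Any, Dict, Iterable

# Assumed mapping from Python value types to Spark SQL type names.
_SPARK_TYPES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_ddl(records: Iterable[Dict[str, Any]]) -> str:
    """Build a DDL schema string from sampled records.

    The first non-null value seen for a column decides its type;
    all-null columns and unmapped value types fall back to string.
    """
    columns: Dict[str, str] = {}
    for record in records:
        for name, value in record.items():
            if value is None:
                columns.setdefault(name, "")  # remember the column, type unknown
                continue
            if not columns.get(name):
                columns[name] = _SPARK_TYPES.get(type(value), "string")
    return ", ".join(f"{name} {dtype or 'string'}" for name, dtype in columns.items())


# Made-up sample rows standing in for the first records of a list.
sample = [
    {"id": "1", "Title": "Alpha", "Score": None},
    {"id": "2", "Title": "Beta", "Score": 4.5},
]
print(infer_ddl(sample))
```

A string in this form is what `.schema(...)` accepts, so supplying it explicitly skips the sampling step.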
Resource | Description |
---|---|
list_items | SharePoint List Items |
(more coming soon...) | |
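
Further resources are meant to be added through resource providers. The sketch below shows one way a provider registry could be organised. It does not reproduce this package's real base class or registration mechanism: `ResourceProvider`, `register_provider`, `get_provider`, `ListItemsProvider`, and the mapping of `top`/`expand` options to query parameters are all hypothetical. The endpoint path and the `$top`/`$expand` parameters are the ones Microsoft Graph defines for list items.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.microsoft.com/v1.0"
_PROVIDERS: Dict[str, Type["ResourceProvider"]] = {}


class ResourceProvider(ABC):
    """Hypothetical base class: turns Spark read options into a request URL."""

    @abstractmethod
    def build_url(self, options: Dict[str, str]) -> str: ...


def register_provider(name: str):
    """Class decorator that makes a provider discoverable by resource name."""
    def decorator(cls: Type[ResourceProvider]) -> Type[ResourceProvider]:
        _PROVIDERS[name] = cls
        return cls
    return decorator


def get_provider(name: str) -> ResourceProvider:
    """Instantiate the provider registered under the given resource name."""
    if name not in _PROVIDERS:
        raise ValueError(f"Unknown resource: {name!r}")
    return _PROVIDERS[name]()


@register_provider("list_items")
class ListItemsProvider(ResourceProvider):
    def build_url(self, options: Dict[str, str]) -> str:
        path = f"/sites/{options['site-id']}/lists/{options['list-id']}/items"
        # Assumed mapping: "top" and "expand" become OData query parameters.
        query = {f"${key}": options[key] for key in ("top", "expand") if key in options}
        suffix = f"?{urlencode(query, safe='$')}" if query else ""
        return f"{GRAPH_BASE}{path}{suffix}"


url = get_provider("list_items").build_url(
    {"site-id": "SITE", "list-id": "LIST", "top": "100", "expand": "fields"}
)
print(url)
```

With a registry like this, supporting a new endpoint means adding one decorated class, and the lookup code stays unchanged.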
Coming soon...
Issue | Solution |
---|---|
ValueError: resource missing | Add .option("resource", "list_items") |
Empty dataframe | Verify IDs, permissions, and access |
Authentication failures | Check Azure credentials and login status |
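
Several of the issues above come down to missing or placeholder options, so a quick check before calling `.load()` can save a round trip. This is a sketch under assumptions: `check_options` and the `REQUIRED_OPTIONS` table are hypothetical helpers, not part of the package, and the required options listed are simply those used in the usage example.

```python
from typing import Dict, Mapping, Tuple

# Assumed required options per resource, taken from the usage example.
REQUIRED_OPTIONS: Dict[str, Tuple[str, ...]] = {
    "list_items": ("site-id", "list-id"),
}


def check_options(options: Mapping[str, str]) -> None:
    """Raise ValueError early if an option is missing or a placeholder is left in."""
    resource = options.get("resource")
    if not resource:
        raise ValueError('resource missing: add .option("resource", "list_items")')
    for name in REQUIRED_OPTIONS.get(resource, ()):
        value = options.get(name, "")
        if not value or value.startswith("<"):
            raise ValueError(f"option {name!r} is missing or still a placeholder")


# Passes silently when every required option has a real value.
check_options({"resource": "list_items", "site-id": "abc", "list-id": "def"})
```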