PySpark Microsoft Graph Source

A PySpark DataSource for reading data from the Microsoft Graph API directly into Spark DataFrames. It currently supports SharePoint List Items, with more resources planned.

Features

Entra ID Authentication: Securely authenticate with Microsoft Graph using DefaultAzureCredential, supporting local development and production seamlessly.
Automatic Pagination Handling: Fetches all paginated data from Microsoft Graph without manual intervention.
Dynamic Schema Inference: Automatically detects the schema of the resource by sampling data, so you don't need to define it manually.
Simple Configuration with .option(): Easily configure resources and query parameters directly in your Spark read options, making it flexible and intuitive.
Zero External Ingestion Services: No additional services like Azure Data Factory or Logic Apps are needed. Data is ingested directly into Spark from Microsoft Graph.
Extensible Resource Providers: Add custom resource providers to support more Microsoft Graph endpoints as needed.
Pluggable Architecture: Dynamically load resource providers without modifying core logic.
Optimized for PySpark: Designed to work natively with Spark's DataFrame API for big data processing.
Secure by Design: Credentials and secrets are handled using Azure Identity best practices, avoiding hardcoding sensitive data.

Installation

pip install pyspark-msgraph-source

⚡ Quickstart

1. Authentication

This package authenticates with DefaultAzureCredential from the azure-identity library.
For local development, sign in with the Azure CLI:

az login

Or, to authenticate as a service principal, set these environment variables:

export AZURE_CLIENT_ID=<your-client-id>
export AZURE_TENANT_ID=<your-tenant-id>
export AZURE_CLIENT_SECRET=<your-client-secret>
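
Before starting a Spark job, it can help to confirm that the three variables are visible to the Python process. The helper below is a small sketch using only the standard library. `missing_azure_env_vars` is a hypothetical name, not part of this package, and an empty result says nothing about whether the values themselves are valid.

```python
import os
from typing import List, Mapping

# The three variables set by the export commands above.
REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_azure_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_azure_env_vars()
if missing:
    # Not necessarily an error: az login is the other option described above.
    print("Not set:", ", ".join(missing))
else:
    print("Service principal variables are set.")
```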

2. Example Usage

from pyspark.sql import SparkSession

from pyspark_msgraph_source.core.source import MSGraphDataSource

spark = SparkSession.builder.appName("MSGraphExample").getOrCreate()

# Register the data source so that format("msgraph") can be resolved
spark.dataSource.register(MSGraphDataSource)

# Read SharePoint list items; the schema is inferred from the data
df = (
    spark.read.format("msgraph")
    .option("resource", "list_items")
    .option("site-id", "<YOUR_SITE_ID>")
    .option("list-id", "<YOUR_LIST_ID>")
    .option("top", 100)
    .option("expand", "fields")
    .load()
)

df.show()

# With an explicit schema instead of inference

df = (
    spark.read.format("msgraph")
    .option("resource", "list_items")
    .option("site-id", "<YOUR_SITE_ID>")
    .option("list-id", "<YOUR_LIST_ID>")
    .option("top", 100)
    .option("expand", "fields")
    .schema("id string, Title string")
    .load()
)

df.show()
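
When no schema is passed, the source infers one by sampling data, as listed under Features. The sketch below shows one way inference from a sample might work, producing a DDL string of the same shape as the one passed to `.schema()` above. It is an illustration under simplifying assumptions, not the package's actual inference logic: `spark_type_name` and `infer_ddl` are hypothetical helpers, only flat records are handled, and column names are assumed to need no quoting.

```python
from typing import Any, Dict, Iterable, Set


def spark_type_name(value: Any) -> str:
    """Map a Python value to a Spark SQL type name (simplified)."""
    # bool is tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"  # fallback for str and anything else


def infer_ddl(sample: Iterable[Dict[str, Any]]) -> str:
    """Build a DDL schema string from a sample of flat records."""
    seen: Dict[str, Set[str]] = {}
    for record in sample:
        for name, value in record.items():
            types = seen.setdefault(name, set())
            if value is not None:  # None says nothing about the type
                types.add(spark_type_name(value))
    parts = []
    for name, types in seen.items():
        # Exactly one observed type is kept; none or a conflict becomes string.
        type_name = next(iter(types)) if len(types) == 1 else "string"
        parts.append(f"{name} {type_name}")
    return ", ".join(parts)


sample = [
    {"id": "1", "Title": "First", "Count": 3},
    {"id": "2", "Title": None, "Count": 7},
]
ddl = infer_ddl(sample)
print(ddl)
```

Passing an explicit schema remains the safer choice when a small sample may not be representative of the full list.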

Supported Resources

| Resource | Description |
|---|---|
| `list_items` | SharePoint List Items |
| (more coming soon...) | |

Development

Coming soon...

Troubleshooting

| Issue | Solution |
|---|---|
| `ValueError: resource missing` | Add `.option("resource", "list_items")` |
| Empty dataframe | Verify IDs, permissions, and access |
| Authentication failures | Check Azure credentials and login status |
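
The first two issues can often be caught before calling `.load()`. The sketch below checks a plain dictionary of options for a missing `resource` and for IDs that are absent or still set to a placeholder. `check_options` and `REQUIRED_OPTIONS` are hypothetical and not part of this package. The assumption that `list_items` requires `site-id` and `list-id` is taken from the quickstart example only.

```python
from typing import Dict, List

# Assumed requirements, based on the options used in the quickstart example.
REQUIRED_OPTIONS = {"list_items": ["site-id", "list-id"]}


def check_options(options: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means none were found."""
    resource = options.get("resource")
    if not resource:
        return ['missing "resource" option']
    if resource not in REQUIRED_OPTIONS:
        return [f'unknown resource "{resource}"']
    problems = []
    for key in REQUIRED_OPTIONS[resource]:
        value = options.get(key, "")
        # Values such as <YOUR_SITE_ID> are unreplaced placeholders.
        if not value or value.startswith("<"):
            problems.append(f'option "{key}" is missing or still a placeholder')
    return problems


options = {"resource": "list_items", "site-id": "<YOUR_SITE_ID>", "list-id": "abc123"}
problems = check_options(options)
for problem in problems:
    print(problem)
```

A check like this cannot detect permission problems, so an empty result still leaves the access checks from the table to do.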

📄 License

MIT License
