Commit deb1f25

Documentation updates: simplify examples and add section on data sources (#955)
* Add a simple example to the introduction page to demonstrate loading a dataframe from a csv file and displaying the contents
* Update basics doc to be a little more straightforward
* Move downloading of data files for examples into the build scripts and just point the users to where these files are located instead of adding urllib requests to the python examples so we can focus on what is most important to the user
* Handle a few errors generated by doc site builder
* Switch example so that there is not confusion about the single and double quotes due to capitalization
* Add section on data sources
* Build pipeline doesn't have polars and it isn't really necessary for the example, so switch to a code block instead of ipython directive
1 parent 54e5e0d commit deb1f25

20 files changed, +300 -87 lines changed

.github/workflows/docs.yaml

+2
@@ -75,6 +75,8 @@ jobs:
 set -x
 source venv/bin/activate
 cd docs
+curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
+curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
 make html

 - name: Copy & push the generated HTML

docs/.gitignore

+2
@@ -1,2 +1,4 @@
 pokemon.csv
 yellow_trip_data.parquet
+yellow_tripdata_2021-01.parquet
+

docs/build.sh

+10-1
@@ -19,8 +19,17 @@
 #

 set -e
+
+if [ ! -f pokemon.csv ]; then
+curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv
+fi
+
+if [ ! -f yellow_tripdata_2021-01.parquet ]; then
+curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet
+fi
+
 rm -rf build 2> /dev/null
 rm -rf temp 2> /dev/null
 mkdir temp
 cp -rf source/* temp/
-make SOURCEDIR=`pwd`/temp html
+make SOURCEDIR=`pwd`/temp html

docs/source/index.rst

+6-19
@@ -43,27 +43,13 @@ Example

 .. ipython:: python

-import datafusion
-from datafusion import col
-import pyarrow
+from datafusion import SessionContext

-# create a context
-ctx = datafusion.SessionContext()
+ctx = SessionContext()

-# create a RecordBatch and a new DataFrame from it
-batch = pyarrow.RecordBatch.from_arrays(
-[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
-names=["a", "b"],
-)
-df = ctx.create_dataframe([[batch]], name="batch_array")
+df = ctx.read_csv("pokemon.csv")

-# create a new statement
-df = df.select(
-col("a") + col("b"),
-col("a") - col("b"),
-)
-
-df
+df.show()


 .. _toc.links:
@@ -85,9 +71,10 @@ Example

 user-guide/introduction
 user-guide/basics
-user-guide/configuration
+user-guide/data-sources
 user-guide/common-operations/index
 user-guide/io/index
+user-guide/configuration
 user-guide/sql


docs/source/user-guide/basics.rst

+39-35
@@ -20,72 +20,76 @@
 Concepts
 ========

-In this section, we will cover a basic example to introduce a few key concepts.
+In this section, we will cover a basic example to introduce a few key concepts. We will use the same
+source file as described in the :ref:`Introduction <guide>`, the Pokemon data set.

-.. code-block:: python
+.. ipython:: python

-import datafusion
-from datafusion import col
-import pyarrow
+from datafusion import SessionContext, col, lit, functions as f

-# create a context
-ctx = datafusion.SessionContext()
+ctx = SessionContext()

-# create a RecordBatch and a new DataFrame from it
-batch = pyarrow.RecordBatch.from_arrays(
-[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
-names=["a", "b"],
-)
-df = ctx.create_dataframe([[batch]])
+df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

-# create a new statement
 df = df.select(
-col("a") + col("b"),
-col("a") - col("b"),
+"trip_distance",
+col("total_amount").alias("total"),
+(f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
 )

-# execute and collect the first (and only) batch
-result = df.collect()[0]
+df.show()

-The first statement group:
+Session Context
+---------------
+
+The first statement group creates a :py:class:`~datafusion.context.SessionContext`.

 .. code-block:: python

 # create a context
 ctx = datafusion.SessionContext()

-creates a :py:class:`~datafusion.context.SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state
-of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality:
+A Session Context is the main interface for executing queries with DataFusion. It maintains the state
+of the connection between a user and an instance of the DataFusion engine. Additionally it provides
+the following functionality:

-- Create a DataFrame from a CSV or Parquet data source.
-- Register a CSV or Parquet data source as a table that can be referenced from a SQL query.
-- Register a custom data source that can be referenced from a SQL query.
+- Create a DataFrame from a data source.
+- Register a data source as a table that can be referenced from a SQL query.
 - Execute a SQL query

+DataFrame
+---------
+
 The second statement group creates a :code:`DataFrame`,

 .. code-block:: python

-# create a RecordBatch and a new DataFrame from it
-batch = pyarrow.RecordBatch.from_arrays(
-[pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
-names=["a", "b"],
-)
-df = ctx.create_dataframe([[batch]])
+# Create a DataFrame from a file
+df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")

 A DataFrame refers to a (logical) set of rows that share the same column names, similar to a `Pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_.
 DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as :code:`read_csv`, and can then be modified by
 calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
 and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.

-The third statement uses :code:`Expressions` to build up a query definition.
+Expressions
+-----------
+
+The third statement uses :code:`Expressions` to build up a query definition. You can find
+explanations for what the functions below do in the user documentation for
+:py:func:`~datafusion.col`, :py:func:`~datafusion.lit`, :py:func:`~datafusion.functions.round`,
+and :py:func:`~datafusion.expr.Expr.alias`.

 .. code-block:: python

 df = df.select(
-col("a") + col("b"),
-col("a") - col("b"),
+"trip_distance",
+col("total_amount").alias("total"),
+(f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"),
 )

-Finally the :py:func:`~datafusion.dataframe.DataFrame.collect` method converts the logical plan represented by the DataFrame into a physical plan and execute it,
-collecting all results into a list of `RecordBatch <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_.
+Finally the :py:func:`~datafusion.dataframe.DataFrame.show` method converts the logical plan
+represented by the DataFrame into a physical plan and execute it, collecting all results and
+displaying them to the user. It is important to note that DataFusion performs lazy evaluation
+of the DataFrame. Until you call a method such as :py:func:`~datafusion.dataframe.DataFrame.show`
+or :py:func:`~datafusion.dataframe.DataFrame.collect`, DataFusion will not perform the query.
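
Note on the ``tip_percent`` expression added in basics.rst: the arithmetic it encodes can be checked by hand. Below is a minimal plain-Python sketch of the same formula, ``round(100.0 * tip_amount / total_amount, 1)``. The helper name ``tip_percent`` and the sample rows are hypothetical and are not taken from the trip data set. Python's built-in ``round`` resolves exact ties with round-half-to-even, which may differ from the query engine, so the sample values avoid ties.

```python
# Plain-Python sketch of the tip_percent arithmetic from the new example.
# The function name and the sample rows are hypothetical, for illustration only.
def tip_percent(tip_amount: float, total_amount: float) -> float:
    """Tip as a percentage of the total amount, rounded to one decimal place."""
    return round(100.0 * tip_amount / total_amount, 1)


# Hypothetical (tip_amount, total_amount) pairs, not read from the parquet file.
sample_rows = [(2.0, 12.5), (3.0, 20.0), (1.0, 7.0)]
percents = [tip_percent(tip, total) for tip, total in sample_rows]
print(percents)  # [16.0, 15.0, 14.3]
```

The sketch assumes a non-zero ``total_amount`` in every row. The documentation example does not guard against zero totals either.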

docs/source/user-guide/common-operations/aggregations.rst

+1-9
@@ -26,15 +26,7 @@ to form a single summary value. For performing an aggregation, DataFusion provid

 .. ipython:: python

-import urllib.request
-from datafusion import SessionContext
-from datafusion import col, lit
-from datafusion import functions as f
-
-urllib.request.urlretrieve(
-"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
-"pokemon.csv",
-)
+from datafusion import SessionContext, col, lit, functions as f

 ctx = SessionContext()
 df = ctx.read_csv("pokemon.csv")

docs/source/user-guide/common-operations/functions.rst

-6
@@ -25,14 +25,8 @@ We'll use the pokemon dataset in the following examples.

 .. ipython:: python

-import urllib.request
 from datafusion import SessionContext

-urllib.request.urlretrieve(
-"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
-"pokemon.csv",
-)
-
 ctx = SessionContext()
 ctx.register_csv("pokemon", "pokemon.csv")
 df = ctx.table("pokemon")

docs/source/user-guide/common-operations/index.rst

+2
@@ -18,6 +18,8 @@
 Common Operations
 =================

+The contents of this section are designed to guide a new user through how to use DataFusion.
+
 .. toctree::
 :maxdepth: 2

docs/source/user-guide/common-operations/select-and-filter.rst

+4-7
@@ -21,18 +21,15 @@ Column Selections
 Use :py:func:`~datafusion.dataframe.DataFrame.select` for basic column selection.

 DataFusion can work with several file types, to start simple we can use a subset of the
-`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_
+`TLC Trip Record Data <https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page>`_,
+which you can download `here <https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet>`_.

 .. ipython:: python
-
-import urllib.request
-from datafusion import SessionContext

-urllib.request.urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
-"yellow_trip_data.parquet")
+from datafusion import SessionContext

 ctx = SessionContext()
-df = ctx.read_parquet("yellow_trip_data.parquet")
+df = ctx.read_parquet("yellow_tripdata_2021-01.parquet")
 df.select("trip_distance", "passenger_count")

 For mathematical or logical operations use :py:func:`~datafusion.col` to select columns, and give meaningful names to the resulting

docs/source/user-guide/common-operations/windows.rst

-6
@@ -30,16 +30,10 @@ We'll use the pokemon dataset (from Ritchie Vink) in the following examples.

 .. ipython:: python

-import urllib.request
 from datafusion import SessionContext
 from datafusion import col
 from datafusion import functions as f

-urllib.request.urlretrieve(
-"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv",
-"pokemon.csv",
-)
-
 ctx = SessionContext()
 df = ctx.read_csv("pokemon.csv")
