|
20 | 20 | Concepts
|
21 | 21 | ========
|
22 | 22 |
|
23 |
| -In this section, we will cover a basic example to introduce a few key concepts. |
| 23 | +In this section, we will cover a basic example to introduce a few key concepts. We will use the same |
| 24 | +source file as described in the :ref:`Introduction <guide>`, the Pokemon data set. |
24 | 25 |
|
25 |
| -.. code-block:: python |
| 26 | +.. ipython:: python |
26 | 27 |
|
27 |
| - import datafusion |
28 |
| - from datafusion import col |
29 |
| - import pyarrow |
| 28 | + from datafusion import SessionContext, col, lit, functions as f |
30 | 29 |
|
31 |
| - # create a context |
32 |
| - ctx = datafusion.SessionContext() |
| 30 | + ctx = SessionContext() |
33 | 31 |
|
34 |
| - # create a RecordBatch and a new DataFrame from it |
35 |
| - batch = pyarrow.RecordBatch.from_arrays( |
36 |
| - [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])], |
37 |
| - names=["a", "b"], |
38 |
| - ) |
39 |
| - df = ctx.create_dataframe([[batch]]) |
| 32 | + df = ctx.read_parquet("yellow_tripdata_2021-01.parquet") |
40 | 33 |
|
41 |
| - # create a new statement |
42 | 34 | df = df.select(
|
43 |
| - col("a") + col("b"), |
44 |
| - col("a") - col("b"), |
| 35 | + "trip_distance", |
| 36 | + col("total_amount").alias("total"), |
| 37 | + (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"), |
45 | 38 | )
|
46 | 39 |
|
47 |
| - # execute and collect the first (and only) batch |
48 |
| - result = df.collect()[0] |
| 40 | + df.show() |
49 | 41 |
|
50 |
| -The first statement group: |
| 42 | +Session Context |
| 43 | +--------------- |
| 44 | + |
| 45 | +The first statement group creates a :py:class:`~datafusion.context.SessionContext`. |
51 | 46 |
|
52 | 47 | .. code-block:: python
|
53 | 48 |
|
54 | 49 | # create a context
|
55 | 50 | ctx = datafusion.SessionContext()
|
56 | 51 |
|
57 |
| -creates a :py:class:`~datafusion.context.SessionContext`, that is, the main interface for executing queries with DataFusion. It maintains the state |
58 |
| -of the connection between a user and an instance of the DataFusion engine. Additionally it provides the following functionality: |
| 52 | +A Session Context is the main interface for executing queries with DataFusion. It maintains the state |
| 53 | +of the connection between a user and an instance of the DataFusion engine. Additionally, it provides |
| 54 | +the following functionality: |
59 | 55 |
|
60 |
| -- Create a DataFrame from a CSV or Parquet data source. |
61 |
| -- Register a CSV or Parquet data source as a table that can be referenced from a SQL query. |
62 |
| -- Register a custom data source that can be referenced from a SQL query. |
| 56 | +- Create a DataFrame from a data source. |
| 57 | +- Register a data source as a table that can be referenced from a SQL query. |
63 | 58 | - Execute a SQL query.
|
64 | 59 |
|
| 60 | +DataFrame |
| 61 | +--------- |
| 62 | + |
65 | 63 | The second statement group creates a :code:`DataFrame`.
|
66 | 64 |
|
67 | 65 | .. code-block:: python
|
68 | 66 |
|
69 |
| - # create a RecordBatch and a new DataFrame from it |
70 |
| - batch = pyarrow.RecordBatch.from_arrays( |
71 |
| - [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])], |
72 |
| - names=["a", "b"], |
73 |
| - ) |
74 |
| - df = ctx.create_dataframe([[batch]]) |
| 67 | + # Create a DataFrame from a file |
| 68 | + df = ctx.read_parquet("yellow_tripdata_2021-01.parquet") |
75 | 69 |
|
76 | 70 | A DataFrame refers to a (logical) set of rows that share the same column names, similar to a `Pandas DataFrame <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html>`_.
|
77 | 71 | DataFrames are typically created by calling a method on :py:class:`~datafusion.context.SessionContext`, such as :code:`read_csv`, and can then be modified by
|
78 | 72 | calling the transformation methods, such as :py:func:`~datafusion.dataframe.DataFrame.filter`, :py:func:`~datafusion.dataframe.DataFrame.select`, :py:func:`~datafusion.dataframe.DataFrame.aggregate`,
|
79 | 73 | and :py:func:`~datafusion.dataframe.DataFrame.limit` to build up a query definition.
|
80 | 74 |
|
81 |
| -The third statement uses :code:`Expressions` to build up a query definition. |
| 75 | +Expressions |
| 76 | +----------- |
| 77 | + |
| 78 | +The third statement uses :code:`Expressions` to build up a query definition. You can find |
| 79 | +explanations for what the functions below do in the user documentation for |
| 80 | +:py:func:`~datafusion.col`, :py:func:`~datafusion.lit`, :py:func:`~datafusion.functions.round`, |
| 81 | +and :py:func:`~datafusion.expr.Expr.alias`. |
82 | 82 |
|
83 | 83 | .. code-block:: python
|
84 | 84 |
|
85 | 85 | df = df.select(
|
86 |
| - col("a") + col("b"), |
87 |
| - col("a") - col("b"), |
| 86 | + "trip_distance", |
| 87 | + col("total_amount").alias("total"), |
| 88 | + (f.round(lit(100.0) * col("tip_amount") / col("total_amount"), lit(1))).alias("tip_percent"), |
88 | 89 | )
|
89 | 90 |
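To make the arithmetic behind :code:`tip_percent` concrete, here is a plain-Python sketch of the same formula applied to two made-up rows. The fare values are hypothetical and not taken from the data file, and Python's built-in :code:`round` stands in for :py:func:`~datafusion.functions.round`, so behavior at exact rounding ties may differ.

```python
# Plain-Python sketch of the tip_percent expression; the rows are invented
# for illustration and do not come from the Parquet file.
def tip_percent(tip_amount: float, total_amount: float) -> float:
    """Return the tip as a percentage of the total, rounded to one decimal."""
    return round(100.0 * tip_amount / total_amount, 1)


rows = [
    {"tip_amount": 2.0, "total_amount": 12.5},
    {"tip_amount": 1.0, "total_amount": 7.0},
]

# Apply the formula row by row, as the select() expression does per record.
percents = [tip_percent(r["tip_amount"], r["total_amount"]) for r in rows]
print(percents)
```

In the DataFrame version, :code:`lit(100.0)` and :code:`lit(1)` play the role of the Python constants, and :code:`col(...)` plays the role of the per-row dictionary lookups.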
|
90 |
| -Finally the :py:func:`~datafusion.dataframe.DataFrame.collect` method converts the logical plan represented by the DataFrame into a physical plan and execute it, |
91 |
| -collecting all results into a list of `RecordBatch <https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html>`_. |
| 91 | +Finally, the :py:func:`~datafusion.dataframe.DataFrame.show` method converts the logical plan |
| 92 | +represented by the DataFrame into a physical plan and executes it, collecting all results and |
| 93 | +displaying them to the user. It is important to note that DataFusion performs lazy evaluation |
| 94 | +of the DataFrame. Until you call a method such as :py:func:`~datafusion.dataframe.DataFrame.show` |
| 95 | +or :py:func:`~datafusion.dataframe.DataFrame.collect`, DataFusion will not perform the query. |
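The idea of lazy evaluation can be illustrated with a small toy model in plain Python. This is only a sketch of the concept, not how DataFusion is implemented: the :code:`LazyFrame` class and the :code:`load_rows` helper are hypothetical names invented for this example. Transformations only record a step, and the source is read once a terminal method is called.

```python
# Toy model of lazy evaluation. LazyFrame and load_rows are hypothetical
# names for this sketch; this is not DataFusion's implementation.
class LazyFrame:
    def __init__(self, source, steps=()):
        self._source = source        # callable that produces the rows
        self._steps = tuple(steps)   # recorded transformations, not yet run

    def select(self, *names):
        # Record a projection and return a new plan; no data is read here.
        def project(rows):
            return [{n: row[n] for n in names} for row in rows]
        return LazyFrame(self._source, self._steps + (project,))

    def limit(self, n):
        # Record a row limit; again, nothing is executed yet.
        def take(rows):
            return rows[:n]
        return LazyFrame(self._source, self._steps + (take,))

    def collect(self):
        # Terminal method: read the source, then run every recorded step.
        rows = self._source()
        for step in self._steps:
            rows = step(rows)
        return rows


reads = []  # records each time the source is actually read


def load_rows():
    reads.append("read")
    return [{"a": 1, "b": 4}, {"a": 2, "b": 5}, {"a": 3, "b": 6}]


plan = LazyFrame(load_rows).select("a").limit(2)
reads_before = len(reads)   # building the plan did not read the source
result = plan.collect()     # the source is read and the steps run here
print(reads_before, len(reads), result)
```

Building the plan is cheap because it only accumulates steps. A real query engine can also inspect the whole plan before running it, which is one reason deferring execution is useful.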