Feature: Support Arrow Flight SQL protocol #9832

Closed
kesavkolla opened this issue Feb 1, 2023 · 17 comments
Assignees
Labels
C-feature Category: feature

Comments

@kesavkolla

Summary

Databend currently supports the MySQL protocol. As an alternative, Databend should also support the Arrow Flight SQL protocol.

Databend targets data warehouse and lakehouse use cases, where data volumes are high. When a client queries Databend for data, returning the results in the Arrow data format would perform better. A lakehouse typically stores data in Parquet files, so with the MySQL protocol Databend has to deserialize Parquet into Arrow and then convert Arrow into MySQL data types. On the caller's end, people use data frames or MySQL result iterators, which requires yet another type conversion. With Arrow Flight SQL, all of these serialization costs can be avoided: Databend converts Parquet to Arrow, runs its query operations, and then sends the Arrow data directly as the result. Clients can take that Arrow data and even pass it directly all the way to the visualization layers.
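To make the cost argument above concrete, here is a minimal sketch in plain Python. It is a toy model only: ordinary lists stand in for Arrow columns and MySQL rows, and every function name is hypothetical rather than part of Databend or Arrow. It shows that a row-oriented protocol forces two pivots of the data (columns to rows on the server, rows back to columns on the client), while a columnar protocol can hand the columns over unchanged.

```python
# Toy model only: plain Python lists stand in for Arrow columns and MySQL rows.
from typing import Dict, List, Tuple

Columns = Dict[str, List[object]]


def columns_to_rows(columns: Columns) -> List[Tuple[object, ...]]:
    """Pivot columnar data into row tuples, as a row-oriented protocol requires."""
    return list(zip(*(columns[name] for name in columns)))


def rows_to_columns(names: List[str], rows: List[Tuple[object, ...]]) -> Columns:
    """Pivot row tuples back into columns, as a data-frame client would."""
    if not rows:
        return {name: [] for name in names}
    return {name: list(values) for name, values in zip(names, zip(*rows))}


def row_protocol_round_trip(columns: Columns) -> Columns:
    """Server pivots to rows, client pivots back: two full copies of the data."""
    rows = columns_to_rows(columns)
    return rows_to_columns(list(columns), rows)


def columnar_protocol_round_trip(columns: Columns) -> Columns:
    """Columnar transfer: the client receives the columns as they are."""
    return columns


result = {"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]}
via_rows = row_protocol_round_trip(result)
via_columns = columnar_protocol_round_trip(result)
```

Both paths deliver the same values, but only the row path rebuilds every column. In a real system the extra work would also include type conversion and wire encoding, which this sketch does not model.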

@kesavkolla kesavkolla added the C-feature Category: feature label Feb 1, 2023
@sundy-li sundy-li added the good first issue Category: good first issue label Feb 17, 2023
@sundy-li
Member

It's a community feature now; anyone interested is welcome to take this task.

@johnhaxx7
Contributor

Hi @sundy-li, can you please share some general ideas on how to get started on this?

@sundy-li
Member

Paper related to this feature: https://www.vldb.org/pvldb/vol10/p1022-muehleisen.pdf

@johnhaxx7
Contributor

Thanks for the info! I'll take a closer look.

@xinlifoobar
Contributor

I'm also interested in this feature. Is any hands-on work planned yet that I could help with?

@sundy-li
Member

sundy-li commented Mar 8, 2023

We already have a Flight protocol implementation that can be used to communicate with other query nodes in the cluster.

https://github.com/datafuselabs/databend/blob/23281a29ba0a89f9428e11fc4ccf3b0b83ec5a9e/src/query/service/src/api/rpc_service.rs

But it's for internal use. We can build a similar handler based on Flight.

@sundy-li
Member

sundy-li commented Mar 9, 2023

A C++ implementation example: Flight on DuckDB/SQLite.

https://github.com/voltrondata/flight-duckdb-example

@xinlifoobar
Contributor

Thanks for providing those. I just took a closer look at the differences between the Arrow Flight protocols, and it seems the Databend Flight RPC service would be a really strong start!

Just 2 follow-ups during the investigations:

@sundy-li
Member

There is some code that already covers the protocol details.

api/rpc: this is for internal use in cluster queries (server to server).

But within this issue, we are going to support the client-to-server protocol.

@xinlifoobar
Contributor

There is some code that already covers the protocol details.

api/rpc: this is for internal use in cluster queries (server to server).

But within this issue, we are going to support the client-to-server protocol.

Sorry, I didn't quite get it, because the RPC listener is already on.

https://github.com/datafuselabs/databend/blob/main/src/binaries/query/main.rs#L196-L203

So finishing this issue just involves:

  1. Exposing the RPC listener in the Console UI, which should be done by adding a new service here: https://github.com/datafuselabs/databend/tree/c1d824f0824664a539fefca3f41fafe941bb2f01/src/query/service/src/servers.
  2. Testing for incompatibilities between the server and a popular Arrow Flight driver such as JDBC or ODBC?

@youngsofun youngsofun self-assigned this Mar 15, 2023
@youngsofun
Member

I am working on it. If all goes well, the first version that works with JDBC will be available by next weekend.

@sundy-li sundy-li removed the good first issue Category: good first issue label Mar 17, 2023
@kesavkolla
Author

Awesome. I'm curious why anyone would want to use JDBC with Flight. The whole point of a Flight server is that we can get Arrow data directly in columnar format. JDBC makes the data row-oriented again.

@sundy-li
Member

The JDBC driver runs over Flight SQL. If we support Flight SQL, we can seamlessly connect to the JDBC ecosystem (many third-party tools use JDBC to connect).

https://www.dremio.com/blog/jdbc-driver-for-arrow-flight-sql/

But we will not use JDBC inside Databend.

@youngsofun
Member

youngsofun commented Mar 17, 2023

Awesome. I'm curious why anyone would want to use JDBC with Flight. The whole point of a Flight server is that we can get Arrow data directly in columnar format. JDBC makes the data row-oriented again.

  1. Many data tools use JDBC to connect to various databases.
  2. Arrow has an official JDBC driver, https://github.com/apache/arrow/tree/main/java/flight/flight-sql-jdbc-driver, and it seems to me that the interface design of Flight SQL is largely influenced by JDBC, so we can use it for testing.

By the way, do you know of any client-side tools that use Flight SQL and the columnar format directly? @kesavkolla

@xinlifoobar
Contributor

xinlifoobar commented Mar 17, 2023

There is some code that already covers the protocol details.

api/rpc: this is for internal use in cluster queries (server to server).
But within this issue, we are going to support the client-to-server protocol.

Sorry, I didn't quite get it, because the RPC listener is already on.

https://github.com/datafuselabs/databend/blob/main/src/binaries/query/main.rs#L196-L203

So finishing this issue just involves:

  1. Exposing the RPC listener in the Console UI, which should be done by adding a new service here: https://github.com/datafuselabs/databend/tree/c1d824f0824664a539fefca3f41fafe941bb2f01/src/query/service/src/servers.
  2. Testing for incompatibilities between the server and a popular Arrow Flight driver such as JDBC or ODBC?

I must apologize to @sundy-li for this, because I had a lot of misunderstandings before I started investigating Arrow Flight SQL. I have since done some deep dives that I can share.

  • For implementing the Arrow Flight SQL client-server protocol, there is a pretty nice Flight.proto protobuf file for reference. At this stage, the RPC service could handle these calls, but some actions, such as do_handshake, do_get, and do_put, are not implemented.
  • The current databend-query includes an arrow crate; I recommend also including the arrow-flight crate for some gRPC data structures.
  • Some pretty good examples can be found in DuckDB's doc "SQL on Apache Arrow". I like the idea of reading from the Arrow Flight server and transferring the data to a Python pandas DataFrame for computations.

Let me know if I could still be of help @youngsofun
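For readers new to the protocol, the client-to-server flow discussed in this thread can be sketched as a toy, in-process model. The RPC names GetFlightInfo and DoGet and the message names FlightInfo and Ticket come from Flight.proto; everything else here (the `ToyFlightSqlServer` class, its hard-coded tables, the batch size) is a hypothetical illustration with no gRPC and no real Arrow data, and it is not Databend's implementation.

```python
# Toy, in-process model of the Flight SQL query flow; no gRPC, no real Arrow.
import uuid
from dataclasses import dataclass, field
from typing import Dict, Iterator, List

Columns = Dict[str, list]


@dataclass
class Ticket:
    """Opaque token the client hands back to DoGet."""
    value: str


@dataclass
class FlightInfo:
    """Reply to GetFlightInfo: result column names plus tickets to fetch the data."""
    column_names: List[str]
    tickets: List[Ticket] = field(default_factory=list)


class ToyFlightSqlServer:
    """Hypothetical stand-in for a server; results are hard-coded per query string."""

    def __init__(self, tables: Dict[str, Columns]) -> None:
        self._tables = tables
        self._pending: Dict[str, str] = {}

    def get_flight_info(self, query: str) -> FlightInfo:
        """Accept the query and return a ticket instead of the data itself."""
        if query not in self._tables:
            raise KeyError(f"unknown query: {query}")
        ticket = Ticket(uuid.uuid4().hex)
        self._pending[ticket.value] = query
        return FlightInfo(column_names=list(self._tables[query]), tickets=[ticket])

    def do_get(self, ticket: Ticket, batch_size: int = 2) -> Iterator[Columns]:
        """Stream the result as columnar batches for a previously issued ticket."""
        query = self._pending.pop(ticket.value)
        columns = self._tables[query]
        num_rows = len(next(iter(columns.values()), []))
        for start in range(0, num_rows, batch_size):
            yield {
                name: values[start:start + batch_size]
                for name, values in columns.items()
            }


server = ToyFlightSqlServer(
    {"SELECT id, city FROM t": {"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]}}
)
info = server.get_flight_info("SELECT id, city FROM t")
batches = [batch for ticket in info.tickets for batch in server.do_get(ticket)]
```

The two-step shape (ask for flight info first, then redeem tickets) is what lets a real server spread one result across several endpoints; this sketch issues a single ticket to keep the flow readable.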

@sundy-li
Member

done in #10732


6 participants