fix(ibis): fix the function white list of DuckDB #1129

Merged: 3 commits into Canner:main from fix/func-white-list on Apr 1, 2025

Conversation

goldmedal
Contributor

@goldmedal goldmedal commented Apr 1, 2025

Description

  • Remove the duplicate functions
  • Fix the return types
  • Add more type mapping for DuckDB type

Known issues

  • The map type is not supported yet.
  • Consider supporting any for the return type or parameter type.

Summary by CodeRabbit

  • New Features
    • Enhanced data processing by broadening support for various type formats. The system now recognizes both array and list notations and supports additional numeric, binary, and time-related formats, ensuring improved compatibility and data handling.


coderabbitai bot commented Apr 1, 2025

Walkthrough

This pull request updates test expectations in the local file connector by reducing the expected function count and changing the targeted function from "array_length" to "regexp_escape," along with corresponding updates to its description, parameters, and return type. It also expands data type support in core logical planning by modifying the conditions in the create_list_type and map_data_type functions to recognize both "array" and "list" and by adding several new data type mappings, including nuanced handling of timestamp types.

Changes

File | Changes Summary
ibis-server/.../test_functions.py | Updated test expectations in test_function_list: reduced function count from DATAFUSION_FUNCTION_COUNT + 437 to DATAFUSION_FUNCTION_COUNT + 429; changed the function under test from "array_length" to "regexp_escape" with updated description, parameter, and return type details.
wren-core/.../utils.rs | Enhanced create_list_type and map_data_type functions: modified conditions to recognize both "array" and "list" types; added mappings for new data types (e.g., "utinyint", "usmallint", "uinteger", "ubigint", "blob", "hugeint", "uhugeint", "bit", "timestamp_ns"); extended timestamp mapping.

Sequence Diagram(s)

sequenceDiagram
    participant C as Caller
    participant M as map_data_type
    C->>M: Call map_data_type("input data type")
    M->>M: Check if input starts with "array" or "list"
    M->>M: Evaluate additional conditions for new types (e.g., "utinyint", "timestamp with time zone")
    M->>C: Return corresponding DataType mapping
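
The flow in the diagram can be sketched in Python as follows. This is only an illustration: the real map_data_type is Rust code in wren-core/core/src/logical_plan/utils.rs and returns Arrow DataType values, whereas this sketch returns plain strings. The table holds only mappings named in this review; the "List" result and the ValueError fallback for unknown names are assumptions.

```python
# Illustrative stand-in for the Rust map_data_type; not the actual implementation.
# Only mappings quoted in this review are listed here.
SCALAR_TYPES = {
    "null": "Null",
    "varbinary": "Binary",  # Trino-compatible
    "blob": "Binary",       # DuckDB
    "hugeint": "Int64",     # narrowed to 64 bits
    "uhugeint": "UInt64",   # narrowed to 64 bits
    "bit": "Boolean",
    "timestamp_ns": "Timestamp(Nanosecond, None)",
}


def map_data_type(data_type: str) -> str:
    lower = data_type.lower()
    # Nested types are matched by prefix; the element type is ignored.
    if lower.startswith(("array", "list")):
        return "List"  # assumption: placeholder for the list DataType
    try:
        return SCALAR_TYPES[lower]
    except KeyError:
        # Assumption: the sketch rejects unknown names, which also covers
        # the unsupported "map" type listed under known issues.
        raise ValueError(f"unsupported data type: {data_type}") from None
```

Lower-casing first means "ARRAY", "Array" and "array" all take the same branch, mirroring the to_lowercase call quoted later in the review.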

Suggested labels

ibis, dependencies, python

Suggested reviewers

  • wwwy3y3

Poem

I'm a rabbit with a skip in my stride,
Hopping through code with updates wide.
Functions and types all change in play,
Bringing fresh joy to our workday.
A nibble of code, a hop of cheer—
Celebrating changes with a bunny cheer!

@github-actions github-actions bot added the core, ibis, rust, and python labels Apr 1, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-core/core/src/logical_plan/utils.rs (2)

104-187: Missing support for DuckDB map type.

The PR objectives mentioned that the "map" type is not supported. Consider adding a comment in the code to document this known limitation, potentially with a TODO.

 pub fn map_data_type(data_type: &str) -> Result<DataType> {
     let lower = data_type.to_lowercase();
     let lower_data_type = lower.as_str();
-    // TODO: try parse nested type by arrow
+    // TODO: try parse nested type by arrow
+    // Known limitation: DuckDB "map" type is currently not supported
     // Currently, we don't care about the element type of the array or struct.
     // We only care about the array or struct itself.

170-180: Consider adding support for the "any" type.

The PR objectives mentioned future consideration for the "any" type for return types or parameter types. Consider adding a placeholder mapping for the "any" type.

         "time" => DataType::Time32(TimeUnit::Nanosecond), // chose the smallest time unit
         "null" => DataType::Null,
+        // TODO: Consider proper support for the "any" type in the future
+        "any" => DataType::Null, // Temporary mapping for the "any" type
         // Trino Compatible Types
         "varbinary" => DataType::Binary,
         // DuckDB Compatible Types
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52f9163 and e6c20d1.

⛔ Files ignored due to path filters (1)
  • ibis-server/resources/function_list/duckdb.csv is excluded by !**/*.csv
📒 Files selected for processing (2)
  • ibis-server/tests/routers/v3/connector/local_file/test_functions.py (1 hunks)
  • wren-core/core/src/logical_plan/utils.rs (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: cargo test (win64)
  • GitHub Check: cargo test (macos-aarch64)
  • GitHub Check: cargo test (macos)
  • GitHub Check: ci
  • GitHub Check: test
  • GitHub Check: cargo check
🔇 Additional comments (7)
ibis-server/tests/routers/v3/connector/local_file/test_functions.py (2)

56-57: Reduced DuckDB function count confirmed.

The comment and assertion correctly reflect the reduction in DuckDB functions from 437 to 429, which aligns with the PR objective of removing duplicate functions.


58-66:

❓ Verification inconclusive

Function test updated properly from array_length to regexp_escape.

The test now verifies the correct function signature for regexp_escape instead of array_length, with appropriate parameter and return types. This change matches the PR objective of correcting function return types.
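
A check of this kind might look like the following sketch. The payload shape (a list of dicts with name, param_types and return_type keys) and the helper find_function are hypothetical and not taken from the test file; only the regexp_escape name and its Utf8 parameter and return types come from this review.

```python
def find_function(functions: list[dict], name: str) -> dict:
    """Return the single entry called `name`; fail on missing or duplicate entries."""
    matches = [f for f in functions if f["name"] == name]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one {name!r}, found {len(matches)}")
    return matches[0]


# Hypothetical sample payload, standing in for the function-list response.
sample = [
    {"name": "regexp_escape", "param_types": "Utf8", "return_type": "Utf8"},
    {"name": "some_other_function", "param_types": "Int64", "return_type": "Int64"},
]

entry = find_function(sample, "regexp_escape")
assert entry["param_types"] == "Utf8"
assert entry["return_type"] == "Utf8"
```

Requiring exactly one match also guards against duplicate entries, which is one of the things this PR removes.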


🏁 Script executed:

#!/bin/bash
# Verify that the regexp_escape function is properly defined in the DuckDB functions list

# Check if the regexp_escape function exists in the DuckDB function list file
if [ -f "resources/function_list/duckdb.csv" ]; then
  grep -E "regexp_escape.*Utf8.*Utf8" resources/function_list/duckdb.csv
else
  echo "DuckDB function list file not found at expected location."
  # Try to find it elsewhere
  fd -g "*duckdb.csv" -x grep -E "regexp_escape.*Utf8.*Utf8" {}
fi

Length of output: 238


Action Needed: Confirm DuckDB function list file location and content

The test update in ibis-server/tests/routers/v3/connector/local_file/test_functions.py (lines 58–66) correctly changes the verification from array_length to regexp_escape with the intended parameter (Utf8) and return (Utf8) types. However, the automated check for the function signature in the DuckDB functions file did not find the expected file at resources/function_list/duckdb.csv.

Please manually verify the following:

  • That the DuckDB function list file exists in the repository (or has been relocated/renamed).
  • That the file contains the correct signature for regexp_escape (matching "regexp_escape.*Utf8.*Utf8").

Once confirmed, please update this review comment accordingly if any changes are needed.
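
The path filter listing above places the file at ibis-server/resources/function_list/duckdb.csv, while the first branch of the script looks at resources/function_list/ relative to the working directory, which may explain why that check did not find it. One way to run the manual check independently of the working directory is sketched below in Python. The function name find_signature and its defaults are illustrative; the pattern is the one used in the script.

```python
import re
from pathlib import Path


def find_signature(
    root: str,
    filename: str = "duckdb.csv",
    pattern: str = r"regexp_escape.*Utf8.*Utf8",
) -> list[tuple[Path, str]]:
    """Search every file called `filename` under `root` for lines matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(filename)):
        for line in path.read_text(encoding="utf-8").splitlines():
            if regex.search(line):
                hits.append((path, line))
    return hits


if __name__ == "__main__":
    # Run from anywhere inside the checkout; "." is searched recursively.
    for path, line in find_signature("."):
        print(f"{path}: {line}")
```

An empty result means either that no duckdb.csv exists under the root or that none of its lines match the pattern.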

wren-core/core/src/logical_plan/utils.rs (5)

30-30: Enhanced list type recognition to handle both "array" and "list" types.

Adding support for "list" alongside "array" improves DuckDB compatibility, since DuckDB has both a variable-length LIST type and a fixed-length ARRAY type.


110-110: Expanded data type recognition to include both "array" and "list" prefixes.

This change applies the same prefix check in map_data_type as in create_list_type, so types starting with either "array" or "list" are handled the same way in both functions.


120-131: Added unsigned integer type mappings for DuckDB compatibility.

These additions properly support unsigned integer types in DuckDB, improving type compatibility for the connector.
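
As a rough illustration of what these mappings cover, the sketch below lists the four DuckDB unsigned integer types with their bit widths. The Arrow type names it produces (UInt8 through UInt64) are the natural counterparts but are an assumption here, since the review does not quote the exact match arms. The helper names are hypothetical.

```python
# Bit widths of DuckDB's unsigned integer types.
UNSIGNED_BITS = {"utinyint": 8, "usmallint": 16, "uinteger": 32, "ubigint": 64}


def arrow_name(duckdb_type: str) -> str:
    """Assumed Arrow counterpart, e.g. 'utinyint' -> 'UInt8'."""
    return f"UInt{UNSIGNED_BITS[duckdb_type.lower()]}"


def max_value(duckdb_type: str) -> int:
    """Largest value representable by the given unsigned type."""
    return 2 ** UNSIGNED_BITS[duckdb_type.lower()] - 1


for name in UNSIGNED_BITS:
    print(f"{name:10} -> {arrow_name(name):6} max={max_value(name)}")
```

The largest ubigint value, 2**64 - 1, does not fit in a signed 64-bit integer, so reusing the signed mappings for these types would lose part of the range.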


145-153: Expanded timestamp with timezone type handling.

The implementation now correctly handles various ways timestamp with timezone types can be expressed in DuckDB, including "timestamp with time zone" (with spaces) and "time with time zone". The comment properly explains why "time with time zone" is mapped to timestamp.


174-179: Added additional DuckDB-specific type mappings.

The implementation adds proper support for DuckDB-specific types:

  • blob → Binary
  • hugeint → Int64 (with appropriate comment about lack of direct support)
  • uhugeint → UInt64 (with appropriate comment about lack of direct support)
  • bit → Boolean (with appropriate comment about lack of direct support)
  • timestamp_ns → Timestamp(TimeUnit::Nanosecond, None)

These mappings improve compatibility with DuckDB data sources.
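
Two of these mappings narrow the value range: DuckDB's HUGEINT and UHUGEINT are 128-bit integers, while Int64 and UInt64 hold 64 bits. The sketch below only illustrates that gap with plain Python integers. Whether it matters in practice depends on how the mapped types are used, which this review does not say. The helper names are hypothetical.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1
HUGEINT_MAX = 2**127 - 1   # largest 128-bit signed value
UHUGEINT_MAX = 2**128 - 1  # largest 128-bit unsigned value


def fits_int64(value: int) -> bool:
    """True if `value` survives a hugeint -> Int64 mapping unchanged."""
    return INT64_MIN <= value <= INT64_MAX


def fits_uint64(value: int) -> bool:
    """True if `value` survives a uhugeint -> UInt64 mapping unchanged."""
    return 0 <= value <= UINT64_MAX


assert fits_int64(INT64_MAX)
assert not fits_int64(HUGEINT_MAX)
assert fits_uint64(UINT64_MAX)
assert not fits_uint64(UHUGEINT_MAX)
```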

@goldmedal goldmedal requested a review from douenergy April 1, 2025 05:03
Contributor

@douenergy douenergy left a comment

Thanks @goldmedal, it is a heavy job.

@douenergy douenergy merged commit e33e6f6 into Canner:main Apr 1, 2025
16 checks passed
@goldmedal goldmedal deleted the fix/func-white-list branch April 1, 2025 05:37
@douenergy
Contributor

closed #1126
