Support Tiktoken Gpt-4.1 Model #7453

Merged May 2, 2025 (2 commits)

tarekgh (Member) commented May 2, 2025

Fixes #7450

Copilot AI review requested due to automatic review settings May 2, 2025 00:19

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for the new "gpt-4.1" model in the Tiktoken tokenizer, addressing issue #7450.

  • Added test cases for "gpt-4.1" and "gpt-4.1-mini" in the test suite.
  • Updated the model mappings in the tokenizer to cover both the exact name "gpt-4.1" and names that start with the "gpt-4.1-" prefix, such as "gpt-4.1-mini".

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs | Added inline test data for "gpt-4.1" variants to validate encoding support. |
| src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs | Inserted new mappings for "gpt-4.1-" and "gpt-4.1" models, mapping them to ModelEncoding.O200kBase. |
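
The reason both an exact entry and a prefix entry are needed can be illustrated with a minimal sketch. This is not the code in TiktokenTokenizer.cs: the table names, the function `encoding_for_model`, and the choice to check exact names before prefixes are all assumptions made for illustration, and the sketch is written in Python rather than C# for brevity. Only the "gpt-4.1" and "gpt-4.1-" entries come from the change described above; the other rows are included to show the surrounding pattern.

```python
# Hypothetical sketch of model-name-to-encoding lookup.
# NOT the actual Microsoft.ML.Tokenizers implementation.

# Exact model names (assumed table name).
MODEL_TO_ENCODING = {
    "gpt-4.1": "o200k_base",   # entry described in this PR
    "gpt-4o": "o200k_base",    # illustrative neighbour
    "gpt-4": "cl100k_base",    # illustrative neighbour
}

# Prefixes for versioned or suffixed model names (assumed table name).
MODEL_PREFIX_TO_ENCODING = [
    ("gpt-4.1-", "o200k_base"),  # entry described in this PR
    ("gpt-4o-", "o200k_base"),   # illustrative neighbour
    ("gpt-4-", "cl100k_base"),   # illustrative neighbour
]


def encoding_for_model(model_name: str) -> str:
    """Return the encoding name for a model, or raise ValueError."""
    # 1. Try an exact match first.
    if model_name in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[model_name]
    # 2. Fall back to the first matching prefix.
    for prefix, encoding in MODEL_PREFIX_TO_ENCODING:
        if model_name.startswith(prefix):
            return encoding
    raise ValueError(f"No encoding known for model '{model_name}'")


if __name__ == "__main__":
    for name in ("gpt-4.1", "gpt-4.1-mini", "gpt-4-turbo"):
        print(name, "->", encoding_for_model(name))
```

Note that "gpt-4.1-mini" does not start with "gpt-4-" (the character after "gpt-4" is a dot, not a dash), so in a scheme like this the existing "gpt-4-" prefix would not pick it up, and dedicated "gpt-4.1" entries are required.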

tarekgh added this to the ML.NET 5.0 milestone May 2, 2025

tarekgh (Member, Author) commented May 2, 2025

tarekgh requested a review from michaelgsharp May 2, 2025 00:20

codecov bot commented May 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.99%. Comparing base (d4f690c) to head (0f08006).
Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7453      +/-   ##
==========================================
- Coverage   69.00%   68.99%   -0.02%     
==========================================
  Files        1483     1482       -1     
  Lines      274563   273879     -684     
  Branches    28395    28254     -141     
==========================================
- Hits       189455   188955     -500     
+ Misses      77672    77537     -135     
+ Partials     7436     7387      -49     
| Flag | Coverage Δ |
| --- | --- |
| Debug | 68.99% <100.00%> (-0.02%) ⬇️ |
| production | 63.27% <100.00%> (+<0.01%) ⬆️ |
| test | 89.46% <ø> (-0.03%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| ...Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs | 78.55% <100.00%> (+0.07%) ⬆️ |
| ...est/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs | 99.00% <ø> (ø) |

... and 17 files with indirect coverage changes


tarekgh merged commit 1041dc3 into dotnet:main May 2, 2025
25 checks passed
Successfully merging this pull request may close these issues.

Add GPT 4.1 to Tiktoken Tokenizer