Skip to content

epic: SQLite database implementation #1241

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #1154
nguyenhoangthuan99 opened this issue Sep 18, 2024 · 11 comments · Fixed by #1240 or #1336
Closed
Tracked by #1154

epic: SQLite database implementation #1241

nguyenhoangthuan99 opened this issue Sep 18, 2024 · 11 comments · Fixed by #1240 or #1336
Assignees
Labels
category: model management Model pull, yaml, model state P0: critical Mission critical type: epic A major feature or initiative
Milestone

Comments

@nguyenhoangthuan99
Copy link
Contributor

nguyenhoangthuan99 commented Sep 18, 2024

Implement a ModelListUtils class base on this discussion #1154

Overview

The ModelListUtils class provides a robust and thread-safe interface for managing a list of machine learning models. It handles operations such as adding, updating, deleting, and retrieving model information, with a focus on maintaining unique identifiers for each model.

Key Features

  1. Thread-safe Operations: Uses mutex locking to ensure thread safety for all operations.
  2. File-based Storage: Manages model information in a text file, allowing for persistence between program runs.
  3. Unique Identifier Generation: Automatically generates unique, shortened aliases for models.
  4. CRUD Operations: Supports Creating, Reading, Updating, and Deleting model entries.
  5. Error Handling: Robust error checking and exception throwing for invalid operations.

Main Components

ModelEntry Struct

Represents a single model entry with fields:

  • model_id: Unique identifier for the model
  • author_repo_id: Author or repository identifier
  • branch_name: Branch name in the repository
  • path_to_model_yaml: Path to the model's YAML file
  • model_alias: Shortened alias for the model
  • status: Current status of the model (READY or RUNNING)

ModelListUtils Class Methods

  1. LoadModelList()

    • Loads model entries from the file, creating the file if it doesn't exist.
    • Returns a vector of ModelEntry objects.
  2. SaveModelList()

    • Saves the current list of model entries to the file.
  3. GetModelInfo(identifier)

    • Retrieves model information based on model_id or model_alias.
    • Throws an exception if the model is not found.
  4. AddModelEntry(new_entry, use_short_alias)

    • Adds a new model entry to the list.
    • Optionally generates a shortened alias for the model.
    • Ensures uniqueness of model_id and model_alias.
  5. UpdateModelEntry(identifier, updated_entry)

    • Updates an existing model entry.
    • Identified by either model_id or model_alias.
  6. DeleteModelEntry(identifier)

    • Deletes a model entry if it exists and is in READY state.
  7. PrintModelInfo(entry)

    • Prints detailed information about a model entry.
  8. GenerateShortenedAlias(model_id, entries)

    • Generates a unique, shortened alias for a model.
    • Uses a hierarchical approach to create aliases, prioritizing brevity while ensuring uniqueness.

Alias Generation Logic

The GenerateShortenedAlias function creates aliases in the following order:

  1. Filename only (e.g., model_id_xxx)
  2. Parent directory + filename (e.g., llama3.1-7b-gguf:model_id_xxx)
  3. Grandparent directory + parent directory + filename (e.g., bartowski:llama3.1-7b-gguf/model_id_xxx)
  4. Full path (e.g., huggingface.co:bartowski/llama3.1-7b-gguf/model_id_xxx)

It returns the shortest unique alias, appending a numeric suffix if necessary to ensure uniqueness.

@nguyenhoangthuan99 nguyenhoangthuan99 self-assigned this Sep 18, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 moved this from In Progress to In Review in Menlo Sep 18, 2024
@github-project-automation github-project-automation bot moved this from In Review to Completed in Menlo Sep 18, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 moved this from Completed to QA in Menlo Sep 19, 2024
@freelerobot freelerobot reopened this Sep 23, 2024
@github-project-automation github-project-automation bot moved this from QA to Scheduled in Menlo Sep 23, 2024
@freelerobot
Copy link
Contributor

freelerobot commented Sep 23, 2024

Problem

  • Cortex needs a model index to know which models are available (and if the model has been correctly configured)

Success Criteria

  • models.list does not duplicate information in model.yaml
    • models.list has a clear separate purpose than the yaml.
    • duplicating fields across list and yaml introduces edge cases & unmaintainability
  • cortex pull/run correctly updates models.list
  • cortex import correctly updates models.list

Related

Questions

  1. Naming: models.list or model.list? The former is more grammatically correct.
  2. Why duplicate fields like author_repo_id, branch_name across both model.yaml and models.list? Should we move fields from model.yaml into list instead? OR vice versa?
  3. Is there a better way to track model status than to do file IO during runtime?
  4. Should we version the models.list? i.e. if we change the schema in the future, it is trackable.
  5. Example: user symlinks a local model: llama3 local imported seems redundant
llama3 local imported /Users/nicolezhu/cortexcpp-nightly/models/imported/llama3.yml llama3 READY
  1. @louis-jan mentioned using a db at one point. Does this beg that question?
  2. When is model.list first created? Doesn't seem to be upon installation. Related: bug: Fail to get list model information: Unable to create model.list file #1285

Tldr

At the moment, the current models.list design is complicated. I am worried ab out the long term maintainability and edge cases.

  1. Should model.list just be a bloody database?
  2. What is the separation of concerns btw this and model.yaml (trying our best to not duplicate fields)

My 2cents:

  • models.list: tracks model metadata needed for locally managing models, specific to cortex. e.g. id, alias, yaml path, maybe binary path
  • model.yaml: tracks runtime parameters needed for running models, NOT specific to cortex runtime, e.g. source url, model name, all hyperparams...

cc @dan-homebrew

@freelerobot
Copy link
Contributor

freelerobot commented Sep 23, 2024

cc @louis-jan to take a look and lets discuss here 🙏

@freelerobot freelerobot added the category: model management Model pull, yaml, model state label Sep 23, 2024
@freelerobot freelerobot changed the title model.list implementation epic: model.list implementation Sep 23, 2024
@freelerobot freelerobot added type: epic A major feature or initiative P0: critical Mission critical labels Sep 23, 2024
@nguyenhoangthuan99
Copy link
Contributor Author

nguyenhoangthuan99 commented Sep 23, 2024

  1. I'll update to models.list.
  2. We can move all author_repo_id, branch_name to models.list
  3. Currently, Sang has a command for checking model status by calling to server. We will remove the model status field. We also don't use IO read file for check model status currently.
  4. We should version the models.list, I'll find the proper way to implement it.
  5. We fill the models.list with this content llama3 local imported /Users/nicolezhu/cortexcpp-nightly/models/imported/llama3.yml llama3 READY because we are organizing models.list as a text file, so for easier to implement and avoid unexpected bugs, we must ensure the number of element in a line is constant.
    for example:
llama3 local imported /Users/nicolezhu/cortexcpp-nightly/models/imported/llama3.yml llama3 READY

mistral-model-id mistral/mistral-nemo 2407 /Users/nicolezhu/cortexcpp-nightly/models/huggingface.co/mistral/mistral-nemo/Mistral-Nemo.yml mistral-nemo-alias READY

This is our final model folder structure from this discussion #1154

/models
    model.list
    
    # Huggingface GGUF Model Folder format
    # This assumes we have some sort of quantization selection wizard when downloading?
    /huggingface.co
        /bartowski
            /Mixtral-8x22b-v0.1-gguf (includes all quants)
                Mixtral-8x22B-v0.1-IQ3_M.yaml
                Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf...
                Mixtral-8x22B-v0.1-Q8_M.yaml
                Mixtral-8x22B-v0.1-Q8_M-00001-of-00005.gguf...

    # Built-in Library Model Folder format
    /cortex.so (this is our Built-in Model Library, based on Git, that will be mirrored across a few sites)
        /llama3.1 (model)
            /q4-tensorrt-llm (tag)
                ...engine_files
                model.yaml
            /q8-gguf **(tag)**
                model.yaml
   # Imported model from local 
    /imported
            <imported-model-id>.yaml

    # Future Model Source
    # Has its own model folder format

So we are aiming to make models.list can support model from many sources (local imported, cortexso, other hugging face branch, nvidia-ngc, ... ).

author_repo_id, branch_name also help us to know the source of model. We can use this information to show when list models.
6. I also think database for models.list instead of text file is necessary
7. Currently, models.list is first created as the first command cannot find models.list file. The issue #1285 caused by model pull command, still at old version and hasn't supported models.list yet. James and Sang is working with model run and model pull command with new models.list.

@freelerobot
Copy link
Contributor

Gotcha, much of the above makes sense:

  1. What's the reason for including the following in the model.list instead of yaml? (not saying its wrong, im just curious)
  • author_repo_id: Author or repository identifier
  • branch_name: Branch name in the repository
  1. Should we just include the models.list in installation, as we write the cortexcpp folder? What do we get from dynamically creating it later on?

@nguyenhoangthuan99
Copy link
Contributor Author

  1. Model Metadata and Configuration:
  • model.list: This is a header/metadata index containing information to identify models.
  • model.yml: This is a configuration file that includes information for loading and running models. In the future, it may also be used for features like recommending inference parameters(ngl, context length) based on model specifications.
  • author_repo_id and branch_name: These are not used for inference purposes. They are sometimes included in the name and model fields, but name and model primary purpose is to help the LlamaCpp engine distinguish between different models.
  • We can add the necessary folder structure during the installation step. I'll collaborate with Hien on implementing this.
  • Currently, both the engines and models folders are created automatically when they're not found at the first run.
  • For development purposes, I need to create a models.list file to prevent crashes when running test cases in the Continuous Integration (CI) environment.

@freelerobot
Copy link
Contributor

freelerobot commented Sep 23, 2024 via email

@dan-menlo
Copy link
Contributor

  1. I also think database for models.list instead of text file is necessary

@0xSage @nguyenhoangthuan99 @louis-jan: If we are eventually going to move to a simple SQLite database for model management, I would prefer to do it now:

SQLite vs. models.list

  • I think a simple SQLite database is better, if we take this approach
  • We have a lot of moving parts depending on models.list, and a file corruption would brick both Jan and Cortex
  • models.list fields are clearly structured, and SQL is better for this case

Versioning & Migration

  • We would "version" the SQLite database by Cortex version (e.g. v0.1.0)
  • Cortex Updater would run a database migration (up/down), vv @vansangpfiev
  • We don't need to implement this for v0.1, but keep in mind as our strategy

cc @vansangpfiev @namchuai

@dan-menlo
Copy link
Contributor

dan-menlo commented Sep 24, 2024

  1. What's the reason for including the following in the model.list instead of yaml? (not saying its wrong, im just curious)
  • author_repo_id: Author or repository identifier
  • branch_name: Branch name in the repository

@nguyenhoangthuan99 @vansangpfiev @louis-jan @namchuai Let's start dealing with Model List as a SQLite database, and thinking through clearly what we need for Cortex to function.

I also see SQLite database would also allow us to track things like status, which is useful for operations.

  • model_id: Unique identifier for the model
  • author_repo_id: Author or repository identifier
  • branch_name: Branch name in the repository
  • path_to_model_yaml: Path to the model's YAML file
  • model_alias: Shortened alias for the model
  • status: Current status of the model (READY or RUNNING)

However, I note in our discussion, the example seems to have a local, imported in the struct.

  • Are we also tracking "origin" of model?
  • I think this could be very useful, especially for symlinked models

As part of this discussion exercise, can we do a few mockup entries, to see whether our assumptions work?

model_id author_repo_id branch_name path_to_model_yaml model_alias
Cell Cell Cell Cell
Cell Cell Cell Cell Cell

@namchuai
Copy link
Contributor

IMO, I'm happy with using database approach. File-based approach does not play well with migration.
Couple of thoughts:

  • Have to change cortex code base quite a lot.
  • cortex update(maybe even downgrade) might need to update database schema as well?

@vansangpfiev
Copy link
Contributor

I agree with the database approach. @nguyenhoangthuan99 implemented a good interface for model.list, so only need to change the implementation to replace it with database.

@vansangpfiev vansangpfiev self-assigned this Sep 25, 2024
@vansangpfiev vansangpfiev changed the title epic: model.list implementation epic: SQLite database implementation Sep 25, 2024
@gabrielle-ong
Copy link
Contributor

gabrielle-ong commented Oct 3, 2024

Closing as complete - sqlite
Image

@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Menlo Oct 3, 2024
@gabrielle-ong gabrielle-ong added this to the v1.0.0 milestone Oct 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: model management Model pull, yaml, model state P0: critical Mission critical type: epic A major feature or initiative
Projects
Archived in project
7 participants