epic: Improve Cortex Engine Management #1416


Closed · 3 of 6 tasks

gabrielle-ong opened this issue Oct 3, 2024 · 17 comments · Fixed by #1546

@gabrielle-ong (Contributor) commented Oct 3, 2024

Goal

  • Cortex has a clear Engine Abstraction
  • Engines have dependency management
    • Dependency resolution (OS, drivers)
    • Dependency handling (e.g. missing drivers, error messages)
  • Engines have state management
    • Engines have metadata stored in cortex.db

Tasklist

Open Questions

  • Do we allow users to run multiple engines in parallel?
    • Yes, engines can run in parallel
    • Dan: users run models in parallel, and those models run engines

Appendix

Improvements to #1072
(image attachment)

@gabrielle-ong converted this from a draft issue Oct 3, 2024
@dan-menlo changed the title from "epic: Improve Cortex Engine Management" to "epic: Improve Cortex Engine Management and Docs" Oct 13, 2024
@dan-menlo (Contributor) commented

User Journey

  • I download Cortex for the first time
  • Cortex Local Installer or Network installer auto-detects my hardware and installs the appropriate variant(s)
    • Decision: should we install all variants?
    • Decision: what happens if a user has both an AMD GPU and an Nvidia GPU?
  • User downloads model and runs it
    • Tries using default llama.cpp
    • Can switch to other llama.cpp variant and try it (e.g. Vulkan)

@dan-menlo assigned namchuai and unassigned vansangpfiev Oct 14, 2024
@dan-menlo added this to the v1.0.2 milestone Oct 14, 2024
@freelerobot moved this from Investigating to Planning in Menlo Oct 15, 2024
@dan-menlo changed the title from "epic: Improve Cortex Engine Management and Docs" to "planning: Improve Cortex Engine Management and Docs" Oct 19, 2024
@namchuai (Contributor) commented

@dan-homebrew, here's my opinion on this task. I think we should consolidate both #1453 and #1454 into this ticket.

  1. Decision: should we install all variants?
  • I don't think we should install all variants by default.
  2. Decision: what happens if a user has both an AMD GPU and an Nvidia GPU?
  • From my understanding, a user can only have either an AMD or an Nvidia GPU running at a time. So the question is really how to manage engines when the user switches back and forth between AMD and Nvidia GPUs. The use case extends further: the user can unplug the GPU entirely, leaving only the CPU.

So, I think we need a solution at runtime: detect the current hardware state and automatically choose the best engine possible. We also have to provide users a way to specify the engine version and variant they want to run their model with. A rough sketch of the runtime selection follows.
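
A minimal sketch of that runtime selection, assuming a Linux host, nvidia-smi/vulkaninfo as detection probes, and a hypothetical linux-amd64-vulkan variant name (the CUDA/AVX names appear elsewhere in this thread):

# Probe the hardware, then select a variant via the proposed `use` command.
# The probe tools, the vulkan variant name, and the argument shape are assumptions.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  variant="linux-amd64-avx2-cuda-12-0"   # Nvidia GPU present
elif command -v vulkaninfo >/dev/null 2>&1 && vulkaninfo >/dev/null 2>&1; then
  variant="linux-amd64-vulkan"           # AMD or other Vulkan-capable GPU
else
  variant="linux-amd64-avx2"             # CPU only
fi
cortex engines llama-cpp use "$variant"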

Approach

I fully agree with option 1 you proposed in #1453.

File system structure

engines
|--cortex.llamacpp
    |--<variant-version>(e.g. mac-arm64-0.1.34)
        |--libengine.dylib
        |--version.txt
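
With this layout (a sketch; folder names are illustrative), listing installed variants is just a directory listing, since variant and version are encoded in the folder name:

$ ls engines/cortex.llamacpp/
mac-arm64-0.1.34
linux-amd64-avx-0.1.35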

Tasks breakdown

  1. Restructure the engine file structure
  2. Make a small POC for setting the DLL path on Windows, to be sure it works
  3. Allow the engine install command to download and install a specific version and variant
  4. Engine use command: cortex engines llama-cpp use <variant-version> lets the user set the default version and variant for a particular engine (sketched below)
  5. Add a -v flag, cortex engines llama-cpp -v, to show the currently selected (default) engine for llama-cpp
  6. Add a list command, cortex engines llama-cpp [filter], to display a filterable list of installed engines for llama-cpp
  7. Update the engine uninstall command
  8. Update the cortex ps command to display the engine variant & version used to load each model
  9. Create and update the corresponding HTTP APIs
  10. Integrate model loading with this new default-engine logic
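
A rough transcript of items 4–6 (a sketch; output format and names are illustrative):

# Set the default variant-version, then inspect it
$ cortex engines llama-cpp use mac-arm64-0.1.34
$ cortex engines llama-cpp -v
mac-arm64-0.1.34

# List installed variants, optionally filtered
$ cortex engines llama-cpp linux
linux-amd64-avx-0.1.35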

Edge cases

  1. I'm not sure whether multiple variants of the same engine can be loaded at once.
  2. We have to set the correct DLL path for Windows dynamically (see the sketch below).
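
A rough PowerShell sketch of the DLL-path fix (assumptions: the data folder lives under %USERPROFILE%\cortexcpp and the engine DLL resolves through the process PATH; neither is confirmed in this thread):

# Prepend the selected variant folder to PATH before loading the engine,
# so the engine DLL and its dependencies (e.g. CUDA runtime DLLs) resolve.
PS> $variant = "windows-amd64-avx2-cuda-12-0-0.1.36"
PS> $env:PATH = "$env:USERPROFILE\cortexcpp\engines\cortex.llamacpp\$variant;$env:PATH"
PS> cortex.exe run my-model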

@dan-menlo (Contributor) commented Oct 24, 2024

@namchuai Merging "Cortex handles Engine Variants" into this issue:

Cortex handles Engine Variants

Tasklist

  • API Design
  • CLI Design

Questions

Scenario

  • There are some users who have both Nvidia and AMD GPUs in their computer
    • Jan already supports Vulkan
    • Under the hood, this requires us to switch from llama-cuda-avx2 to llama-vulkan
    • llama.cpp alone has 18 variants at the moment

Cortex needs an elegant way to handle different engine versions + variants without confusing the user. From my naive perspective, there are two key approaches:

Option 1: Every engine is versioned, and maintains a list of variants that it can use

  • Engines are versioned, and each version has several variants that can be chosen from
    • CLI: we would support a nvm-like use command
    • API: /engines API endpoint would have a use endpoint
> cortex engines get llama.cpp
{
    "version": "b3919",
    ...
}

> cortex engines llama.cpp variants list
llama-b3912-bin-win-hip-x64-gfx1030
llama-b3912-bin-win-cuda-cu11.7.1-x64

> cortex engines llama.cpp use llama-b3912-bin-win-cuda-cu11.7.1

Option 2: Every engine version/variant is a first-class Engine citizen

  • We treat every single engine version/variant as a first-class engine citizen (e.g. llama-b3919-avx-cuda)
    • Users will basically run models using a specific engine variant/version
    • cortex engines list will show a massively long list of engines
  • I don't think this is doable, tbh
> cortex engines list

llama.cpp-b3919-cuda
llama.cpp-b3821-vulkan

@dan-menlo (Contributor) commented Oct 24, 2024

@namchuai Merging "Cortex handles Engine Versions":

Cortex handles Engine Versions

Tasklist

  • API Design
  • CLI Design
  • Cortex Stable defines the llama.cpp version

Design

API

CLI

# CLI
> cortex engines update llama.cpp

# API
POST /engines/{engine}/update

Open question: should we allow users to run different versions of llama.cpp?

> cortex engines llama.cpp versions
1. b3919
2. b3909

> cortex engines

Release Management

Cortex Stable and Nightly each define a llama.cpp version that they support

  • cortex update will update llama.cpp to the supported version

Should Cortex Nightly automatically pull the latest llama.cpp, forcing us to fix breakage as it lands?

@namchuai (Contributor) commented Oct 24, 2024

Here's a draft; I'll update it from time to time.

Engine install

$ cortex engines install llama-cpp

This will list stable releases from the cortex.llamacpp repository.

Requirements

  • Support pagination.
  • If a variant is installed, append (Installed).
  • If a variant is in use, append (Current).
  • When removing the currently used engine, reset the current selection to empty/null.
  • When installing an engine, automatically set it as the currently used engine.
  • If the user presses Enter without selecting a version, pick the latest.
  • Allow the user to input a version number, with or without the leading v. E.g. cortex engines install llama-cpp v0.1.36
  • Display the publish time along with each version, in local time.

Sample output

Available versions:
1. v0.1.36 | 2024-10-24T01:36:30Z
2. v0.1.35 | 2024-10-22T04:43:49Z (Installed)
3. v0.1.34 | 2024-10-01T02:53:52Z

Enter number to select: _
After the user selects a version, we ask them to select a variant:

Selected llama-cpp version: v0.1.36
Available variants:

  1. linux-amd64-avx-cuda-11-7 (Installed)
  2. linux-amd64-avx-cuda-12-0 (Recommended)
    ...

Enter number to select: _
Things to consider:

  • Once the user selects a variant, it is automatically set as the one in use.

Question 1: how do we know which engine is set as used?
Question 2: where do we store which engine to use?

  1. cortex engines llama-cpp use
     Lists the downloaded llama-cpp engine variants and versions; the one currently in use is marked (Current):

     1. linux-amd64-avx-cuda-11-7
     2. linux-amd64-avx-cuda-12-0 (Current)
     ...
     Enter number to select: _

  2. cortex engines llama-cpp update

@namchuai mentioned this issue Oct 24, 2024
@dan-menlo (Contributor) commented Oct 30, 2024

@namchuai For this issue, can you make sure we come up with a clear API first (e.g. /engines)?

  • We will need a clear API for choosing an engine variant (e.g. PUT?)
  • This will be used by Jan: the llama.cpp extension will let the user select the variant
  • Additionally, can we run models with a specific engine variant?
  • The CLI "selector" should belong to the CLI binary and call the API
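
For reference, the proposal below expresses "choose a variant" as a POST rather than a PUT; an illustrative call (base URL taken from the local server used later in this thread, values are examples):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/v0.1.37/windows-amd64-avx2-cuda-12-0/default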

@namchuai (Contributor) commented Oct 30, 2024

@dan-homebrew @gabrielle-ong, here are the APIs that I think we will have.

Engine Management API Documentation

Basic Engine Operations

Install Engine Variant

POST /engines/{engine_type}/{version}/{variant}
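
Illustrative call (base URL from the local API server seen elsewhere in this thread; version and variant values are examples):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/v0.1.37/linux-amd64-avx-cuda-12-0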

Uninstall Engine Variant

DELETE /engines/{engine_type}/{version}/{variant}

List Installed Engine Variants

GET /engines/{engine_type}

Response:

[
    {
        "engine": "llama-cpp",
        "name": "mac-arm64",
        "version": "0.1.35-28.10.24"
    },
    {
        "engine": "llama-cpp",
        "name": "linux-amd64-avx",
        "version": "0.1.35-27.10.24"
    }
]

Release Information

List Released Engine Versions

GET /engines/{engine_type}?release=true

Response:

[
    {
        "draft": false,
        "name": "v0.1.37",
        "prerelease": true,
        "published_at": "2024-10-30T03:39:23Z",
        "url": "https://api.github.com/repos/janhq/cortex.llamacpp/releases/182594588"
    },
    {
        "draft": false,
        "name": "v0.1.35-28.10.24",
        "prerelease": true,
        "published_at": "2024-10-28T17:30:48Z",
        "url": "https://api.github.com/repos/janhq/cortex.llamacpp/releases/182309346"
    }
]

List Released Engine Variants

GET /engines/{engine_type}/{version}

Response:

[
    {
        "created_at": "2024-10-28T17:35:51Z",
        "download_count": 0,
        "name": "linux-amd64-avx-cuda-11-7",
        "size": 151240428
    },
    {
        "created_at": "2024-10-28T17:34:05Z",
        "download_count": 0,
        "name": "linux-amd64-avx",
        "size": 1548720
    }
]

Default Engine Management

Get Default Engine Variant

GET /engines/{engine_type}/default

Set Default Engine Variant

POST /engines/{engine_type}/{version}/{variant}/default
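
Illustrative call (values are examples):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/v0.1.37/mac-arm64/default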

Engine Runtime Operations

Load Engine

POST /engines/{engine_type}/load

Uses the variant set as default

Unload Engine

DELETE /engines/{engine_type}/load

Uses the variant set as default
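
Illustrative load/unload round-trip (same base URL assumption as above):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/load
curl -X DELETE http://127.0.0.1:39281/v1/engines/llama-cpp/load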

Update Engine

Updates the current (default) engine variant to the latest version.

POST /engines/{engine_type}/update

Success response:

{
    "engine": "cortex.llamacpp",
    "from": "v0.1.35-28.10.24",
    "to": "0.1.35",
    "variant": "mac-arm64"
}

Failed response:

{
    "message": "Engine cortex.llamacpp, mac-arm64 is already up-to-date! Version v0.1.35"
}

List All Engines and Variants

Get All Engines

GET /engines

Response:

{
    "llama-cpp": [
        {
            "engine": "llama-cpp",
            "name": "mac-arm64",
            "version": "0.1.35-28.10.24"
        },
        {
            "engine": "llama-cpp",
            "name": "linux-amd64-avx",
            "version": "0.1.35-27.10.24"
        },
        {
            "engine": "llama-cpp",
            "name": "linux-amd64-avx",
            "version": "0.1.36"
        },
        {
            "engine": "llama-cpp",
            "name": "linux-amd64-avx2-cuda-12-0",
            "version": "0.1.36"
        }
    ],
    "onnxruntime": [],
    "tensorrt-llm": []
}

@gabrielle-ong (Contributor, Author) commented

@namchuai, @dan-homebrew:
Adding some thoughts/questions on engine management.

Issue:

  • I needed to upgrade cortex.llama-cpp from 0.1.37 to 0.1.37-01.11.24 to test our changes
  • cortex engines install llama-cpp did not install the latest version; it still installed v0.1.37
  • This required a manual cortex engines install llama-cpp -v v0.1.37-01.11.24, and I would not have known there was a new version otherwise

  1. Should we align the cortex.llama-cpp version names with llama.cpp, e.g. 0.1.37-b4033 instead of our date format? https://github.com/ggerganov/llama.cpp/releases
  2. cortex engines update llama-cpp: we should have a way to update engines when new versions are available, and delete the old version
  3. Idea: cortex update can also chain cortex engines update

@github-project-automation bot moved this from Scheduled to Review + QA in Menlo Nov 5, 2024
@gabrielle-ong removed this from the v1.0.2 milestone Nov 6, 2024
@gabrielle-ong changed the title from "planning: Improve Cortex Engine Management" to "epic: Improve Cortex Engine Management" Nov 6, 2024
@gabrielle-ong added this to the v1.0.2 milestone Nov 6, 2024
@TC117 commented Nov 7, 2024

  • Since we have the engines load/unload endpoints, do we also support them in the CLI?
PS C:\WINDOWS\system32> cortex-nightly.exe engines -h
Subcommands for managing engines
Usage:
cortex-nightly.exe engines [options] [subcommand]

Options:
  -h,--help                   Print this help message and exit

Subcommands:
  list                        List all cortex engines
  install                     Install engine
  uninstall                   Uninstall engine
  update                      Update engine
  use                         Set engine as default
  get                         Get engine info
PS C:\WINDOWS\system32> cortex-nightly.exe -v
v1.0.1-227

New Cortex release available: v1.0.1-227 -> v1.0.1-228
To update, run: cortex-nightly.exe update
PS C:\WINDOWS\system32>
  • POST /engines/{engine_type}/{version}/{variant}/default should return a "successfully set default ..." message, not the engine details.
    The cURL I used:
POST http://127.0.0.1:39281/v1/engines/llama-cpp/default?version=v0.1.37&variant=windows-amd64-avx2-cuda-12-0

(screenshot)

@TC117 commented Nov 8, 2024

  • Unload returns an HTML-formatted response
    (screenshot)

  • List Released Engine Variants is not mentioned in the docs
    GET /engines/{engine_type}/{version}
    (screenshot)

@TC117 commented Nov 12, 2024

Hi @namchuai, could you please take a look at the points above?

@gabrielle-ong (Contributor, Author) commented

Hi @namchuai, summarising the list of issues:

  1. Add CLI commands for engine load and unload - this will be useful, as I currently get the error "model is not yet loaded!"
  2. Edit the API response for Default
  3. Edit the API response for Unload engine (404)

I'll work on the docs:

  4. Swagger file cortex.json - add list engines
  5. CLI docs - add the cortex engines commands

@namchuai (Contributor) commented

Add CLI commands for engine load and unload - this will be useful as I get the error "model is not yet loaded!"
Edit the API response for Default
Edit the API response for Unload engine (404)

Sorry @gabrielle-ong @TC117 for the late response; I will work on this list.

@gabrielle-ong (Contributor, Author) commented

Thanks @namchuai!
Can we add the -m flag to the CLI help options?
(screenshot)

@TC117 commented Nov 15, 2024

(screenshot)
GET v1/engines/:name should return variant, not name

@gabrielle-ong (Contributor, Author) commented

Also tracking this follow-up task for the engines API endpoint (move params to the body instead of the path): #1684

@gabrielle-ong modified the milestones: v1.0.4, v1.0.3 Nov 18, 2024
@gabrielle-ong moved this from Review + QA to Completed in Menlo Nov 22, 2024
@gabrielle-ong (Contributor, Author) commented Nov 22, 2024

Thanks @namchuai, marking as complete - released with Cortex 1.0.3 and Jan 0.5.9.
Linked follow-up tasks:
Jan: menloresearch/jan#4025

Cortex:
#1684
#1638
