epic: Improve Cortex Engine Management #1416


Closed · 3 of 6 tasks

gabrielle-ong opened this issue Oct 3, 2024 · 17 comments · Fixed by #1546

@gabrielle-ong (Contributor) commented Oct 3, 2024

Goal

  • Cortex has a clear Engine Abstraction
  • Engines have dependency management
    • Dependency resolution (OS, drivers)
    • Dependency handling (e.g. missing drivers, error messages)
  • Engines have state management
    • Engines have metadata stored in cortex.db

Tasklist

Open Questions

  • Do we allow users to run multiple engines in parallel?
    • Yes, engines can run in parallel
    • Dan: users run models in parallel, and those models run engines

Appendix

Improvements to #1072
(image attachment)

@gabrielle-ong converted this from a draft issue Oct 3, 2024
@dan-menlo changed the title from "epic: Improve Cortex Engine Management" to "epic: Improve Cortex Engine Management and Docs" Oct 13, 2024
@dan-menlo (Contributor) commented

User Journey

  • I download Cortex for the first time
  • Cortex Local Installer or Network installer auto-detects my hardware and installs the appropriate variant(s)
    • Decision: should we install all variants?
    • Decision: what happens if a user has both an AMD GPU and an Nvidia GPU?
  • User downloads model and runs it
    • Tries using default llama.cpp
    • Can switch to other llama.cpp variant and try it (e.g. Vulkan)

@dan-menlo assigned namchuai and unassigned vansangpfiev Oct 14, 2024
@dan-menlo added this to the v1.0.2 milestone Oct 14, 2024
@freelerobot moved this from Investigating to Planning in Menlo Oct 15, 2024
@dan-menlo changed the title from "epic: Improve Cortex Engine Management and Docs" to "planning: Improve Cortex Engine Management and Docs" Oct 19, 2024
@namchuai (Contributor) commented

@dan-homebrew, here's my opinion on this task. I think we should consolidate both #1453 and #1454 into this ticket.

  1. Decision: should we install all variants?
  • I don't think we should install all variants by default.
  2. Decision: what happens if a user has both an AMD GPU and an Nvidia GPU?
  • From my understanding, a user can only have either an AMD or an Nvidia GPU running at a time. So the question is really how to manage engines when the user switches back and forth between AMD and Nvidia GPUs. The use case extends further: the user can unplug the GPU entirely, leaving only the CPU.

So, I think we need a solution at runtime: detect the current hardware state and automatically choose the best engine possible. We also have to provide users a way to specify the engine version and variant they want to run their model with. A rough sketch of the runtime selection follows.
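
A minimal sketch of that runtime selection, assuming a Linux host, nvidia-smi/vulkaninfo as detection probes, and a hypothetical linux-amd64-vulkan variant name (the CUDA/AVX names appear elsewhere in this thread):

# Probe the hardware, then select a variant via the proposed `use` command.
# The probe tools, the vulkan variant name, and the argument shape are assumptions.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  variant="linux-amd64-avx2-cuda-12-0"   # Nvidia GPU present
elif command -v vulkaninfo >/dev/null 2>&1 && vulkaninfo >/dev/null 2>&1; then
  variant="linux-amd64-vulkan"           # AMD or other Vulkan-capable GPU
else
  variant="linux-amd64-avx2"             # CPU only
fi
cortex engines llama-cpp use "$variant"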

Approach

I fully agree with option 1 you proposed in #1453.

File system structure

engines
|--cortex.llamacpp
    |--<variant-version>(e.g. mac-arm64-0.1.34)
        |--libengine.dylib
        |--version.txt
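
With this layout (a sketch; folder names are illustrative), listing installed variants is just a directory listing, since variant and version are encoded in the folder name:

$ ls engines/cortex.llamacpp/
mac-arm64-0.1.34
linux-amd64-avx-0.1.35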

Tasks breakdown

  1. Restructure the engine file structure
  2. Make a small POC for setting the DLL path on Windows, to be sure it works
  3. Allow the engine install command to download and install a specific version and variant
  4. Engine use command: cortex engines llama-cpp use <variant-version> lets the user set the default version and variant for a particular engine (sketched below)
  5. Add a -v flag, cortex engines llama-cpp -v, to show the currently selected (default) engine for llama-cpp
  6. Add a list command, cortex engines llama-cpp [filter], to display a filterable list of installed engines for llama-cpp
  7. Update the engine uninstall command
  8. Update the cortex ps command to display the engine variant & version used to load each model
  9. Create and update the corresponding HTTP APIs
  10. Integrate model loading with this new default-engine logic
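
A rough transcript of items 4–6 (a sketch; output format and names are illustrative):

# Set the default variant-version, then inspect it
$ cortex engines llama-cpp use mac-arm64-0.1.34
$ cortex engines llama-cpp -v
mac-arm64-0.1.34

# List installed variants, optionally filtered
$ cortex engines llama-cpp linux
linux-amd64-avx-0.1.35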

Edge cases

  1. I'm not sure whether multiple variants of the same engine can be loaded at once.
  2. We have to set the correct DLL path for Windows dynamically (see the sketch below).
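
A rough PowerShell sketch of the DLL-path fix (assumptions: the data folder lives under %USERPROFILE%\cortexcpp and the engine DLL resolves through the process PATH; neither is confirmed in this thread):

# Prepend the selected variant folder to PATH before loading the engine,
# so the engine DLL and its dependencies (e.g. CUDA runtime DLLs) resolve.
PS> $variant = "windows-amd64-avx2-cuda-12-0-0.1.36"
PS> $env:PATH = "$env:USERPROFILE\cortexcpp\engines\cortex.llamacpp\$variant;$env:PATH"
PS> cortex.exe run my-model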

@dan-menlo (Contributor) commented Oct 24, 2024

@namchuai Merging "Cortex handles Engine Variants" into this issue:

Cortex handles Engine Variants

Tasklist

  • API Design
  • CLI Design

Questions

Scenario

  • There are some users who have both Nvidia and AMD GPUs in their computer
    • Jan already supports Vulkan
    • Under the hood, this requires us to switch from llama-cuda-avx2 to llama-vulkan
    • llama.cpp alone has 18 variants at the moment

Cortex needs an elegant way to handle different engine versions + variants without confusing the user. From my naive perspective, there are two key approaches:

Option 1: Every engine is versioned, and maintains a list of variants that it can use

  • Engines are versioned, and each version has several variants that can be chosen from
    • CLI: we would support a nvm-like use command
    • API: /engines API endpoint would have a use endpoint
> cortex engines get llama.cpp
{
    "version": "b3919",
    ...
}

> cortex engines llama.cpp variants list
llama-b3912-bin-win-hip-x64-gfx1030
llama-b3912-bin-win-cuda-cu11.7.1-x64

> cortex engines llama.cpp use llama-b3912-bin-win-cuda-cu11.7.1

Option 2: Every engine version/variant is a first-class Engine citizen

  • We treat every single engine version/variant as a first-class engine citizen (e.g. llama-b3919-avx-cuda)
    • Users will basically run models using a specific engine variant/version
    • cortex engines list will show a massively long list of engines
  • I don't think this is doable, tbh
> cortex engines list

llama.cpp-b3919-cuda
llama.cpp-b3821-vulkan

@dan-menlo (Contributor) commented Oct 24, 2024

@namchuai Merging "Cortex handles Engine Versions":

Cortex handles Engine Versions

Tasklist

  • API Design
  • CLI Design
  • Cortex Stable defines the llama.cpp version

Design

API

CLI

# CLI
> cortex engines update llama.cpp

# API
POST /engines/{engine}/update

Open question: should we allow users to run different versions of llama.cpp?

> cortex engines llama.cpp versions
1. b3919
2. b3909

> cortex engines

Release Management

Cortex Stable and Nightly each define a llama.cpp version that they support

  • cortex update will update llama.cpp to the supported version

Should Cortex Nightly automatically pull the latest llama.cpp, forcing us to fix breakage as it lands?

@namchuai (Contributor) commented Oct 24, 2024

Here's a draft; I'll update it from time to time.

Engine install

$ cortex engines install llama-cpp

This will list stable releases from the cortex.llamacpp repository.

Requirements

  • Support pagination.
  • If a variant is installed, append (Installed).
  • If a variant is in use, append (Current).
  • When removing the currently used engine, reset the current selection to empty/null.
  • When installing an engine, automatically set it as the currently used engine.
  • If the user presses Enter without selecting a version, pick the latest.
  • Allow the user to input a version number, with or without the leading v. E.g. cortex engines install llama-cpp v0.1.36
  • Display the publish time along with each version, in local time.

Sample output

Available versions:
1. v0.1.36 | 2024-10-24T01:36:30Z
2. v0.1.35 | 2024-10-22T04:43:49Z (Installed)
3. v0.1.34 | 2024-10-01T02:53:52Z

Enter number to select: _
After the user selects a version, we ask them to select a variant:

Selected llama-cpp version: v0.1.36
Available variants:

  1. linux-amd64-avx-cuda-11-7 (Installed)
  2. linux-amd64-avx-cuda-12-0 (Recommended)
    ...

Enter number to select: _
Things to consider:

  • Once the user selects a variant, it is automatically set as the one in use.

Question 1: how do we know which engine is set as used?
Question 2: where do we store which engine to use?

  1. cortex engines llama-cpp use
     Lists the downloaded llama-cpp engine variants and versions; the one currently in use is marked (Current):

     1. linux-amd64-avx-cuda-11-7
     2. linux-amd64-avx-cuda-12-0 (Current)
     ...
     Enter number to select: _

  2. cortex engines llama-cpp update

@namchuai mentioned this issue Oct 24, 2024
@dan-menlo (Contributor) commented Oct 30, 2024

@namchuai For this issue, can you make sure we come up with a clear API first (e.g. /engines)?

  • We will need a clear API for choosing an engine variant (e.g. PUT?)
  • This will be used by Jan: the llama.cpp extension will let the user select the variant
  • Additionally, can we run models with a specific engine variant?
  • The CLI "selector" should belong to the CLI binary and call the API
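
For reference, the proposal below expresses "choose a variant" as a POST rather than a PUT; an illustrative call (base URL taken from the local server used later in this thread, values are examples):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/v0.1.37/windows-amd64-avx2-cuda-12-0/default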

@namchuai (Contributor) commented Oct 30, 2024

@dan-homebrew @gabrielle-ong, here are the APIs that I think we will have.

Engine Management API Documentation

Basic Engine Operations

Install Engine Variant

POST /engines/{engine_type}/{version}/{variant}
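
Illustrative call (base URL from the local API server seen elsewhere in this thread; version and variant values are examples):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/v0.1.37/linux-amd64-avx-cuda-12-0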

Uninstall Engine Variant

DELETE /engines/{engine_type}/{version}/{variant}

List Installed Engine Variants

GET /engines/{engine_type}

Response:

[
    {
        "engine": "llama-cpp",
        "name": "mac-arm64",
        "version": "0.1.35-28.10.24"
    },
    {
        "engine": "llama-cpp",
        "name": "linux-amd64-avx",
        "version": "0.1.35-27.10.24"
    }
]

Release Information

List Released Engine Versions

GET /engines/{engine_type}?release=true

Response:

[
    {
        "draft": false,
        "name": "v0.1.37",
        "prerelease": true,
        "published_at": "2024-10-30T03:39:23Z",
        "url": "https://api.github.com/repos/janhq/cortex.llamacpp/releases/182594588"
    },
    {
        "draft": false,
        "name": "v0.1.35-28.10.24",
        "prerelease": true,
        "published_at": "2024-10-28T17:30:48Z",
        "url": "https://api.github.com/repos/janhq/cortex.llamacpp/releases/182309346"
    }
]

List Released Engine Variants

GET /engines/{engine_type}/{version}

Response:

[
    {
        "created_at": "2024-10-28T17:35:51Z",
        "download_count": 0,
        "name": "linux-amd64-avx-cuda-11-7",
        "size": 151240428
    },
    {
        "created_at": "2024-10-28T17:34:05Z",
        "download_count": 0,
        "name": "linux-amd64-avx",
        "size": 1548720
    }
]

Default Engine Management

Get Default Engine Variant

GET /engines/{engine_type}/default

Set Default Engine Variant

POST /engines/{engine_type}/{version}/{variant}/default
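
Illustrative call (values are examples):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/v0.1.37/mac-arm64/default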

Engine Runtime Operations

Load Engine

POST /engines/{engine_type}/load

Uses the variant set as default

Unload Engine

DELETE /engines/{engine_type}/load

Uses the variant set as default
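
Illustrative load/unload round-trip (same base URL assumption as above):

curl -X POST http://127.0.0.1:39281/v1/engines/llama-cpp/load
curl -X DELETE http://127.0.0.1:39281/v1/engines/llama-cpp/load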

Update Engine

Updates the current (default) engine variant to the latest version.

POST /engines/{engine_type}/update

Success response:

{
    "engine": "cortex.llamacpp",
    "from": "v0.1.35-28.10.24",
    "to": "0.1.35",
    "variant": "mac-arm64"
}

Failed response:

{
    "message": "Engine cortex.llamacpp, mac-arm64 is already up-to-date! Version v0.1.35"
}

List All Engines and Variants

Get All Engines

GET /engines

Response:

{
    "llama-cpp": [
        {
            "engine": "llama-cpp",
            "name": "mac-arm64",
            "version": "0.1.35-28.10.24"
        },
        {
            "engine": "llama-cpp",
            "name": "linux-amd64-avx",
            "version": "0.1.35-27.10.24"
        },
        {
            "engine": "llama-cpp",
            "name": "linux-amd64-avx",
            "version": "0.1.36"
        },
        {
            "engine": "llama-cpp",
            "name": "linux-amd64-avx2-cuda-12-0",
            "version": "0.1.36"
        }
    ],
    "onnxruntime": [],
    "tensorrt-llm": []
}

@gabrielle-ong (Contributor, Author) commented

@namchuai, @dan-homebrew:
Adding some thoughts/questions on engine management.

Issue:

  • I needed to upgrade cortex.llama-cpp from 0.1.37 to 0.1.37-01.11.24 to test our changes
  • cortex engines install llama-cpp did not install the latest version; it still installed v0.1.37
  • This required a manual cortex engines install llama-cpp -v v0.1.37-01.11.24, and I would not have known there was a new version otherwise

  1. Should we align the cortex.llama-cpp version names with llama.cpp, e.g. 0.1.37-b4033 instead of our date format? https://github.com/ggerganov/llama.cpp/releases
  2. cortex engines update llama-cpp: we should have a way to update engines when new versions are available, and delete the old version
  3. Idea: cortex update can also chain cortex engines update

@github-project-automation bot moved this from Scheduled to Review + QA in Menlo Nov 5, 2024
@gabrielle-ong removed this from the v1.0.2 milestone Nov 6, 2024
@gabrielle-ong changed the title from "planning: Improve Cortex Engine Management" to "epic: Improve Cortex Engine Management" Nov 6, 2024
@gabrielle-ong added this to the v1.0.2 milestone Nov 6, 2024
@TC117 commented Nov 7, 2024

  • Since we have the engines load/unload endpoints, do we also support them in the CLI?
PS C:\WINDOWS\system32> cortex-nightly.exe engines -h
Subcommands for managing engines
Usage:
cortex-nightly.exe engines [options] [subcommand]

Options:
  -h,--help                   Print this help message and exit

Subcommands:
  list                        List all cortex engines
  install                     Install engine
  uninstall                   Uninstall engine
  update                      Update engine
  use                         Set engine as default
  get                         Get engine info
PS C:\WINDOWS\system32> cortex-nightly.exe -v
v1.0.1-227

New Cortex release available: v1.0.1-227 -> v1.0.1-228
To update, run: cortex-nightly.exe update
PS C:\WINDOWS\system32>
  • POST /engines/{engine_type}/{version}/{variant}/default should return a "successfully set default ..." message, not the engine details.
    The cURL I used:
POST http://127.0.0.1:39281/v1/engines/llama-cpp/default?version=v0.1.37&variant=windows-amd64-avx2-cuda-12-0

(screenshot)

@TC117 commented Nov 8, 2024

  • Unload returns an HTML-formatted response
    (screenshot)

  • List Released Engine Variants is not mentioned in the docs
    GET /engines/{engine_type}/{version}
    (screenshot)

@TC117 commented Nov 12, 2024

Hi @namchuai, could you please take a look at the points above?

@gabrielle-ong (Contributor, Author) commented

Hi @namchuai, summarising the list of issues:

  1. Add CLI commands for engine load and unload - this will be useful, as I currently get the error "model is not yet loaded!"
  2. Edit the API response for Default
  3. Edit the API response for Unload engine (404)

I'll work on the docs:

  4. Swagger file cortex.json - add list engines
  5. CLI docs - add the cortex engines commands

@namchuai (Contributor) commented

Add CLI commands for engine load and unload - this will be useful as I get the error "model is not yet loaded!"
Edit the API response for Default
Edit the API response for Unload engine (404)

Sorry @gabrielle-ong @TC117 for the late response; I will work on this list.

@gabrielle-ong (Contributor, Author) commented

Thanks @namchuai!
Can we add the -m flag to the CLI help options?
(screenshot)

@TC117 commented Nov 15, 2024

(screenshot)
GET v1/engines/:name should return variant, not name

@gabrielle-ong (Contributor, Author) commented

Also tracking this follow-up task for the engines API endpoint (move params to the body instead of the path): #1684

@gabrielle-ong modified the milestones: v1.0.4, v1.0.3 Nov 18, 2024
@gabrielle-ong moved this from Review + QA to Completed in Menlo Nov 22, 2024
@gabrielle-ong (Contributor, Author) commented Nov 22, 2024

Thanks @namchuai, marking as complete - released with Cortex 1.0.3 and Jan 0.5.9.
Linked follow-up tasks:
Jan: menloresearch/jan#4025

Cortex:
#1684
#1638
