Refactor turbomind (low-level abstractions) #3423


Merged
merged 60 commits into InternLM:main on Apr 22, 2025
Conversation

lzhangzz (Collaborator)

No description provided.

@lvhan028 (Collaborator)

Evaluation results on some chat models

| dataset | version | metric | mode | qwen2.5-0.5b-instruct-turbomind-wo | qwen2.5-7b-instruct-turbomind-wo | qwen2.5-32b-instruct-turbomind-wo | qwen2.5-72b-instruct-turbomind-wo | internlm2_5-7b-chat-turbomind-wo | llama-3_1-8b-instruct-turbomind-wo |
|---|---|---|---|---|---|---|---|---|---|
| Language | - | - | - | - | - | - | - | - | - |
| race-high | bd3f33 | accuracy | gen | 41.91 | 85.19 | 91.08 | 90.74 | 86.62 | 82.22 |
| ARC-c | 926652 | accuracy | gen | 42.71 | 90.51 | 95.25 | 96.61 | 88.81 | 87.46 |
| BoolQ | 1d56df | accuracy | gen | 59.02 | 86.48 | 87.95 | 89.27 | 88.26 | 85.99 |
| triviaqa_wiki_1shot | bc5f21 | score | gen | 30.81 | 53.27 | 61.35 | 72.54 | 65.16 | 79.43 |
| nq_open_1shot | 2e45e5 | score | gen | 8.50 | 17.48 | 20.89 | 27.20 | 22.63 | 34.90 |
| mmmlu_lite | - | naive_average | gen | 27.98 | 54.89 | 60.59 | 68.99 | 44.49 | 43.77 |
| - | - | - | - | - | - | - | - | - | - |
| General Reasoning | - | - | - | - | - | - | - | - | - |
| drop | 3857b0 | accuracy | gen | 34.38 | 80.25 | 88.08 | 87.55 | 77.29 | 81.45 |
| bbh | - | naive_average | gen | 30.23 | 68.16 | 84.49 | 84.30 | 73.45 | 67.93 |
| GPQA_diamond | 5aeece | accuracy | gen | 12.63 | 37.37 | 45.45 | 50.00 | 31.31 | 25.76 |
| hellaswag | e42710 | accuracy | gen | 39.63 | 85.40 | 92.21 | 92.71 | 94.82 | 76.98 |
| TheoremQA | - | - | - | - | - | - | - | - | - |
| musr_average | - | naive_average | gen | 34.86 | 42.12 | 53.23 | 48.80 | 49.59 | 54.32 |
| korbench_single | - | naive_average | gen | 11.20 | 41.44 | 56.24 | 52.88 | 32.72 | 44.96 |
| ARC_Prize_Public_Evaluation | 872059 | accuracy | gen | 0.00 | 0.00 | 0.06 | 0.09 | 0.01 | 0.01 |
| - | - | - | - | - | - | - | - | - | - |
| Math Calculation | - | - | - | - | - | - | - | - | - |
| gsm8k | 6e39a4 | accuracy | gen | 47.92 | 92.65 | 95.38 | 95.30 | 87.11 | 84.46 |
| GaokaoBench | - | weighted_average | gen | 32.34 | 80.69 | 90.41 | 90.46 | 78.25 | 49.58 |
| math | 11c4b5 | accuracy | gen | 32.86 | 73.60 | 80.96 | 81.64 | 61.04 | 49.52 |
| cmo_fib | ace24b | accuracy | gen | 1.92 | 24.52 | 31.25 | 33.65 | 12.98 | 5.77 |
| aime2024 | 6e39a4 | accuracy | gen | 0.00 | 13.33 | 23.33 | 26.67 | 3.33 | 3.33 |
| Mathbench | - | naive_average | gen | 19.25 | 77.42 | 84.62 | 85.03 | 64.10 | 51.92 |
| - | - | - | - | - | - | - | - | - | - |
| Knowledge | - | - | - | - | - | - | - | - | - |
| wikibench-wiki-single_choice_cncircular | 0978ad | perf_4 | gen | 7.60 | 34.95 | 44.70 | 49.85 | 31.45 | 26.90 |
| cmmlu | - | naive_average | gen | 31.59 | 75.83 | 82.91 | 84.85 | 74.51 | 53.69 |
| mmlu | - | naive_average | gen | 34.86 | 76.31 | 84.24 | 86.38 | 70.69 | 71.71 |
| mmlu_pro | - | naive_average | gen | 13.65 | 56.19 | 68.20 | 71.40 | 45.27 | 48.30 |
| - | - | - | - | - | - | - | - | - | - |
| mmlu | - | naive_average | gen | 34.86 | 76.31 | 84.24 | 86.38 | 70.69 | 71.71 |
| mmlu-stem | - | naive_average | gen | 30.27 | 78.26 | 86.04 | 87.83 | 68.84 | 67.58 |
| mmlu-social-science | - | naive_average | gen | 39.91 | 79.03 | 87.15 | 88.63 | 75.85 | 77.15 |
| mmlu-humanities | - | naive_average | gen | 34.66 | 72.13 | 82.67 | 85.51 | 68.33 | 71.81 |
| mmlu-other | - | naive_average | gen | 37.09 | 75.12 | 80.48 | 83.04 | 70.97 | 72.62 |
| - | - | - | - | - | - | - | - | - | - |
| cmmlu | - | naive_average | gen | 31.59 | 75.83 | 82.91 | 84.85 | 74.51 | 53.69 |
| cmmlu-stem | - | naive_average | gen | 24.78 | 72.84 | 80.49 | 82.44 | 66.46 | 47.99 |
| cmmlu-social-science | - | naive_average | gen | 31.93 | 75.58 | 83.19 | 84.54 | 75.59 | 53.51 |
| cmmlu-humanities | - | naive_average | gen | 34.72 | 76.89 | 84.10 | 86.62 | 78.95 | 55.63 |
| cmmlu-other | - | naive_average | gen | 36.11 | 78.68 | 84.21 | 86.51 | 78.20 | 58.72 |
| cmmlu-china-specific | - | naive_average | gen | 32.38 | 72.70 | 80.79 | 82.93 | 73.48 | 48.02 |
| - | - | - | - | - | - | - | - | - | - |
| mmlu_pro | - | naive_average | gen | 13.65 | 56.19 | 68.20 | 71.40 | 45.27 | 48.30 |
| mmlu_pro_biology | 58fe7c | accuracy | gen | 26.08 | 73.36 | 84.10 | 82.71 | 68.62 | 68.76 |
| mmlu_pro_business | 58fe7c | accuracy | gen | 10.52 | 68.06 | 75.03 | 78.83 | 49.43 | 52.34 |
| mmlu_pro_chemistry | 58fe7c | accuracy | gen | 7.16 | 57.86 | 70.32 | 73.85 | 35.78 | 38.69 |
| mmlu_pro_computer_science | 58fe7c | accuracy | gen | 12.68 | 60.00 | 72.20 | 76.59 | 48.29 | 48.54 |
| mmlu_pro_economics | 58fe7c | accuracy | gen | 22.51 | 64.81 | 76.90 | 78.91 | 56.64 | 58.29 |
| mmlu_pro_engineering | 58fe7c | accuracy | gen | 6.09 | 42.62 | 56.14 | 57.17 | 30.55 | 26.42 |
| mmlu_pro_health | 58fe7c | accuracy | gen | 13.94 | 54.77 | 66.26 | 71.03 | 46.58 | 57.58 |
| mmlu_pro_history | 58fe7c | accuracy | gen | 10.76 | 46.98 | 58.79 | 64.57 | 41.47 | 46.98 |
| mmlu_pro_law | 58fe7c | accuracy | gen | 11.53 | 27.88 | 41.60 | 47.77 | 23.89 | 32.52 |
| mmlu_pro_math | 58fe7c | accuracy | gen | 10.51 | 73.21 | 81.87 | 83.12 | 54.18 | 52.04 |
| mmlu_pro_philosophy | 58fe7c | accuracy | gen | 13.83 | 44.29 | 59.12 | 62.12 | 38.48 | 41.28 |
| mmlu_pro_physics | 58fe7c | accuracy | gen | 8.93 | 57.66 | 72.44 | 74.52 | 37.11 | 42.34 |
| mmlu_pro_psychology | 58fe7c | accuracy | gen | 22.81 | 62.28 | 75.06 | 77.32 | 57.64 | 61.78 |
| mmlu_pro_other | 58fe7c | accuracy | gen | 13.74 | 52.81 | 64.94 | 71.10 | 45.13 | 48.59 |
| - | - | - | - | - | - | - | - | - | - |
| mmmlu_lite | - | naive_average | gen | 27.98 | 54.89 | 60.59 | 68.99 | 44.49 | 43.77 |
| openai_mmmlu_lite_AR-XY | 07891e | accuracy | gen | 22.67 | 38.81 | 4.70 | 47.51 | 17.05 | 19.09 |
| openai_mmmlu_lite_BN-BD | 0c33f9 | accuracy | gen | 9.12 | 42.25 | 12.77 | 68.63 | 27.23 | 29.05 |
| openai_mmmlu_lite_DE-DE | 2f4124 | accuracy | gen | 31.58 | 60.21 | 74.95 | 75.72 | 50.88 | 26.46 |
| openai_mmmlu_lite_ES-LA | 555bbc | accuracy | gen | 34.25 | 66.53 | 76.21 | 76.91 | 56.77 | 56.84 |
| openai_mmmlu_lite_FR-FR | 97f4e3 | accuracy | gen | 34.39 | 67.09 | 78.25 | 79.37 | 58.18 | 56.70 |
| openai_mmmlu_lite_HI-IN | 94cd28 | accuracy | gen | 19.09 | 50.25 | 65.05 | 73.75 | 30.11 | 51.30 |
| openai_mmmlu_lite_ID-ID | b65293 | accuracy | gen | 28.42 | 61.68 | 75.44 | 77.26 | 50.81 | 53.54 |
| openai_mmmlu_lite_IT-IT | bf8b68 | accuracy | gen | 33.19 | 65.33 | 77.05 | 79.44 | 50.74 | 56.63 |
| openai_mmmlu_lite_JA-JP | 45635b | accuracy | gen | 29.68 | 61.12 | 73.96 | 77.75 | 50.39 | 53.40 |
| openai_mmmlu_lite_KO-KR | bd4d2a | accuracy | gen | 29.89 | 61.19 | 72.70 | 74.32 | 43.86 | 51.65 |
| openai_mmmlu_lite_PT-BR | 0b103d | accuracy | gen | 28.84 | 54.88 | 71.16 | 66.60 | 57.40 | 24.28 |
| openai_mmmlu_lite_SW-KE | 75318c | accuracy | gen | 26.95 | 36.49 | 49.40 | 49.40 | 32.49 | 43.30 |
| openai_mmmlu_lite_YO-NG | 75318c | accuracy | gen | 25.82 | 32.49 | 37.75 | 39.44 | 31.79 | 31.86 |
| openai_mmmlu_lite_ZH-CN | 14e7b8 | accuracy | gen | 37.82 | 70.11 | 78.81 | 79.72 | 65.12 | 58.67 |
| - | - | - | - | - | - | - | - | - | - |
| ###### MathBench-A: Application Part ###### | - | - | - | - | - | - | - | - | - |
| college | - | naive_average | gen | 4.00 | 47.33 | 66.33 | 65.00 | 19.33 | 16.00 |
| high | - | naive_average | gen | 13.33 | 59.67 | 70.67 | 70.00 | 47.33 | 28.33 |
| middle | - | naive_average | gen | 18.67 | 78.33 | 84.67 | 87.33 | 59.67 | 43.67 |
| primary | - | naive_average | gen | 20.00 | 84.33 | 87.33 | 88.33 | 73.33 | 65.00 |
| arithmetic | - | naive_average | gen | 36.00 | 75.00 | 79.33 | 82.33 | 58.00 | 59.00 |
| mathbench-a (average) | - | naive_average | gen | 18.40 | 68.93 | 77.67 | 78.60 | 51.53 | 42.40 |
| ###### MathBench-T: Theory Part ###### | - | - | - | - | - | - | - | - | - |
| college_knowledge | - | naive_average | gen | 13.29 | 83.70 | 90.35 | 90.66 | 68.04 | 62.34 |
| high_knowledge | - | naive_average | gen | 14.04 | 80.04 | 87.78 | 88.46 | 70.87 | 54.12 |
| middle_knowledge | - | naive_average | gen | 22.41 | 85.55 | 92.86 | 91.22 | 79.26 | 59.42 |
| primary_knowledge | - | naive_average | gen | 30.70 | 94.34 | 95.28 | 95.52 | 88.49 | 69.89 |
| mathbench-t (average) | - | naive_average | gen | 20.11 | 85.91 | 91.57 | 91.47 | 76.66 | 61.44 |
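
The columns above correspond to models served with the TurboMind backend. For reference, here is a minimal, hypothetical sketch of loading one of these models through lmdeploy's Python `pipeline` API; the model path and engine parameters are illustrative assumptions, not the exact settings used for these runs.

```python
# Minimal sketch (not part of this PR): run one of the evaluated models with
# the TurboMind backend via lmdeploy. Model path, tp and session_len are
# illustrative assumptions, not the configuration used for the table above.
from lmdeploy import pipeline, TurbomindEngineConfig

if __name__ == "__main__":
    engine_cfg = TurbomindEngineConfig(
        tp=1,              # tensor-parallel degree (assumed)
        session_len=8192,  # maximum context length (assumed)
    )
    pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
    responses = pipe(["Briefly explain what a KV cache is."])
    print(responses[0].text)
```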

lvhan028 merged commit c6e7fd5 into InternLM:main on Apr 22, 2025
9 of 10 checks passed