
Commit 0c3f25a

Merge branch 'main' of github.com:huggingface/new-model-addition-meta into final-version

2 parents 90d5876 + 3249c5d

437 files changed: +13763 additions, -6063 deletions


.github/ISSUE_TEMPLATE/bug-report.yml

Lines changed: 2 additions & 2 deletions
@@ -48,11 +48,11 @@ body:
  - pipelines: @Rocketknight1
  - tensorflow: @gante and @Rocketknight1
  - tokenizers: @ArthurZucker and @itazap
- - trainer: @muellerzr @SunMarc
+ - trainer: @zach-huggingface @SunMarc

  Integrations:

- - deepspeed: HF Trainer/Accelerate: @muellerzr
+ - deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
  - ray/raytune: @richardliaw, @amogkam
  - Big Model Inference: @SunMarc
  - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 2 deletions
@@ -51,12 +51,12 @@ Library:
  - pipelines: @Rocketknight1
  - tensorflow: @gante and @Rocketknight1
  - tokenizers: @ArthurZucker
- - trainer: @muellerzr and @SunMarc
+ - trainer: @zach-huggingface and @SunMarc
  - chat templates: @Rocketknight1

  Integrations:

- - deepspeed: HF Trainer/Accelerate: @muellerzr
+ - deepspeed: HF Trainer/Accelerate: @SunMarc @zach-huggingface
  - ray/raytune: @richardliaw, @amogkam
  - Big Model Inference: @SunMarc
  - quantization (bitsandbytes, autogpt): @SunMarc @MekkCyber

.github/scripts/codeowners_for_review_action

Lines changed: 3 additions & 3 deletions
@@ -14,7 +14,7 @@ docs/ @stevhliu
 # Owners of subsections of the library
 /src/transformers/generation/ @gante
 /src/transformers/pipeline/ @Rocketknight1 @yonigozlan
-/src/transformers/integrations/ @SunMarc @MekkCyber @muellerzr
+/src/transformers/integrations/ @SunMarc @MekkCyber @zach-huggingface
 /src/transformers/quantizers/ @SunMarc @MekkCyber
 tests/ @ydshieh
 tests/generation/ @gante
@@ -27,8 +27,8 @@ tests/generation/ @gante
 # Specific files come after the sections/globs, so they take priority
 /.circleci/config.yml @ArthurZucker @ydshieh
 /utils/tests_fetcher.py @ydshieh
-trainer.py @muellerzr @SunMarc
-trainer_utils.py @muellerzr @SunMarc
+trainer.py @zach-huggingface @SunMarc
+trainer_utils.py @zach-huggingface @SunMarc
 /utils/modular_model_converter.py @Cyrilvallez @ArthurZucker

 # Owners of individual models are specific / high priority, and so they come last

benchmark/README.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ def run_benchmark(logger: Logger, branch: str, commit_id: str, commit_msg: str,

 ## Writing metrics to the database

-`MetricRecorder` is thread-safe, in the sense of the python [`Thread`](https://docs.python.org/3/library/threading.html#threading.Thread). This means you can start a background thread to do the readings on the device measurements while not blocking the main thread to execute the model measurements.
+`MetricsRecorder` is thread-safe, in the sense of the python [`Thread`](https://docs.python.org/3/library/threading.html#threading.Thread). This means you can start a background thread to do the readings on the device measurements while not blocking the main thread to execute the model measurements.

 cf [`llama.py`](./llama.py) to see an example of this in practice.
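The sentence fixed above describes the intended usage pattern: device readings run on a background thread while the main thread times the model. Below is a minimal sketch of that pattern, not the benchmark's actual code; the `collect_device_measurements` method name is an assumption for illustration.

```python
import threading
import time
from time import perf_counter

def read_device_measurements(recorder, stop_event, interval_s=0.1):
    # Background thread: poll device readings until asked to stop.
    # `collect_device_measurements` is a hypothetical, thread-safe method.
    while not stop_event.is_set():
        recorder.collect_device_measurements()
        time.sleep(interval_s)

def run_with_background_readings(recorder, run_model):
    # Main thread: time the model while the reader thread runs concurrently.
    stop_event = threading.Event()
    reader = threading.Thread(
        target=read_device_measurements, args=(recorder, stop_event), daemon=True
    )
    reader.start()
    start = perf_counter()
    run_model()
    elapsed = perf_counter() - start
    stop_event.set()
    reader.join()
    return elapsed
```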

benchmark/benchmarks_entrypoint.py

Lines changed: 0 additions & 1 deletion
@@ -3,7 +3,6 @@
 import logging
 import os
 from typing import Dict
-import psycopg2
 import sys

 from psycopg2.extras import Json

benchmark/llama.py

Lines changed: 8 additions & 8 deletions
@@ -204,7 +204,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
 time_to_first_token = end - start
 logger.info(f"completed first compile generation in: {time_to_first_token}s")
 cache_position += 1
-all_generated_tokens += next_token.clone().detach().cpu().tolist()
+all_generated_tokens += next_token.tolist()

 cache_position = torch.tensor([seq_length], device=device)
 ### First compile, decoding
@@ -215,9 +215,9 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
 torch.cuda.synchronize()
 end = perf_counter()
 time_to_second_token = end - start
-logger.info(f"completed second compile generation in: {time_to_first_token}s")
+logger.info(f"completed second compile generation in: {time_to_second_token}s")
 cache_position += 1
-all_generated_tokens += next_token.clone().detach().cpu().tolist()
+all_generated_tokens += next_token.tolist()

 ### Second compile, decoding
 start = perf_counter()
@@ -227,15 +227,15 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
 torch.cuda.synchronize()
 end = perf_counter()
 time_to_third_token = end - start
-logger.info(f"completed third compile forward in: {time_to_first_token}s")
+logger.info(f"completed third compile forward in: {time_to_third_token}s")
 cache_position += 1
-all_generated_tokens += next_token.clone().detach().cpu().tolist()
+all_generated_tokens += next_token.tolist()

 ### Using cuda graphs decoding

 start = perf_counter()
 for _ in range(1, num_tokens_to_generate):
-    all_generated_tokens += next_token.clone().detach().cpu().tolist()
+    all_generated_tokens += next_token.tolist()
     next_token = decode_one_token(
         model, next_token.clone(), cache_position=cache_position, past_key_values=past_key_values
     )
@@ -298,7 +298,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
 output = model.generate(**inputs, past_key_values=past_key_values)
 end = perf_counter()
 third_compile_generate_time = end - start
-logger.info(f"completed second compile generation in: {third_compile_generate_time}s")
+logger.info(f"completed third compile generation in: {third_compile_generate_time}s")
 logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")

 past_key_values = StaticCache(
@@ -313,7 +313,7 @@ def decode_one_token(model, cur_token, cache_position, past_key_values):
 output = model.generate(**inputs, past_key_values=past_key_values)
 end = perf_counter()
 fourth_compile_generate_time = end - start
-logger.info(f"completed second compile generation in: {fourth_compile_generate_time}s")
+logger.info(f"completed fourth compile generation in: {fourth_compile_generate_time}s")
 logger.info(f"generated: {tokenizer.batch_decode(output.cpu().tolist())}")

 metrics_recorder.collect_model_measurements(
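The recurring change from `.clone().detach().cpu().tolist()` to `.tolist()` above works because `torch.Tensor.tolist()` already copies the values to host Python objects, including for CUDA tensors; a quick sanity-check sketch:

```python
import torch

t = torch.arange(3)
if torch.cuda.is_available():
    t = t.cuda()  # the benchmark runs on GPU, but the equivalence holds either way

# .tolist() detaches and moves data to host on its own, so the longer chain is redundant.
assert t.tolist() == t.clone().detach().cpu().tolist() == [0, 1, 2]
```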

conftest.py

Lines changed: 0 additions & 4 deletions
@@ -46,10 +46,6 @@
     "test_keep_in_fp32_modules",
     "test_gradient_checkpointing_backward_compatibility",
     "test_gradient_checkpointing_enable_disable",
-    "test_save_load_fast_init_from_base",
-    "test_fast_init_context_manager",
-    "test_fast_init_tied_embeddings",
-    "test_save_load_fast_init_to_base",
     "test_torch_save_load",
     "test_initialization",
     "test_forward_signature",

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
@@ -415,6 +415,8 @@
   title: DeBERTa
 - local: model_doc/deberta-v2
   title: DeBERTa-v2
+- local: model_doc/deepseek_v3
+  title: DeepSeek-V3
 - local: model_doc/dialogpt
   title: DialoGPT
 - local: model_doc/diffllama
@@ -603,6 +605,10 @@
   title: Qwen2
 - local: model_doc/qwen2_moe
   title: Qwen2MoE
+- local: model_doc/qwen3
+  title: Qwen3
+- local: model_doc/qwen3_moe
+  title: Qwen3MoE
 - local: model_doc/rag
   title: RAG
 - local: model_doc/realm

docs/source/en/attention_interface.md

Lines changed: 30 additions & 8 deletions
@@ -23,13 +23,13 @@ supported models.
 Most recent models can now switch from one attention function used in the Attention layer to the other, thanks to a simple mapping.
 By default, we provide the implementation for [`sdpa`](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html),
 [`flash_attention_2`](https://github.com/Dao-AILab/flash-attention) and [`flex_attention`](https://pytorch.org/docs/stable/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention)
-as well as `eager`, which is simple matrix multiplication without any optimization on top.
+as well as `eager`, which is a simple matrix multiplication without any optimization on top.
 This is the setting you can usually choose when instantiating a model:

 ```python
 from transformers import AutoModelForCausalLM

-model_id = "meta-llama/Llama-3.2-1B
+model_id = "meta-llama/Llama-3.2-1B"

 # Here, using flash attention as an example
 model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
@@ -43,7 +43,7 @@ from transformers import AutoModelForCausalLM, AttentionInterface
 from transformers.integrations.sdpa_attention import sdpa_attention_forward
 import torch

-model_id = "meta-llama/Llama-3.2-1B
+model_id = "meta-llama/Llama-3.2-1B"

 def my_new_sdpa(*args, **kwargs):
     print("I just entered the attention computation")
@@ -56,7 +56,7 @@ model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="my_n
 model(torch.ones(1, 5, dtype=int))
 ```

-You will see it prints "I just entered the attention computation" as many times as there are layers in the model (with this example, 16 times.
+You will see it prints "I just entered the attention computation" as many times as there are layers in the model (with this example, 16 times).

 ## Dynamically switching attention function

@@ -70,12 +70,12 @@ model(torch.ones(1, 5, dtype=int))
 ```

 and it will stop printing the statements, as it now uses the `sdpa` attention.
-This allows to quickly change attention function, without needing to reload the model!
+This allows to quickly change an attention function, without needing to reload the model!

-## What about new args needed in my custom function?
+## What about new args needed in my custom attention function?

 But indeed, what if the new function requires a new arg to be properly used? It's no issue! Models supporting the
-`AttentionInterface` propagates kwargs all the way to the Attention layers, and to the attention function used. That way,
+`AttentionInterface` propagate kwargs all the way to the Attention layers, and to the used attention function. That way,
 you can simply pass the arg (as a kwargs, i.e. you need to qualify the name of the arg) in the model's forward, and it will be correctly used in the attention. However, custom attention functions have some limitations. In particular, it must follow the signature and return format of other attention functions, i.e.

 ```python
@@ -103,4 +103,26 @@ model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="cust
 model(torch.ones(1, 5, dtype=int), a_new_kwargs=..., another_new_kwargs=...)
 ```

-If in doubt about what args/kwargs a given model sends to the attention function, simply check that model's modeling code on [GitHub](https://github.com/huggingface/transformers/tree/main/src/transformers/models)!
+If in doubt about what args/kwargs a given model sends to the attention function, simply check that model's modeling code on [GitHub](https://github.com/huggingface/transformers/tree/main/src/transformers/models)!
+
+## Accessing current available implementations
+
+Most of the time, you will simply need to `register` a new function. If, however, you need to access an existing one,
+and/or perform a few checks, the prefered way is to use the global `ALL_ATTENTION_FUNCTIONS`. It behaves the same way you
+would expect from a usual Python dictionary:
+
+```python
+>>> from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
+
+>>> list(ALL_ATTENTION_FUNCTIONS.keys())
+>>> ['flash_attention_2', 'flex_attention', 'sdpa']
+
+>>> ALL_ATTENTION_FUNCTIONS["sdpa"]
+>>> <function transformers.integrations.sdpa_attention.sdpa_attention_forward>
+
+>>> ALL_ATTENTION_FUNCTIONS.get("sdpa", None)
+>>> <function transformers.integrations.sdpa_attention.sdpa_attention_forward>
+
+# You can also globally `register` a new function directly on it
+>>> ALL_ATTENTION_FUNCTIONS.register("new_func", new_func)
+```
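For reference, the registration flow documented in this file, assembled from the snippets in the diff above into one runnable sketch (illustrative only; it assumes the `AttentionInterface.register` API and the `my_new_sdpa` example shown earlier in this file):

```python
import torch
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

def my_new_sdpa(*args, **kwargs):
    # Wrap the stock SDPA implementation and add a side effect.
    print("I just entered the attention computation")
    return sdpa_attention_forward(*args, **kwargs)

# Register globally, then select it per model via `attn_implementation`.
AttentionInterface.register("my_new_sdpa", my_new_sdpa)

model_id = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="my_new_sdpa")
model(torch.ones(1, 5, dtype=int))  # prints once per attention layer
```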

docs/source/en/index.md

Lines changed: 0 additions & 1 deletion
@@ -43,4 +43,3 @@ Transformers is designed for developers and machine learning engineers and resea
 </a>
 </div>

-Join us on the Hugging Face [Hub](https://huggingface.co/), [Discord](https://discord.com/invite/JfAtkvEtRb), or [forum](https://discuss.huggingface.co/) to collaborate and build models, datasets, and applications together.
