
Commit e8801bd

Degraded but fully working cache-system when symlinks are not supported (#1067)
* first draft to avoid symlinks on windows
* refacto how to support symlinks
* support non-symlink in scan cache
* add tests for the case symlinks are not handled
* move warning message + add test
* add test for scan cache without symlink
* add documentation
* update doc url in warning
* add test to delete cache as well
* remove useless comment
* remove useless code
* comment
1 parent df90bdd commit e8801bd

6 files changed, +311 -37 lines changed

docs/source/how-to-cache.mdx

Lines changed: 19 additions & 0 deletions
@@ -109,6 +109,25 @@ In practice, your cache should look like the following tree:
 └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
 ```

+### Limitations
+
+In order to have an efficient cache-system, `huggingface_hub` uses symlinks. However,
+symlinks are not supported on all machines. This is a known limitation especially on
+Windows. When this is the case, `huggingface_hub` does not use the `blobs/` directory but
+directly stores the files in the `snapshots/` directory instead. This workaround allows
+users to download and cache files from the Hub exactly the same way. Tools to inspect
+and delete the cache (see below) are also supported. However, the cache-system is less
+efficient as a single file might be downloaded several times if multiple revisions of
+the same repo are downloaded.
+
+If you want to benefit from the symlink-based cache-system on a Windows machine, you
+either need to [activate Developer Mode](https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development)
+or to run Python as an administrator.
+
+When symlinks are not supported, a warning message is displayed to the user to alert
+them they are using a degraded version of the cache-system. This warning can be disabled
+by setting the `DISABLE_SYMLINKS_WARNING` environment variable to true.
+
 ## Scan your cache

 At the moment, cached files are never deleted from your local directory: when you download
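
The `DISABLE_SYMLINKS_WARNING` switch documented above can be illustrated with a minimal sketch. It assumes the check used in `file_download.py` in this commit (`os.environ.get("DISABLE_SYMLINKS_WARNING")`), under which any non-empty value silences the warning, not only `true`. The helper name `symlinks_warning_disabled` is hypothetical and is not part of `huggingface_hub`.

```python
import os
from typing import Mapping, Optional


def symlinks_warning_disabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper mirroring the truthiness check from this commit."""
    env = os.environ if environ is None else environ
    # An unset variable or an empty string is falsy; any other string is truthy.
    return bool(env.get("DISABLE_SYMLINKS_WARNING"))


print(symlinks_warning_disabled({}))  # False
print(symlinks_warning_disabled({"DISABLE_SYMLINKS_WARNING": "1"}))  # True
# Note: the string "false" is non-empty, so it also disables the warning.
print(symlinks_warning_disabled({"DISABLE_SYMLINKS_WARNING": "false"}))  # True
```

Under this assumption, leaving the variable unset is the way to keep the warning enabled; setting it to `0` or `false` would not.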

src/huggingface_hub/file_download.py

Lines changed: 64 additions & 15 deletions
@@ -4,6 +4,7 @@
 import json
 import os
 import re
+import shutil
 import sys
 import tempfile
 import warnings
@@ -171,6 +172,50 @@ def get_jinja_version():
     return _jinja_version


+_are_symlinks_supported: Optional[bool] = None
+
+
+def are_symlinks_supported() -> bool:
+    # Check symlink compatibility only once at first time use
+    global _are_symlinks_supported
+
+    if _are_symlinks_supported is None:
+        _are_symlinks_supported = True
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            src_path = Path(tmpdir) / "dummy_file_src"
+            src_path.touch()
+            dst_path = Path(tmpdir) / "dummy_file_dst"
+            try:
+                os.symlink(src_path, dst_path)
+            except OSError:
+                # Likely running on Windows
+                _are_symlinks_supported = False
+
+                if not os.environ.get("DISABLE_SYMLINKS_WARNING"):
+                    message = (
+                        "`huggingface_hub` cache-system uses symlinks by default to"
+                        " efficiently store duplicated files but your machine doesn't"
+                        " support them. Caching files will still work but in a degraded"
+                        " version that might require more space on your disk. This"
+                        " warning can be disabled by setting the"
+                        " `DISABLE_SYMLINKS_WARNING` environment variable. For more"
+                        " details, see"
+                        " https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations."
+                    )
+                    if os.name == "nt":
+                        message += (
+                            "\nTo support symlinks on Windows, you either need to"
+                            " activate Developer Mode or to run Python as an"
+                            " administrator. In order to activate Developer Mode,"
+                            " see this article:"
+                            " https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development"
+                        )
+                    warnings.warn(message)
+
+    return _are_symlinks_supported
+
+
 # Return value when trying to load a file from cache but the file does not exist in the distant repo.
 _CACHED_NO_EXIST = object()
 REGEX_COMMIT_HASH = re.compile(r"^[0-9a-f]{40}$")
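
The probe in `are_symlinks_supported()` can be exercised on any machine by simulating a failing `os.symlink`. The following is a minimal sketch, not the library's test code: `probe_symlink_support` is a hypothetical standalone copy of the probe without the module-level cache or the warning, and `unittest.mock.patch` stands in for a platform where symlink creation raises `OSError`.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock


def probe_symlink_support() -> bool:
    """Hypothetical, uncached version of the probe: try to create one symlink."""
    with tempfile.TemporaryDirectory() as tmpdir:
        src_path = Path(tmpdir) / "dummy_file_src"
        src_path.touch()
        dst_path = Path(tmpdir) / "dummy_file_dst"
        try:
            os.symlink(src_path, dst_path)
        except OSError:
            # Symlink creation is refused (e.g. Windows without Developer Mode).
            return False
    return True


# Simulate a machine without symlink support by making os.symlink raise.
with mock.patch("os.symlink", side_effect=OSError("symlinks unavailable")):
    simulated = probe_symlink_support()

print(simulated)  # False
```

Because the version in this commit caches its result in a module-level global, a test that patches `os.symlink` would presumably also need to reset that global between cases.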
@@ -848,7 +893,7 @@ def _normalize_etag(etag: Optional[str]) -> Optional[str]:
     return etag.strip('"')


-def _create_relative_symlink(src: str, dst: str) -> None:
+def _create_relative_symlink(src: str, dst: str, new_blob: bool = False) -> None:
     """Create a symbolic link named dst pointing to src as a relative path to dst.

     The relative part is mostly because it seems more elegant to the author.
@@ -858,25 +903,29 @@ def _create_relative_symlink(src: str, dst: str) -> None:
         ├── [ 128] 2439f60ef33a0d46d85da5001d52aeda5b00ce9f
         │ ├── [ 52] README.md -> ../../blobs/d7edf6bd2a681fb0175f7735299831ee1b22b812
         │ └── [ 76] pytorch_model.bin -> ../../blobs/403450e234d65943a7dcf7e05a771ce3c92faa84dd07db4ac20f592037a1e4bd
+
+    If symlinks cannot be created on this platform (most likely to be Windows), the
+    workaround is to avoid symlinks by having the actual file in `dst`. If it is a new
+    file (`new_blob=True`), we move it to `dst`. If it is not a new file
+    (`new_blob=False`), we don't know if the blob file is already referenced elsewhere.
+    To avoid breaking existing cache, the file is duplicated on the disk.
+
+    In case symlinks are not supported, a warning message is displayed to the user once
+    when loading `huggingface_hub`. The warning message can be disabled with the
+    `DISABLE_SYMLINKS_WARNING` environment variable.
     """
     relative_src = os.path.relpath(src, start=os.path.dirname(dst))
     try:
         os.remove(dst)
     except OSError:
         pass
-    try:
+
+    if are_symlinks_supported():
         os.symlink(relative_src, dst)
-    except OSError:
-        # Likely running on Windows
-        if os.name == "nt":
-            raise OSError(
-                "Windows requires Developer Mode to be activated, or to run Python as "
-                "an administrator, in order to create symlinks.\nIn order to "
-                "activate Developer Mode, see this article: "
-                "https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development"
-            )
-        else:
-            raise
+    elif new_blob:
+        os.replace(src, dst)
+    else:
+        shutil.copyfile(src, dst)


 def _cache_commit_hash_for_specific_revision(
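
The move-versus-copy fallback described in the docstring above can be sketched in isolation. This is an illustration under stated assumptions, not the library function: `place_pointer` is a hypothetical name, and symlink support is passed in as the `symlinks_supported` argument instead of being probed, so the degraded path can be run on any platform.

```python
import os
import shutil
import tempfile
from pathlib import Path


def place_pointer(src: str, dst: str, new_blob: bool, symlinks_supported: bool) -> None:
    """Hypothetical stand-in for the symlink / move / copy decision."""
    try:
        os.remove(dst)
    except OSError:
        pass

    if symlinks_supported:
        os.symlink(os.path.relpath(src, start=os.path.dirname(dst)), dst)
    elif new_blob:
        # Freshly downloaded blob: nothing else references it, so move it.
        os.replace(src, dst)
    else:
        # Existing blob: it may be referenced elsewhere, so duplicate it.
        shutil.copyfile(src, dst)


with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    blob = root / "blob"
    blob.write_text("weights")
    copied, moved = root / "copied", root / "moved"

    place_pointer(str(blob), str(copied), new_blob=False, symlinks_supported=False)
    blob_kept_after_copy = blob.exists()

    place_pointer(str(blob), str(moved), new_blob=True, symlinks_supported=False)
    blob_kept_after_move = blob.exists()

    contents = (copied.read_text(), moved.read_text())

print(blob_kept_after_copy, blob_kept_after_move)  # True False
```

Copying keeps the existing cache intact at the cost of disk space, which is the trade-off the docstring names for `new_blob=False`.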
@@ -1246,7 +1295,7 @@ def hf_hub_download(
     if os.path.exists(blob_path) and not force_download:
         # we have the blob already, but not the pointer
         logger.info("creating pointer to %s from %s", blob_path, pointer_path)
-        _create_relative_symlink(blob_path, pointer_path)
+        _create_relative_symlink(blob_path, pointer_path, new_blob=False)
         return pointer_path

     # Prevent parallel downloads of the same file with a lock.
@@ -1302,7 +1351,7 @@ def _resumable_file_manager() -> "io.BufferedWriter":
         os.replace(temp_file.name, blob_path)

         logger.info("creating pointer to %s from %s", blob_path, pointer_path)
-        _create_relative_symlink(blob_path, pointer_path)
+        _create_relative_symlink(blob_path, pointer_path, new_blob=True)

     try:
         os.remove(lock_path)

src/huggingface_hub/utils/_cache_manager.py

Lines changed: 0 additions & 11 deletions
@@ -631,7 +631,6 @@ def _scan_cached_repo(repo_path: Path) -> CachedRepoInfo:

     snapshots_path = repo_path / "snapshots"
     refs_path = repo_path / "refs"
-    blobs_path = repo_path / "blobs"

     if not snapshots_path.exists() or not snapshots_path.is_dir():
         raise CorruptedCacheException(
@@ -679,22 +678,12 @@ def _scan_cached_repo(repo_path: Path) -> CachedRepoInfo:
             if file_path.is_dir():
                 continue

-            if not file_path.is_symlink():
-                raise CorruptedCacheException(
-                    f"Revision folder corrupted. Found a non-symlink file: {file_path}"
-                )
-
             blob_path = Path(file_path).resolve()
             if not blob_path.exists():
                 raise CorruptedCacheException(
                     f"Blob missing (broken symlink): {blob_path}"
                 )

-            if blobs_path not in blob_path.parents:
-                raise CorruptedCacheException(
-                    f"Blob symlink points outside of blob directory: {blob_path}"
-                )
-
             if blob_path not in blob_stats:
                 blob_stats[blob_path] = blob_path.stat()
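
With the symlink checks removed, the scan relies on `Path.resolve()` alone: a symlink resolves to its blob, while a regular file resolves to itself. The sketch below illustrates the consequence for a symlink-free cache, where the same file stored under two revisions is counted twice. The function `unique_blob_sizes` and the directory names are hypothetical and much simpler than `_scan_cached_repo`.

```python
import tempfile
from pathlib import Path
from typing import Dict


def unique_blob_sizes(snapshots_path: Path) -> Dict[Path, int]:
    """Hypothetical helper: size of each distinct resolved file under snapshots/."""
    sizes: Dict[Path, int] = {}
    for file_path in snapshots_path.glob("**/*"):
        if file_path.is_dir():
            continue
        # Symlinks resolve to their target; regular files resolve to themselves.
        blob_path = file_path.resolve()
        if blob_path not in sizes:
            sizes[blob_path] = blob_path.stat().st_size
    return sizes


with tempfile.TemporaryDirectory() as tmpdir:
    snapshots = Path(tmpdir).resolve() / "snapshots"
    # Degraded layout: each revision holds its own real copy of the file.
    for revision in ("rev_a", "rev_b"):
        (snapshots / revision).mkdir(parents=True)
        (snapshots / revision / "config.json").write_text("{}")

    sizes = unique_blob_sizes(snapshots)
    total_size = sum(sizes.values())

print(len(sizes), total_size)  # 2 4
```

In a symlink-based layout both entries would resolve to one blob and be counted once, which is the disk-space difference the documentation change in this commit describes.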

tests/conftest.py

Lines changed: 22 additions & 0 deletions
@@ -1,10 +1,32 @@
+from pathlib import Path
+from tempfile import TemporaryDirectory
 from typing import Generator

 import pytest

+from _pytest.fixtures import SubRequest
 from huggingface_hub import HfFolder


+@pytest.fixture
+def fx_cache_dir(request: SubRequest) -> Generator[None, None, None]:
+    """Add a `cache_dir` attribute pointing to a temporary directory in tests.
+
+    Example:
+    ```py
+    @pytest.mark.usefixtures("fx_cache_dir")
+    class TestWithCache(unittest.TestCase):
+        cache_dir: Path
+
+        def test_cache_dir(self) -> None:
+            self.assertTrue(self.cache_dir.is_dir())
+    ```
+    """
+    with TemporaryDirectory() as cache_dir:
+        request.cls.cache_dir = Path(cache_dir).resolve()
+        yield
+
+
 @pytest.fixture(autouse=True, scope="session")
 def clean_hf_folder_token_for_tests() -> Generator:
     """Clean token stored on machine before all tests and reset it back at the end.
