Commit 78c3235

Twitter's Recommendation Algorithm - Heavy Ranker and TwHIN embeddings
twitter-team committed
0 parents  commit 78c3235

111 files changed

+11876
-0
lines changed

.github/workflows/main.yml

+39
@@ -0,0 +1,39 @@
name: Python package

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10"]

    steps:
    - uses: actions/checkout@v3
    # - uses: pre-commit/[email protected]
    #   name: Run pre-commit checks (pylint/yapf/isort)
    #   env:
    #     SKIP: insert-license
    #   with:
    #     extra_args: --hook-stage push --all-files
    - uses: actions/setup-python@v4
      with:
        python-version: "3.10"
        cache: "pip" # caching pip dependencies
    - name: install packages
      run: |
        /usr/bin/python -m pip install --upgrade pip
        pip install --no-deps -r images/requirements.txt
    # - name: ssh access
    #   uses: lhotari/action-upterm@v1
    #   with:
    #     limit-access-to-actor: true
    #     limit-access-to-users: arashd
    - name: run tests
      run: |
        # Environment variables are reset in between steps.
        mkdir /tmp/github_testing
        ln -s $GITHUB_WORKSPACE /tmp/github_testing/tml
        export PYTHONPATH="/tmp/github_testing:$PYTHONPATH"
        pytest -vv
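
The "run tests" step makes the checkout importable as the `tml` package by symlinking the workspace under a scratch directory and putting that directory on `PYTHONPATH`. The same idea can be sketched in Python with the standard library only. The helper name `expose_as_package`, the temporary directories, and the tiny `common` package below are hypothetical stand-ins for illustration, not part of the commit.

```python
import importlib
import os
import sys
import tempfile


def expose_as_package(workspace: str, package_name: str = "tml") -> str:
    """Symlink `workspace` into a fresh directory as `package_name` and add that directory to sys.path.

    Mirrors `ln -s $GITHUB_WORKSPACE /tmp/github_testing/tml` followed by the PYTHONPATH export.
    """
    parent = tempfile.mkdtemp(prefix="github_testing_")
    os.symlink(workspace, os.path.join(parent, package_name), target_is_directory=True)
    sys.path.insert(0, parent)
    importlib.invalidate_caches()
    return parent


# Throwaway stand-in for the checked-out repository (hypothetical layout).
workspace = tempfile.mkdtemp(prefix="workspace_")
os.makedirs(os.path.join(workspace, "common"))
with open(os.path.join(workspace, "common", "__init__.py"), "w") as f:
    f.write("ANSWER = 42\n")

parent = expose_as_package(workspace)

# The workspace root has no __init__.py here, so `tml` resolves as a namespace package.
common = importlib.import_module("tml.common")
print(common.ANSWER)
```

The symlink matters because the repository directory itself is not named `tml`, while imports in the code base (for example `from tml.common.checkpointing.snapshot import ...`) assume that package name.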

.gitignore

+35
@@ -0,0 +1,35 @@
# Mac
.DS_Store

# Vim
*.py.swp

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Distribution / packaging
build/
develop-eggs/
dist/
eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
.hypothesis

venv
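
The `*.py[cod]` entry uses a character class, so it covers three suffixes at once while leaving plain `.py` sources tracked. A rough way to check this is with Python's `fnmatch`, which only approximates gitignore matching for simple patterns without slashes; the file names below are made up for illustration.

```python
from fnmatch import fnmatchcase

# Pattern copied from the .gitignore above; [cod] matches exactly one of c, o, d.
PATTERN = "*.py[cod]"

# Hypothetical file names, not taken from the repository.
candidates = ["model.pyc", "model.pyo", "model.pyd", "model.py", "model.pyx"]

ignored = [name for name in candidates if fnmatchcase(name, PATTERN)]
print(ignored)
```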

.pre-commit-config.yaml

+16
@@ -0,0 +1,16 @@
repos:
- repo: https://github.com/pausan/cblack
  rev: release-22.3.0
  hooks:
  - id: cblack
    name: cblack
    description: "Black: The uncompromising Python code formatter - 2 space indent fork"
    entry: cblack . -l 100
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v2.3.0
  hooks:
  - id: trailing-whitespace
  - id: end-of-file-fixer
  - id: check-yaml
  - id: check-added-large-files
  - id: check-merge-conflict

COPYING

+661
Large diffs are not rendered by default.

LICENSE.torchrec

+33
@@ -0,0 +1,33 @@
A few files here (where it is specifically noted in comments) are based on code from torchrec but
adapted for our use. Torchrec license is below:


BSD 3-Clause License

Copyright (c) Meta Platforms, Inc. and affiliates.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

+14
@@ -0,0 +1,14 @@
This project open sources some of the ML models used at Twitter.

Currently these are:

1. The "For You" Heavy Ranker (projects/home/recap).

2. TwHIN embeddings (projects/twhin) https://arxiv.org/abs/2202.05387


This project can be run inside a python virtualenv. We have only tried this on Linux machines and because we use torchrec it works best with an Nvidia GPU. To setup run

`./images/init_venv.sh` (Linux only).

The READMEs of each project contain instructions about how to run each project.

common/__init__.py

Whitespace-only changes.

common/batch.py

+85
@@ -0,0 +1,85 @@
"""Extension of torchrec.dataset.utils.Batch to cover any dataset.
"""
# flake8: noqa
from __future__ import annotations
from typing import Dict
import abc
from dataclasses import dataclass
import dataclasses

import torch
from torchrec.streamable import Pipelineable


class BatchBase(Pipelineable, abc.ABC):
  @abc.abstractmethod
  def as_dict(self) -> Dict:
    raise NotImplementedError

  def to(self, device: torch.device, non_blocking: bool = False):
    args = {}
    for feature_name, feature_value in self.as_dict().items():
      args[feature_name] = feature_value.to(device=device, non_blocking=non_blocking)
    return self.__class__(**args)

  def record_stream(self, stream: torch.cuda.streams.Stream) -> None:
    for feature_value in self.as_dict().values():
      feature_value.record_stream(stream)

  def pin_memory(self):
    args = {}
    for feature_name, feature_value in self.as_dict().items():
      args[feature_name] = feature_value.pin_memory()
    return self.__class__(**args)

  def __repr__(self) -> str:
    def obj2str(v):
      return f"{v.size()}" if hasattr(v, "size") else f"{v.length_per_key()}"

    return "\n".join([f"{k}: {obj2str(v)}," for k, v in self.as_dict().items()])

  @property
  def batch_size(self) -> int:
    for tensor in self.as_dict().values():
      if tensor is None:
        continue
      if not isinstance(tensor, torch.Tensor):
        continue
      return tensor.shape[0]
    raise Exception("Could not determine batch size from tensors.")


@dataclass
class DataclassBatch(BatchBase):
  @classmethod
  def feature_names(cls):
    return list(cls.__dataclass_fields__.keys())

  def as_dict(self):
    return {
      feature_name: getattr(self, feature_name)
      for feature_name in self.feature_names()
      if hasattr(self, feature_name)
    }

  @staticmethod
  def from_schema(name: str, schema):
    """Instantiates a custom batch subclass if all columns can be represented as a torch.Tensor."""
    return dataclasses.make_dataclass(
      cls_name=name,
      fields=[(name, torch.Tensor, dataclasses.field(default=None)) for name in schema.names],
      bases=(DataclassBatch,),
    )

  @staticmethod
  def from_fields(name: str, fields: dict):
    return dataclasses.make_dataclass(
      cls_name=name,
      fields=[(_name, _type, dataclasses.field(default=None)) for _name, _type in fields.items()],
      bases=(DataclassBatch,),
    )


class DictionaryBatch(BatchBase, dict):
  def as_dict(self) -> Dict:
    return self
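
`DataclassBatch.from_fields` builds a batch class at runtime with `dataclasses.make_dataclass`, giving every feature a `None` default and inheriting the shared helpers through `bases`. The sketch below shows that pattern with the standard library only, so it runs without torch or torchrec. `MiniBatch`, `ClickBatch`, and the field names are hypothetical, plain lists stand in for tensors, and `batch_size` is simplified: it skips only `None` values, whereas the original also skips anything that is not a `torch.Tensor`.

```python
import dataclasses
from typing import Dict, List


@dataclasses.dataclass
class MiniBatch:
    """Hypothetical stand-in for DataclassBatch without the torch/torchrec dependency."""

    @classmethod
    def feature_names(cls) -> List[str]:
        return [f.name for f in dataclasses.fields(cls)]

    def as_dict(self) -> Dict[str, object]:
        return {name: getattr(self, name) for name in self.feature_names()}

    @property
    def batch_size(self) -> int:
        # Simplified from the original: report the length of the first non-None feature.
        for value in self.as_dict().values():
            if value is None:
                continue
            return len(value)
        raise ValueError("Could not determine batch size.")

    @staticmethod
    def from_fields(name: str, fields: Dict[str, type]):
        # Same shape as the original call: one (name, type, default=None) triple per feature.
        return dataclasses.make_dataclass(
            cls_name=name,
            fields=[(n, t, dataclasses.field(default=None)) for n, t in fields.items()],
            bases=(MiniBatch,),
        )


# Build a batch class from a field mapping, then fill in only one of its features.
ClickBatch = MiniBatch.from_fields("ClickBatch", {"user_id": list, "label": list})
batch = ClickBatch(label=[1, 0, 1])

print(ClickBatch.feature_names())
print(batch.batch_size)
```

Because every generated field defaults to `None`, a batch can be constructed with only some features present, which is why `batch_size` has to skip missing entries before reading a length.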

common/checkpointing/__init__.py

+1
@@ -0,0 +1 @@
from tml.common.checkpointing.snapshot import get_checkpoint, Snapshot
