
POCA trainer #5005


Merged — 289 commits, Mar 12, 2021
Changes from all commits
62e9b45
Make the env easier
Dec 15, 2020
1ebacc1
Remove prints
Dec 15, 2020
cb57bf0
Make Collab env harder
Dec 17, 2020
95b3522
Fix group ID
Dec 18, 2020
afd7476
Add cc to ghost trainer
Dec 18, 2020
292b6ce
Add comment to ghost trainer
Dec 18, 2020
112a9dc
Revert "Add comment to ghost trainer"
Dec 18, 2020
783db4c
Actually add comment to ghost trainer
Dec 18, 2020
6c4ba1e
Scale size of CC network
Dec 21, 2020
d314478
Scale value network based on num agents
Dec 21, 2020
c7adb93
Add 3rd symbol to hallway collab
Dec 21, 2020
d2e315d
Make comms one-hot
Dec 21, 2020
5cf76e3
Fix S tag
Dec 23, 2020
8708f70
Merge branch 'master' into develop-centralizedcritic-mm
Jan 4, 2021
44fb8b5
Additional changes
Jan 4, 2021
56f9dbf
Some more fixes
Jan 4, 2021
a468075
Self-attention Centralized Critic
Jan 6, 2021
db184d9
separate entity encoder and RSA
andrewcoh Jan 11, 2021
32cbdee
clean up args in mha
andrewcoh Jan 11, 2021
c90472c
more cleanups
andrewcoh Jan 11, 2021
d429b53
fixed tests
andrewcoh Jan 11, 2021
44093f2
Merge branch 'develop-attention-refactor' into develop-centralizedcri…
Jan 11, 2021
1dc0059
Merge branch 'develop-attention-refactor' into develop-centralizedcri…
Jan 11, 2021
2b5b994
entity embeddings work with no max
Jan 11, 2021
cd84fe3
remove group id
Jan 11, 2021
eed2fce
very rough sketch for TeamManager interface
Jan 8, 2021
fe41094
One layer for entity embed
Jan 12, 2021
3822b18
Use 4 heads
Jan 12, 2021
3f4b2b5
add defaults to linear encoder, initialize ent encoders
andrewcoh Jan 12, 2021
c7c7d4c
Merge branch 'master' into develop-centralizedcritic-mm
Jan 12, 2021
f391b35
Merge branch 'develop-lin-enc-def' into develop-centralizedcritic-mm
Jan 12, 2021
f706a91
add team manager id to proto
Jan 12, 2021
cee5466
team manager for hallway
Jan 12, 2021
195978c
add manager to hallway
Jan 12, 2021
10f336e
send and process team manager id
Jan 12, 2021
f0bf657
remove print
Jan 12, 2021
e03c79e
Merge branch 'develop-centralizedcritic-mm' into develop-cc-teammanager
Jan 12, 2021
1118089
small cleanup
Jan 13, 2021
13a90b1
default behavior for baseTeamManager
Jan 13, 2021
36d1b5b
add back statsrecorder
Jan 13, 2021
376d500
update
Jan 13, 2021
dd8b5fb
Team manager prototype (#4850)
Jan 13, 2021
8673820
Remove statsrecorder
Jan 13, 2021
fb86a57
Fix AgentProcessor for TeamManager
Jan 13, 2021
1beea7d
Merge branch 'develop-centralizedcritic-mm' into develop-cc-teammanager
Jan 13, 2021
9e69790
team manager
Jan 13, 2021
3c2b9d1
New buffer layout, TeamObsUtil, pad dead agents
Jan 14, 2021
b4b9d72
Use NaNs to get masks for attention
Jan 14, 2021
7d5f3e3
Add team reward to buffer
Jan 15, 2021
b7c5533
Try subtract marginalized value
Jan 15, 2021
53e1277
Add Q function with attention
Jan 20, 2021
2134004
Some more progress - still broken
Jan 20, 2021
60c6071
use singular entity embedding (#4873)
andrewcoh Jan 20, 2021
47cfae4
I think it's running
Jan 20, 2021
d31da21
Actions added but untested
Jan 21, 2021
541d062
Fix issue with team_actions
Jan 22, 2021
d3c4372
Add next action and next team obs
Jan 22, 2021
3407478
separate forward into q_net and baseline
andrewcoh Jan 22, 2021
f84ca50
Merge branch 'develop-centralizedcritic-counterfact' into develop-coma2
andrewcoh Jan 22, 2021
287c1b9
might be right
andrewcoh Jan 22, 2021
f73ef80
forcing this to work
andrewcoh Jan 22, 2021
10a416a
buffer error
andrewcoh Jan 22, 2021
e716199
COMAA runs
andrewcoh Jan 23, 2021
45349b8
add lambda return and target network
andrewcoh Jan 23, 2021
9a6474e
no target net
andrewcoh Jan 24, 2021
04d9617
remove normalize advantages
andrewcoh Jan 24, 2021
5bbb222
add target network back
andrewcoh Jan 24, 2021
2868694
value estimator
andrewcoh Jan 24, 2021
c9b4e71
update coma config
andrewcoh Jan 24, 2021
a10caaf
add target net
andrewcoh Jan 24, 2021
44c616d
no target, increase lambda
andrewcoh Jan 24, 2021
ef01af4
remove prints
andrewcoh Jan 24, 2021
f329e1d
cloud config
andrewcoh Jan 24, 2021
fbd1749
use v return
andrewcoh Jan 25, 2021
908b1df
use target net
andrewcoh Jan 25, 2021
d4073ce
adding zombie to coma2 branch
andrewcoh Jan 25, 2021
7d8f2b5
add callbacks
andrewcoh Jan 25, 2021
9452239
cloud run with coma2 of held out zombie test env
andrewcoh Jan 25, 2021
39adec6
target of baseline is returns_v
andrewcoh Jan 26, 2021
14bb6fd
remove target update
andrewcoh Jan 26, 2021
7cb5dbc
Add team dones
Jan 26, 2021
761a206
Integrate teammate dones
andrewcoh Jan 26, 2021
3afae60
add value clipping
andrewcoh Jan 26, 2021
f0dfada
try again on cloud
andrewcoh Jan 26, 2021
c3d8d8e
clipping values and updated zombie
andrewcoh Jan 27, 2021
c3d84c5
update configs
andrewcoh Jan 27, 2021
f5419aa
remove value head clipping
andrewcoh Jan 27, 2021
d7a2386
update zombie config
andrewcoh Jan 27, 2021
cdc6dde
Add trust region to COMA updates
Jan 29, 2021
4f35048
Remove Q-net for perf
Jan 29, 2021
05c8ea1
Weight decay, regularization loss
Jan 29, 2021
a7f2fc2
Use same network
Jan 29, 2021
6d2be2c
add base team manager
Feb 1, 2021
b812da4
Remove reg loss, still stable
Feb 4, 2021
0c3dbff
Black format
Feb 4, 2021
09590ad
add team reward field to agent and proto
Feb 5, 2021
c982c06
set team reward
Feb 5, 2021
7e3d976
add maxstep to teammanager and hook to academy
Feb 5, 2021
c40fec0
check agent by agent.enabled
Feb 8, 2021
ffb3f0b
remove manager from academy when dispose
Feb 9, 2021
f87cfbd
move manager
Feb 9, 2021
8b8e916
put team reward in decision steps
Feb 9, 2021
6b71f5a
use 0 as default manager id
Feb 9, 2021
87e97dd
fix setTeamReward
Feb 9, 2021
d3d1dc1
change method name to GetRegisteredAgents
Feb 9, 2021
2ba09ca
address comments
Feb 9, 2021
5587e48
Merge branch 'develop-base-teammanager' into develop-agentprocessor-t…
Feb 9, 2021
7e51ad1
Merge branch 'develop-base-teammanager' into develop-agentprocessor-t…
Feb 9, 2021
f25b171
Revert C# env changes
Feb 9, 2021
128b09b
Remove a bunch of stuff from envs
Feb 9, 2021
4690c4e
Remove a bunch of extra files
Feb 9, 2021
dbdd045
Remove changes from base-teammanager
Feb 9, 2021
30c846f
Remove remaining files
Feb 9, 2021
dd7f867
Remove some unneeded changes
Feb 9, 2021
f36f696
Make buffer typing neater
Feb 9, 2021
a1b7e75
AgentProcessor fixes
Feb 9, 2021
236f398
Back out trainer changes
Feb 9, 2021
96278d0
Separate Actor/Critic, remove ActorCritics
andrewcoh Feb 9, 2021
5f8cbc5
update policy to not use critic
andrewcoh Feb 9, 2021
293ec08
add critic to optimizer, ppo runs
andrewcoh Feb 9, 2021
7d20bd9
fix precommit errors
andrewcoh Feb 9, 2021
a22c621
use delegate to avoid agent-manager cyclic reference
Feb 9, 2021
c669226
fix test_networks
andrewcoh Feb 9, 2021
2dc90a9
put team reward in decision steps
Feb 9, 2021
70207a3
fix unregister agents
Feb 10, 2021
b22d0ae
Update SAC to use separate policy
Feb 10, 2021
49282f6
add teamreward to decision step
Feb 10, 2021
204b45b
typo
Feb 10, 2021
7eacfba
unregister on disabled
Feb 10, 2021
016ffd8
remove OnTeamEpisodeBegin
Feb 10, 2021
d7e2ca6
make critic a property
andrewcoh Feb 10, 2021
8b9d662
change name TeamManager to MultiAgentGroup
Feb 11, 2021
3fb14b9
more team -> group
Feb 11, 2021
4e4ecad
fix tests
Feb 11, 2021
492fd17
fix tests
Feb 11, 2021
9f6eca7
remove commented code
andrewcoh Feb 11, 2021
7292672
Merge remote-tracking branch 'origin/develop-base-teammanager' into d…
Feb 11, 2021
78e052b
Use attention tests from master
Feb 11, 2021
81d8389
Revert "Use attention tests from master"
Feb 11, 2021
39f92c3
Use attention from master
Feb 11, 2021
1d500d6
Renaming fest
Feb 11, 2021
6418e05
Use NamedTuples instead of attrs classes
Feb 11, 2021
944997a
fix saver test
andrewcoh Feb 11, 2021
527ca06
Move value network for SAC to device
Feb 11, 2021
eb15030
Merge remote-tracking branch 'origin/develop-critic-optimizer' into d…
Feb 11, 2021
4d215cf
add SharedActorCritic
andrewcoh Feb 11, 2021
9fac4b1
test for SharedActorCritic
andrewcoh Feb 11, 2021
d5a30f1
fix agent processor test
andrewcoh Feb 11, 2021
6da8dd3
Bug fixes
Feb 11, 2021
ad4a821
remove GroupMaxStep
Feb 12, 2021
9725aa5
add some doc
Feb 12, 2021
65b5992
fix sac shared
andrewcoh Feb 12, 2021
f5190fe
Fix mock brain
Feb 12, 2021
664ae89
np float32 fixes
Feb 12, 2021
8f696f4
more renaming
Feb 12, 2021
31da276
fix test policy
andrewcoh Feb 12, 2021
77557ca
Test for team obs in agentprocessor
Feb 12, 2021
6464cb6
Test for group and add team reward
Feb 12, 2021
cbfdfb3
doc improve
Feb 12, 2021
6badfb5
Merge branch 'master' into develop-base-teammanager
Feb 13, 2021
ef67f53
Merge branch 'master' into develop-base-teammanager
Feb 13, 2021
8e78dbd
Merge branch 'develop-base-teammanager' of https://github.com/Unity-T…
Feb 13, 2021
31ee1c4
store registered agents in set
Feb 16, 2021
1e4c837
remove unused step counts
Feb 17, 2021
cba26b2
Merge branch 'develop-base-teammanager' into develop-agentprocessor-t…
Feb 17, 2021
2113a43
Global group ids
Feb 17, 2021
ba9896c
coma trainer and optimizer
andrewcoh Feb 17, 2021
5679e2f
MultiInputNetBody
andrewcoh Feb 17, 2021
0e28c07
Fix Trajectory test
Feb 19, 2021
6936004
Merge branch 'master' into develop-agentprocessor-teammanager
Feb 23, 2021
97d1b80
Remove duplicated files
Feb 23, 2021
ce2e7b1
Merge branch 'develop-agentprocessor-teammanager' into develop-coma2-…
Feb 23, 2021
dbcf313
Running COMA (not sure if learning)
Feb 23, 2021
9a00053
Add team methods to AgentAction
Feb 23, 2021
fce4ad3
Right loss function for stability, fix some pypi
Feb 23, 2021
2c03d2b
Buffer fixes
Feb 23, 2021
f70c345
Group reward function
Feb 23, 2021
33a27e0
Add PushBlockCollab config and fix some stuff
Feb 24, 2021
b39e873
Fix Team Cumulative Reward
Feb 24, 2021
f879b61
Buffer fixes
Feb 23, 2021
f86e7b4
clean ups (#5003)
andrewcoh Feb 24, 2021
01ca5df
Merge branch 'master' into develop-coma2-trainer
andrewcoh Feb 24, 2021
6d7a604
Add test for GroupObs
Feb 24, 2021
587e3da
Change AgentAction back to 0 pad and add tests
Feb 24, 2021
fd4aa53
Addressed some comments
Feb 24, 2021
8dbea77
Address some comments
Feb 25, 2021
ec9e5ad
Add more comments
Feb 25, 2021
e1f48db
Rename internal function
Feb 25, 2021
d42896a
Move padding method to AgentBufferField
Feb 25, 2021
b3f2689
Merge branch 'main' into develop-agentprocessor-teammanager
Feb 25, 2021
7085461
checkout ppo/optimizer from main
andrewcoh Feb 25, 2021
7005daa
Merge branch 'develop-agentprocessor-teammanager' into develop-coma2-…
andrewcoh Feb 25, 2021
3658310
Fix slicing typing and string printing in AgentBufferField
Feb 25, 2021
b2100c1
Fix slicing typing and string printing in AgentBufferField
Feb 25, 2021
7b1f805
Fix to-flat and add tests
Feb 25, 2021
8096b11
clean ups
andrewcoh Feb 26, 2021
8359ca3
Merge branch 'develop-agentprocessor-teammanager' into develop-coma2-…
andrewcoh Feb 26, 2021
8c18a80
add initial coma optimizer tests
andrewcoh Feb 26, 2021
cc9f5c0
Faster NaN masking, fix masking for visual obs (#5015)
Feb 26, 2021
c34837c
get value estimate test
andrewcoh Mar 3, 2021
107bb3d
Merge branch 'main' into develop-coma2-trainer
Mar 4, 2021
4e82760
initial evaluate_by_seq, does not run
andrewcoh Mar 4, 2021
4b7db51
finished evaluate_by_seq, does not run
andrewcoh Mar 4, 2021
5905680
ignoring precommit, grabbing baseline/critic mems from buffer in trainer
andrewcoh Mar 4, 2021
d7d622a
lstm almost runs
andrewcoh Mar 4, 2021
7ec4b34
lstm runs with coma
andrewcoh Mar 4, 2021
277d66f
[coma2] Make group extrinsic reward part of extrinsic (#5033)
Mar 5, 2021
2a93ca1
[coma2] Add support for variable length obs in COMA2 (#5038)
Mar 5, 2021
743ede0
Fix pypi issues
Mar 5, 2021
edfdbdc
Action slice (#5047)
andrewcoh Mar 5, 2021
c15abe4
Merge branch 'main' into develop-coma2-trainer
Mar 5, 2021
1a01dd4
add torch no_grad to coma LSTM value computation
andrewcoh Mar 8, 2021
e12744a
torch coma tests: lstm, cur, gail
andrewcoh Mar 8, 2021
71da407
Fix warning message format
Mar 8, 2021
17bcd7f
Fix warning message formatting again
Mar 8, 2021
bf08535
[coma2] Group reward reporting fix (#5065)
Mar 9, 2021
a238fe4
Copy obs before removing NaNs (#5069)
Mar 10, 2021
1837968
Change comments on COMA2 trainer
Mar 10, 2021
d4fe1f2
Merge branch 'main' into develop-coma2-trainer
Mar 10, 2021
369aa86
Properly restore and test COMA2 optimizer
Mar 10, 2021
2071092
multiinput netbody tests
andrewcoh Mar 10, 2021
0a5092b
Merge branch 'develop-coma2-trainer' of https://github.com/Unity-Tech…
andrewcoh Mar 10, 2021
b6e70ce
cleanup some mypy types (#5072)
Mar 10, 2021
2c83696
coma -> poca
Mar 10, 2021
f243181
Rename folders and files
Mar 10, 2021
2df0296
Fix two more coma references
Mar 10, 2021
a7d2a65
Multiagent simplerl (#5066)
andrewcoh Mar 10, 2021
2d0ee89
add docstrings to network body
andrewcoh Mar 10, 2021
8046811
ource /Users/ervin/.virtualenvs/mlagents-38/bin/activate
Mar 10, 2021
ab6b1d5
Update tests
Mar 10, 2021
0f4201a
rename to MultiAgentNetwork, docstring
andrewcoh Mar 10, 2021
56548dd
fix references to ppo
andrewcoh Mar 10, 2021
3b91d38
docstrings to poca optimizer
andrewcoh Mar 10, 2021
20c8759
Move common loss functions for PPO and POCA (#5079)
Mar 11, 2021
2ed7f46
Turn on the SimpleMultiAgentGroup
Mar 11, 2021
8511f9f
[poca] Remove add_groupmate_rewards from settings (#5082)
Mar 11, 2021
445c1f0
Merge branch 'main' into develop-coma2-trainer
Mar 11, 2021
f98c615
Untrack PB Collab Config
Mar 11, 2021
65af6ff
Update comment and fix reporting of group dones
Mar 11, 2021
7f92adc
reduce hybrid sac steps
andrewcoh Mar 11, 2021
7058eaa
Merge branch 'main' into develop-coma2-trainer
andrewcoh Mar 11, 2021
3ef5f17
Refactor extrinsic reward provider
Mar 11, 2021
83c3187
Create POCASettings class
Mar 11, 2021
461db66
use group reward + final reward to calculate ELO
andrewcoh Mar 11, 2021
9b1369f
Merge branch 'develop-coma2-trainer' of https://github.com/Unity-Tech…
andrewcoh Mar 11, 2021
d50b873
parameter docstring to POCA Value
andrewcoh Mar 11, 2021
5d3e500
rename group obs to groupmate obs
andrewcoh Mar 11, 2021
ff9bd1e
[poca] Make Observation Encoder a module (#5093)
Mar 12, 2021
100a7ac
Get processors out of observation_encoder
Mar 12, 2021
10d63ae
rename to groupmate obs (#5094)
andrewcoh Mar 12, 2021
2 changes: 1 addition & 1 deletion com.unity.ml-agents/Runtime/SimpleMultiAgentGroup.cs
@@ -7,7 +7,7 @@ namespace Unity.MLAgents
/// <summary>
/// A basic class implementation of MultiAgentGroup.
/// </summary>
-    internal class SimpleMultiAgentGroup : IMultiAgentGroup, IDisposable
+    public class SimpleMultiAgentGroup : IMultiAgentGroup, IDisposable
{
readonly int m_Id = MultiAgentGroupIdCounter.GetGroupId();
HashSet<Agent> m_Agents = new HashSet<Agent>();
6 changes: 6 additions & 0 deletions ml-agents/mlagents/trainers/buffer.py
@@ -35,6 +35,7 @@ class BufferKey(enum.Enum):
MASKS = "masks"
MEMORY = "memory"
CRITIC_MEMORY = "critic_memory"
+    BASELINE_MEMORY = "poca_baseline_memory"
PREV_ACTION = "prev_action"

ADVANTAGES = "advantages"
@@ -63,6 +64,7 @@ class RewardSignalKeyPrefix(enum.Enum):
VALUE_ESTIMATES = "value_estimates"
RETURNS = "returns"
ADVANTAGE = "advantage"
+    BASELINES = "baselines"


AgentBufferKey = Union[
@@ -87,6 +89,10 @@ def returns_key(name: str) -> AgentBufferKey:
def advantage_key(name: str) -> AgentBufferKey:
return RewardSignalKeyPrefix.ADVANTAGE, name

+    @staticmethod
+    def baseline_estimates_key(name: str) -> AgentBufferKey:
+        return RewardSignalKeyPrefix.BASELINES, name


class AgentBufferField(list):
"""
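The new `BASELINE_MEMORY` key and `BASELINES` prefix follow the buffer's existing two-tier keying pattern: a plain enum member for fixed fields, and a `(prefix, signal_name)` tuple for fields that exist once per reward signal. A minimal, self-contained sketch of that scheme (not the actual mlagents module, just the pattern the diff extends):

```python
import enum
from typing import Tuple, Union

class BufferKey(enum.Enum):
    # Fixed, per-agent buffer fields.
    MASKS = "masks"
    MEMORY = "memory"
    CRITIC_MEMORY = "critic_memory"
    BASELINE_MEMORY = "poca_baseline_memory"  # added for POCA

class RewardSignalKeyPrefix(enum.Enum):
    # Prefixes combined with a reward-signal name, e.g. "extrinsic".
    VALUE_ESTIMATES = "value_estimates"
    RETURNS = "returns"
    ADVANTAGE = "advantage"
    BASELINES = "baselines"  # added for POCA

# A buffer key is either a fixed enum member or a (prefix, name) tuple.
AgentBufferKey = Union[BufferKey, Tuple[RewardSignalKeyPrefix, str]]

def baseline_estimates_key(name: str) -> AgentBufferKey:
    # Composite key for one reward signal's POCA baselines.
    return RewardSignalKeyPrefix.BASELINES, name
```

Because tuples of hashables are themselves hashable, these composite keys can index the same dict as the plain enum keys without any string concatenation.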
4 changes: 3 additions & 1 deletion ml-agents/mlagents/trainers/ghost/trainer.py
@@ -192,7 +192,9 @@ def _process_trajectory(self, trajectory: Trajectory) -> None:
"""
if trajectory.done_reached:
# Assumption is that final reward is >0/0/<0 for win/draw/loss
-            final_reward = trajectory.steps[-1].reward
+            final_reward = (
+                trajectory.steps[-1].reward + trajectory.steps[-1].group_reward
+            )
result = 0.5
if final_reward > 0:
result = 1.0
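The ghost-trainer change folds the group reward into the final-step reward before mapping it to a match result for ELO. A hedged sketch of that mapping, with the sign convention taken from the comment in the diff (combined final reward is positive for a win, zero for a draw, negative for a loss); `elo_result` is an illustrative name, not the trainer's actual function:

```python
def elo_result(final_reward: float, final_group_reward: float) -> float:
    """Map the last step's individual + group reward to an ELO result."""
    total = final_reward + final_group_reward
    result = 0.5          # draw by default
    if total > 0:
        result = 1.0      # win
    elif total < 0:
        result = 0.0      # loss
    return result
```

Before this fix, a cooperative win paid out only through `group_reward` would have scored as a draw, since the old code looked at the individual `reward` alone.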
28 changes: 16 additions & 12 deletions ml-agents/mlagents/trainers/optimizer/torch_optimizer.py
@@ -10,7 +10,11 @@

from mlagents.trainers.policy.torch_policy import TorchPolicy
from mlagents.trainers.optimizer import Optimizer
-from mlagents.trainers.settings import TrainerSettings
+from mlagents.trainers.settings import (
+    TrainerSettings,
+    RewardSignalSettings,
+    RewardSignalType,
+)
from mlagents.trainers.torch.utils import ModelUtils


@@ -44,7 +48,9 @@ def critic(self):
def update(self, batch: AgentBuffer, num_sequences: int) -> Dict[str, float]:
pass

-    def create_reward_signals(self, reward_signal_configs):
+    def create_reward_signals(
+        self, reward_signal_configs: Dict[RewardSignalType, RewardSignalSettings]
+    ) -> None:
"""
Create reward signals
:param reward_signal_configs: Reward signal config.
@@ -56,7 +62,7 @@ def create_reward_signals(
)

def _evaluate_by_sequence(
-        self, tensor_obs: List[torch.Tensor], initial_memory: np.ndarray
+        self, tensor_obs: List[torch.Tensor], initial_memory: torch.Tensor
) -> Tuple[Dict[str, torch.Tensor], AgentBufferField, torch.Tensor]:
"""
Evaluate a trajectory sequence-by-sequence, assembling the result. This enables us to get the
@@ -78,10 +84,8 @@ def _evaluate_by_sequence(
# Compute values for the potentially truncated initial sequence
seq_obs = []

-        first_seq_len = self.policy.sequence_length
+        first_seq_len = leftover if leftover > 0 else self.policy.sequence_length
         for _obs in tensor_obs:
-            if leftover > 0:
-                first_seq_len = leftover
             first_seq_obs = _obs[0:first_seq_len]
first_seq_obs = _obs[0:first_seq_len]
seq_obs.append(first_seq_obs)

@@ -106,13 +110,13 @@
seq_obs = []
for _ in range(self.policy.sequence_length):
all_next_memories.append(ModelUtils.to_numpy(_mem.squeeze()))
+            start = seq_num * self.policy.sequence_length - (
+                self.policy.sequence_length - leftover
+            )
+            end = (seq_num + 1) * self.policy.sequence_length - (
+                self.policy.sequence_length - leftover
+            )
             for _obs in tensor_obs:
-                start = seq_num * self.policy.sequence_length - (
-                    self.policy.sequence_length - leftover
-                )
-                end = (seq_num + 1) * self.policy.sequence_length - (
-                    self.policy.sequence_length - leftover
-                )
seq_obs.append(_obs[start:end])
values, _mem = self.critic.critic_pass(
seq_obs, _mem, sequence_length=self.policy.sequence_length
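The `_evaluate_by_sequence` hunk does two things: it computes the possibly truncated first sequence length in one expression, and it hoists the `start`/`end` arithmetic out of the per-observation loop, since those bounds depend only on the sequence index. A standalone sketch of the slicing scheme (illustrative names; assumes a trajectory of `total_steps` split into LSTM sequences of `seq_len`, with the remainder placed first, as in the diff):

```python
def sequence_bounds(total_steps: int, seq_len: int):
    """Return (start, end) slices: a possibly truncated first sequence
    covering total_steps % seq_len steps, then full-length sequences."""
    leftover = total_steps % seq_len
    first_seq_len = leftover if leftover > 0 else seq_len
    bounds = [(0, first_seq_len)]
    # Full sequences after the truncated one; when leftover > 0 every
    # later slice is shifted back by (seq_len - leftover), mirroring
    # the start/end arithmetic in the diff.
    offset = seq_len - leftover if leftover > 0 else 0
    num_full = (total_steps - first_seq_len) // seq_len
    for seq_num in range(1, num_full + 1):
        start = seq_num * seq_len - offset
        end = (seq_num + 1) * seq_len - offset
        bounds.append((start, end))
    return bounds
```

For example, 10 steps with `seq_len=4` yield slices `(0, 2)`, `(2, 6)`, `(6, 10)`: the 2-step remainder runs through the LSTM first, so every later sequence starts on a clean memory boundary.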