
Commit ad2680e: Set ignore done=False in GAIL (#4971)

1 parent: d2c5697

17 files changed: +116, -59 lines

com.unity.ml-agents/CHANGELOG.md (+2, -2)

@@ -14,11 +14,11 @@ and this project adheres to
 ### Minor Changes
 #### com.unity.ml-agents / com.unity.ml-agents.extensions (C#)
 #### ml-agents / ml-agents-envs / gym-unity (Python)
-
+- The `encoding_size` setting for RewardSignals has been deprecated. Please use `network_settings` instead. (#4982)
 ### Bug Fixes
 #### com.unity.ml-agents (C#)
 #### ml-agents / ml-agents-envs / gym-unity (Python)
-
+- An issue that caused `GAIL` to fail for environments where agents can terminate episodes by self-sacrifice has been fixed. (#4971)

 ## [1.8.0-preview] - 2021-02-17
 ### Major Changes

config/imitation/CrawlerStatic.yaml (+5, -1)

@@ -19,7 +19,11 @@ behaviors:
       gail:
         gamma: 0.99
         strength: 1.0
-        encoding_size: 128
+        network_settings:
+          normalize: true
+          hidden_units: 128
+          num_layers: 2
+          vis_encode_type: simple
         learning_rate: 0.0003
         use_actions: false
         use_vail: false

config/imitation/FoodCollector.yaml (+5, -1)

@@ -19,7 +19,11 @@ behaviors:
       gail:
         gamma: 0.99
         strength: 0.1
-        encoding_size: 128
+        network_settings:
+          normalize: false
+          hidden_units: 128
+          num_layers: 2
+          vis_encode_type: simple
         learning_rate: 0.0003
         use_actions: false
         use_vail: false

config/imitation/Hallway.yaml (+1, -2)

@@ -24,8 +24,7 @@ behaviors:
         strength: 1.0
       gail:
         gamma: 0.99
-        strength: 0.1
-        encoding_size: 128
+        strength: 0.01
         learning_rate: 0.0003
         use_actions: false
         use_vail: false

config/imitation/PushBlock.yaml (+15, -3)

@@ -16,16 +16,28 @@ behaviors:
       num_layers: 2
       vis_encode_type: simple
     reward_signals:
-      gail:
+      extrinsic:
         gamma: 0.99
         strength: 1.0
-        encoding_size: 128
+      gail:
+        gamma: 0.99
+        strength: 0.01
+        network_settings:
+          normalize: false
+          hidden_units: 128
+          num_layers: 2
+          vis_encode_type: simple
         learning_rate: 0.0003
         use_actions: false
         use_vail: false
         demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
     keep_checkpoints: 5
-    max_steps: 15000000
+    max_steps: 1000000
     time_horizon: 64
     summary_freq: 60000
     threaded: true
+    behavioral_cloning:
+      demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
+      steps: 50000
+      strength: 1.0
+      samples_per_update: 0

config/imitation/Pyramids.yaml (+2, -2)

@@ -22,11 +22,11 @@ behaviors:
       curiosity:
         strength: 0.02
         gamma: 0.99
-        encoding_size: 256
+        network_settings:
+          hidden_units: 256
       gail:
         strength: 0.01
         gamma: 0.99
-        encoding_size: 128
         demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
     behavioral_cloning:
       demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo

config/ppo/Pyramids.yaml (+2, -1)

@@ -22,7 +22,8 @@ behaviors:
       curiosity:
         gamma: 0.99
         strength: 0.02
-        encoding_size: 256
+        network_settings:
+          hidden_units: 256
         learning_rate: 0.0003
     keep_checkpoints: 5
     max_steps: 10000000

config/ppo/PyramidsRND.yaml (+2, -2)

@@ -22,11 +22,11 @@ behaviors:
       rnd:
         gamma: 0.99
         strength: 0.01
-        encoding_size: 64
+        network_settings:
+          hidden_units: 64
         learning_rate: 0.0001
     keep_checkpoints: 5
     max_steps: 3000000
     time_horizon: 128
     summary_freq: 30000
-    framework: pytorch
     threaded: true

config/ppo/VisualPyramids.yaml (+2, -1)

@@ -22,7 +22,8 @@ behaviors:
       curiosity:
         gamma: 0.99
         strength: 0.01
-        encoding_size: 256
+        network_settings:
+          hidden_units: 256
         learning_rate: 0.0003
     keep_checkpoints: 5
     max_steps: 10000000

config/sac/Pyramids.yaml (-1)

@@ -24,7 +24,6 @@ behaviors:
       gail:
         gamma: 0.99
         strength: 0.01
-        encoding_size: 128
         learning_rate: 0.0003
         use_actions: true
         use_vail: false

config/sac/VisualPyramids.yaml (-1)

@@ -24,7 +24,6 @@ behaviors:
       gail:
         gamma: 0.99
         strength: 0.02
-        encoding_size: 128
         learning_rate: 0.0003
         use_actions: true
         use_vail: false

docs/ML-Agents-Overview.md (+17, -6)

@@ -472,12 +472,23 @@ Learning (GAIL). In most scenarios, you can combine these two features:
 - If you want to help your agents learn (especially with environments that have
   sparse rewards) using pre-recorded demonstrations, you can generally enable
   both GAIL and Behavioral Cloning at low strengths in addition to having an
-  extrinsic reward. An example of this is provided for the Pyramids example
-  environment under `PyramidsLearning` in `config/gail_config.yaml`.
-- If you want to train purely from demonstrations, GAIL and BC _without_ an
-  extrinsic reward signal is the preferred approach. An example of this is
-  provided for the Crawler example environment under `CrawlerStaticLearning` in
-  `config/gail_config.yaml`.
+  extrinsic reward. An example of this is provided for the PushBlock example
+  environment in `config/imitation/PushBlock.yaml`.
+- If you want to train purely from demonstrations with GAIL and BC _without_ an
+  extrinsic reward signal, please see the CrawlerStatic example environment
+  in `config/imitation/CrawlerStatic.yaml`.
+
+***Note:*** GAIL introduces a [_survivor bias_](https://arxiv.org/pdf/1809.02925.pdf)
+to the learning process. That is, by giving positive rewards based on similarity
+to the expert, the agent is incentivized to remain alive for as long as possible.
+This can directly conflict with goal-oriented tasks like our PushBlock or Pyramids
+example environments, where an agent must reach a goal state and thus end the
+episode as quickly as possible. In these cases, we strongly recommend that you
+use a low-strength GAIL reward signal and a sparse extrinsic reward awarded when
+the agent achieves the task. This way, the GAIL reward signal will guide the
+agent until it discovers the extrinsic signal and will not overpower it. If the
+agent appears to be ignoring the extrinsic reward signal, you should reduce
+the strength of GAIL.

 #### GAIL (Generative Adversarial Imitation Learning)
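To make the recommendation above concrete, here is a minimal sketch of a `reward_signals` block that pairs a dominant sparse extrinsic reward with a low-strength GAIL signal, modeled on the PushBlock configuration updated in this commit (the `demo_path` points at the PushBlock example demo; substitute your own):

```yaml
reward_signals:
  extrinsic:
    gamma: 0.99
    strength: 1.0    # sparse task reward stays dominant
  gail:
    gamma: 0.99
    strength: 0.01   # low strength: GAIL guides exploration without overpowering the task reward
    demo_path: Project/Assets/ML-Agents/Examples/PushBlock/Demos/ExpertPush.demo
```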

docs/Training-Configuration-File.md (+3, -3)

@@ -101,7 +101,7 @@ To enable curiosity, provide these settings:
 | :--------------------------- | :------------------------------------------------------------------------------------------------------------------------------ |
 | `curiosity -> strength` | (default = `1.0`) Magnitude of the curiosity reward generated by the intrinsic curiosity module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.1` |
 | `curiosity -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
-| `curiosity -> encoding_size` | (default = `64`) Size of the encoding used by the intrinsic curiosity model. This value should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
+| `curiosity -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs used by the intrinsic curiosity model. The value of `hidden_units` should be small enough to encourage the ICM to compress the original observation, but also not too small to prevent it from learning to differentiate between expected and actual observations. <br><br>Typical range: `64` - `256` |
 | `curiosity -> learning_rate` | (default = `3e-4`) Learning rate used to update the intrinsic curiosity module. This should typically be decreased if training is unstable, and the curiosity loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |

 ### GAIL Intrinsic Reward

@@ -114,7 +114,7 @@ settings:
 | `gail -> strength` | (default = `1.0`) Factor by which to multiply the raw reward. Note that when using GAIL with an Extrinsic Signal, this value should be set lower if your demonstrations are suboptimal (e.g. from a human), so that a trained agent will focus on receiving extrinsic rewards instead of exactly copying the demonstrations. Keep the strength below about 0.1 in those cases. <br><br>Typical range: `0.01` - `1.0` |
 | `gail -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.9` |
 | `gail -> demo_path` | (Required, no default) The path to your .demo file or directory of .demo files. |
-| `gail -> encoding_size` | (default = `64`) Size of the hidden layer used by the discriminator. This value should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
+| `gail -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs for the GAIL discriminator. The value of `hidden_units` should be small enough to encourage the discriminator to compress the original observation, but also not too small to prevent it from learning to differentiate between demonstrated and actual behavior. Dramatically increasing this size will also negatively affect training times. <br><br>Typical range: `64` - `256` |
 | `gail -> learning_rate` | (Optional, default = `3e-4`) Learning rate used to update the discriminator. This should typically be decreased if training is unstable, and the GAIL loss is unstable. <br><br>Typical range: `1e-5` - `1e-3` |
 | `gail -> use_actions` | (default = `false`) Determines whether the discriminator should discriminate based on both observations and actions, or just observations. Set to True if you want the agent to mimic the actions from the demonstrations, and False if you'd rather have the agent visit the same states as in the demonstrations but with possibly different actions. Setting to False is more likely to be stable, especially with imperfect demonstrations, but may learn slower. |
 | `gail -> use_vail` | (default = `false`) Enables a variational bottleneck within the GAIL discriminator. This forces the discriminator to learn a more general representation and reduces its tendency to be "too good" at discriminating, making learning more stable. However, it does increase training time. Enable this if you notice your imitation learning is unstable, or unable to learn the task at hand. |

@@ -128,7 +128,7 @@ To enable RND, provide these settings:
 | :--------------------------- | :------------------------------------------------------------------------------------------------------------------------------ |
 | `rnd -> strength` | (default = `1.0`) Magnitude of the curiosity reward generated by the intrinsic rnd module. This should be scaled in order to ensure it is large enough to not be overwhelmed by extrinsic reward signals in the environment. Likewise it should not be too large to overwhelm the extrinsic reward signal. <br><br>Typical range: `0.001` - `0.01` |
 | `rnd -> gamma` | (default = `0.99`) Discount factor for future rewards. <br><br>Typical range: `0.8` - `0.995` |
-| `rnd -> encoding_size` | (default = `64`) Size of the encoding used by the intrinsic RND model. <br><br>Typical range: `64` - `256` |
+| `rnd -> network_settings` | Please see the documentation for `network_settings` under [Common Trainer Configurations](#common-trainer-configurations). The network specs for the RND model. |
 | `curiosity -> learning_rate` | (default = `3e-4`) Learning rate used to update the RND module. This should be large enough for the RND module to quickly learn the state representation, but small enough to allow for stable learning. <br><br>Typical range: `1e-5` - `1e-3` |
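As a usage sketch of the new `network_settings` key described above (values borrowed from the Pyramids configs changed in this commit; any of the common `network_settings` sub-keys may be supplied, and the rest fall back to their defaults):

```yaml
reward_signals:
  curiosity:
    gamma: 0.99
    strength: 0.02
    network_settings:
      hidden_units: 256   # replaces the deprecated `encoding_size: 256`
```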

134134

ml-agents/mlagents/trainers/settings.py (+17, -3)

@@ -183,6 +183,7 @@ def to_settings(self) -> type:
 class RewardSignalSettings:
     gamma: float = 0.99
     strength: float = 1.0
+    network_settings: NetworkSettings = attr.ib(factory=NetworkSettings)

     @staticmethod
     def structure(d: Mapping, t: type) -> Any:

@@ -198,28 +199,41 @@ def structure(d: Mapping, t: type) -> Any:
             enum_key = RewardSignalType(key)
             t = enum_key.to_settings()
             d_final[enum_key] = strict_to_cls(val, t)
+            # Checks to see if user specifying deprecated encoding_size for RewardSignals.
+            # If network_settings is not specified, this updates the default hidden_units
+            # to the value of encoding size. If specified, this ignores encoding size and
+            # uses network_settings values.
+            if "encoding_size" in val:
+                logger.warning(
+                    "'encoding_size' was deprecated for RewardSignals. Please use network_settings."
+                )
+                # If network settings was not specified, use the encoding size. Otherwise, use hidden_units
+                if "network_settings" not in val:
+                    d_final[enum_key].network_settings.hidden_units = val[
+                        "encoding_size"
+                    ]
         return d_final


 @attr.s(auto_attribs=True)
 class GAILSettings(RewardSignalSettings):
-    encoding_size: int = 64
     learning_rate: float = 3e-4
+    encoding_size: Optional[int] = None
     use_actions: bool = False
     use_vail: bool = False
     demo_path: str = attr.ib(kw_only=True)


 @attr.s(auto_attribs=True)
 class CuriositySettings(RewardSignalSettings):
-    encoding_size: int = 64
     learning_rate: float = 3e-4
+    encoding_size: Optional[int] = None


 @attr.s(auto_attribs=True)
 class RNDSettings(RewardSignalSettings):
-    encoding_size: int = 64
     learning_rate: float = 1e-4
+    encoding_size: Optional[int] = None


 # SAMPLERS #############################################################################
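The deprecation branch above means an old-style config keeps working: a snippet like the hypothetical one below still loads, logs the warning, and maps `encoding_size` onto `network_settings.hidden_units`, as long as no explicit `network_settings` block is given (if both are present, `encoding_size` is ignored):

```yaml
reward_signals:
  gail:
    gamma: 0.99
    strength: 0.01
    encoding_size: 128   # deprecated; treated as network_settings: {hidden_units: 128}
    demo_path: Project/Assets/ML-Agents/Examples/Pyramids/Demos/ExpertPyramid.demo
```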

ml-agents/mlagents/trainers/torch/components/reward_providers/curiosity_reward_provider.py (+17, -11)

@@ -9,14 +9,16 @@
 from mlagents.trainers.settings import CuriositySettings

 from mlagents_envs.base_env import BehaviorSpec
+from mlagents_envs import logging_util
 from mlagents.trainers.torch.agent_action import AgentAction
 from mlagents.trainers.torch.action_flattener import ActionFlattener
 from mlagents.trainers.torch.utils import ModelUtils
 from mlagents.trainers.torch.networks import NetworkBody
 from mlagents.trainers.torch.layers import LinearEncoder, linear_layer
-from mlagents.trainers.settings import NetworkSettings, EncoderType
 from mlagents.trainers.trajectory import ObsUtil

+logger = logging_util.get_logger(__name__)
+

 class ActionPredictionTuple(NamedTuple):
     continuous: torch.Tensor

@@ -70,21 +72,22 @@ class CuriosityNetwork(torch.nn.Module):
     def __init__(self, specs: BehaviorSpec, settings: CuriositySettings) -> None:
         super().__init__()
         self._action_spec = specs.action_spec
-        state_encoder_settings = NetworkSettings(
-            normalize=False,
-            hidden_units=settings.encoding_size,
-            num_layers=2,
-            vis_encode_type=EncoderType.SIMPLE,
-            memory=None,
-        )
+
+        state_encoder_settings = settings.network_settings
+        if state_encoder_settings.memory is not None:
+            state_encoder_settings.memory = None
+            logger.warning(
+                "memory was specified in network_settings but is not supported by Curiosity. It is being ignored."
+            )
+
         self._state_encoder = NetworkBody(
             specs.observation_specs, state_encoder_settings
         )

         self._action_flattener = ActionFlattener(self._action_spec)

         self.inverse_model_action_encoding = torch.nn.Sequential(
-            LinearEncoder(2 * settings.encoding_size, 1, 256)
+            LinearEncoder(2 * state_encoder_settings.hidden_units, 1, 256)
         )

         if self._action_spec.continuous_size > 0:

@@ -98,9 +101,12 @@ def __init__(self, specs: BehaviorSpec, settings: CuriositySettings) -> None:

         self.forward_model_next_state_prediction = torch.nn.Sequential(
             LinearEncoder(
-                settings.encoding_size + self._action_flattener.flattened_size, 1, 256
+                state_encoder_settings.hidden_units
+                + self._action_flattener.flattened_size,
+                1,
+                256,
             ),
-            linear_layer(256, settings.encoding_size),
+            linear_layer(256, state_encoder_settings.hidden_units),
         )

     def get_current_state(self, mini_batch: AgentBuffer) -> torch.Tensor:
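One consequence of reusing `network_settings` here is that a `memory` block, while valid for the policy network, is not supported by the curiosity module. A hypothetical config such as the sketch below would still load, but the memory section is dropped and the warning added above is logged (the `sequence_length`/`memory_size` values are illustrative):

```yaml
reward_signals:
  curiosity:
    gamma: 0.99
    strength: 0.02
    network_settings:
      hidden_units: 256
      memory:               # ignored by Curiosity; triggers the warning above
        sequence_length: 64
        memory_size: 128
```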
