Scenario
Currently, I have run into an issue where I am unable to generalize the length of my observations (a 1-dimensional Box) and actions (Discrete), because my environment changes in size according to the data used to populate it. For context, I am trying to solve a slotting optimization problem in warehouses, where different warehouses can have varying numbers of bins and items. As you might imagine, the actions are swaps, and the number of possible actions can differ greatly between warehouses. The length of the inputs also depends on the number of bins and items, since I am providing essential information such as the demand for each item and the distance of each bin from the packing zone.
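To make the size mismatch concrete, here is a small sketch (the function name is my own, not from any library) of how the number of swap actions grows with the number of bins:

```python
from math import comb

def num_swap_actions(n_bins: int) -> int:
    # Each unordered pair of bins is one swap action: C(n_bins, 2)
    return comb(n_bins, 2)

# Two warehouses of different sizes yield very different Discrete action spaces
print(num_swap_actions(50), num_swap_actions(500))  # 1225 124750
```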
I have looked pretty much everywhere (to no avail), but ChatGPT suggested the use of padding, which goes as follows:

1. Set the maximum length of the observation and action spaces
2. Populate the environment
3. Pad the observation and action if necessary
4. Use masking on the actions and observations so that the padded values do not affect backpropagation
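The padding and mask-construction steps above can be sketched as follows; this is a minimal illustration with hypothetical names and upper bounds, not code from any library:

```python
import numpy as np

MAX_OBS_LEN = 32   # assumed upper bound on observation length across all warehouses
MAX_ACTIONS = 10   # assumed upper bound on the number of swap actions

def pad_observation(obs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Right-pad an observation to MAX_OBS_LEN and return (padded_obs, obs_mask)."""
    padded = np.zeros(MAX_OBS_LEN, dtype=np.float32)
    padded[: len(obs)] = obs
    mask = np.zeros(MAX_OBS_LEN, dtype=bool)
    mask[: len(obs)] = True  # True marks real entries, False marks padding
    return padded, mask

def action_mask(n_valid_actions: int) -> np.ndarray:
    """True for real actions, False for padded (invalid) ones."""
    return np.arange(MAX_ACTIONS) < n_valid_actions

obs, obs_mask = pad_observation(np.array([1.0, 2.0, 3.0]))
print(obs_mask.sum(), action_mask(4).sum())  # 3 4
```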
I found the apply_masking function too:
```python
def apply_masking(self, masks: Optional[np.ndarray]) -> None:
    """
    Eliminate ("mask out") chosen categorical outcomes by setting their probability to 0.

    :param masks: An optional boolean ndarray of compatible shape with the distribution.
        If True, the corresponding choice's logit value is preserved. If False, it is set
        to a large negative value, resulting in near 0 probability. If masks is None, any
        previously applied masking is removed, and the original logits are restored.
    """
    if masks is not None:
        device = self.logits.device
        self.masks = th.as_tensor(masks, dtype=th.bool, device=device).reshape(self.logits.shape)
        HUGE_NEG = th.tensor(-1e8, dtype=self.logits.dtype, device=device)
        logits = th.where(self.masks, self._original_logits, HUGE_NEG)
    else:
        self.masks = None
        logits = self._original_logits

    # Reinitialize with updated logits
    super().__init__(logits=logits)

    # self.probs may already be cached, so we must force an update
    self.probs = logits_to_probs(self.logits)
```
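The effect of apply_masking can be reproduced outside the library with plain NumPy: substituting a large negative value for the masked logits drives their softmax probability to (numerically) zero, so masked actions can never be sampled. The helper below is my own illustration of that trick, not part of sb3_contrib:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Same trick as apply_masking: keep logits where mask is True,
    # otherwise substitute a huge negative value before the softmax.
    masked = np.where(mask, logits, -1e8)
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()

logits = np.array([1.0, 2.0, 3.0])
mask = np.array([True, False, True])
probs = masked_softmax(logits, mask)
print(probs[1])  # 0.0: the masked action can never be sampled
```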
Edit: I found this in the sb3_contrib/common/maskable/policies.py file. Can the apply_masking() method be applied to the features produced by the self.extract_features() method?
```python
def forward(
    self,
    obs: th.Tensor,
    deterministic: bool = False,
    action_masks: Optional[np.ndarray] = None,
) -> Tuple[th.Tensor, th.Tensor, th.Tensor]:
    """
    Forward pass in all the networks (actor and critic)

    :param obs: Observation
    :param deterministic: Whether to sample or use deterministic actions
    :param action_masks: Action masks to apply to the action distribution
    :return: action, value and log probability of the action
    """
    # Preprocess the observation if needed
    features = self.extract_features(obs)
    if self.share_features_extractor:
        latent_pi, latent_vf = self.mlp_extractor(features)
    else:
        pi_features, vf_features = features
        latent_pi = self.mlp_extractor.forward_actor(pi_features)
        latent_vf = self.mlp_extractor.forward_critic(vf_features)
    # Evaluate the values for the given observations
    values = self.value_net(latent_vf)
    distribution = self._get_action_dist_from_latent(latent_pi)
    if action_masks is not None:
        distribution.apply_masking(action_masks)
    actions = distribution.get_actions(deterministic=deterministic)
    log_prob = distribution.log_prob(actions)
    return actions, values, log_prob
```
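For the observation side there is no built-in analogue in sb3_contrib, but the intuition behind "masking the inputs" can be sketched independently: if padded entries are zeroed before they enter the network, those positions contribute nothing to a linear layer's output (and hence nothing to its gradients). This is a toy NumPy illustration under that assumption, not sb3 code:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))          # a toy "first layer" weight matrix

obs = np.array([1.0, 2.0, 3.0])      # real observation (length 3)
padded = np.zeros(8)
padded[:3] = obs                     # right-pad to the fixed length 8

obs_mask = np.arange(8) < 3          # True for real entries
masked = np.where(obs_mask, padded, 0.0)

# Zeroed positions contribute nothing to the matrix product,
# so the padding cannot influence the computed features.
out_full = masked @ W
out_real = obs @ W[:3]
print(np.allclose(out_full, out_real))  # True
```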
Question/Request
This brought my attention to MaskablePPO, where the masking of actions has already been implemented. As such, I would like to ask for advice on how to modify the algorithm to mask the inputs as well. Please provide filepaths to the respective source code for the modifications too! Thank you :)
Remarks
If anyone has encountered this problem and solved it, could you share your insights as to how you did it? If anyone has an idea for generalizing the inputs without the need for masking, please do comment away!
Checklist
I have checked that there is no similar issue in the repo