
Problems with MaskablePPO #195

Open
5 tasks done
koliber31 opened this issue Jul 12, 2023 · 16 comments
Labels
custom gym env Issue related to Custom Gym Env

Comments

@koliber31

🐛 Bug

Hi,
I had problems with MaskablePPO, which I described in DLR-RM/stable-baselines3#1596. I thought I had found a solution in one of the issues, #81 (comment). The problem is that the error stopped occurring, but at the same time the agent lost its ability to learn. Below are screenshots of mean rewards: the 150k and 260k timestep runs are the case with the error, and the 4M timestep run is the case without the error.

[Screenshots: mean-reward curves at 150k and 260k timesteps (with the error) and at 4M timesteps (without the error).]

Unfortunately I don't have screenshots of a run where the agent managed to reach a mean reward of ~-0.75 before the error.

Code example

The only thing I changed in the code since the last issue is the solution from #81:

    # Reinitialize with updated logits
    super().__init__(logits=logits, validate_args=False)

    # self.probs may already be cached, so we must force an update
    self.probs = logits_to_probs(self.logits)
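
For context, a minimal sketch of roughly where this reinitialization sits, assuming a masked Categorical subclass along the lines of sb3-contrib's MaskableCategorical.apply_masking (an illustration, not the exact library source):

    # Illustrative only: the names mirror sb3-contrib's MaskableCategorical, but
    # the body is a simplified reconstruction, not the library source.
    import torch
    from torch.distributions import Categorical
    from torch.distributions.utils import logits_to_probs


    class MaskedCategoricalSketch(Categorical):
        def apply_masking(self, masks):
            if masks is not None:
                masks = torch.as_tensor(masks, dtype=torch.bool, device=self.logits.device)
                masks = masks.reshape(self.logits.shape)
                # Give invalid actions a huge negative logit so their probability is ~0
                neg_inf = torch.tensor(-1e8, device=self.logits.device)
                logits = torch.where(masks, self.logits, neg_inf)
            else:
                logits = self.logits
            # Reinitialize with updated logits; validate_args=False skips the
            # simplex validation that raised the original error
            super().__init__(logits=logits, validate_args=False)
            # self.probs may already be cached, so we must force an update
            self.probs = logits_to_probs(self.logits)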

Relevant log output / Error message

No response

System Info

No response

Checklist

@koliber31 koliber31 added the "custom gym env" (Issue related to Custom Gym Env) label Jul 12, 2023
@yiptsangkin

I tried modifying the source code to change validate_args to False and it works.

@yiptsangkin

The dtype also causes problems sometimes.

@koliber31
Author

I tried modifying the source code to change validate_args to False and it works.

The error stops occurring in my case too, but it doesn't learn with validate_args=False.
Does it learn in your case? Could you provide a screenshot of the learning curve?
What is the dtype in your case? I tried int32 and float32 and neither helped.

@yiptsangkin

Same problem; I think this needs some tech support to solve.

@yiptsangkin

pytorch/pytorch#87468

@koliber31
Author

pytorch/pytorch#87468

Did you manage to solve the problem using the solution from this issue?

@yiptsangkin

Yes, at least my env works!

@yiptsangkin

pytorch/pytorch#87468

Did you manage to solve the problem using the solution from this issue?

    class _Simplex(Constraint):
        ...
        def check(self, value):
            # return torch.all(value >= 0, dim=-1) & ((value.sum(-1) - 1).abs() < 1e-6)
            tol = torch.finfo(value.dtype).eps * 10 * value.size(-1) ** 0.5
            return torch.all(value >= 0, dim=-1) & ((value.sum(-1) - 1).abs() < tol)

@koliber31
Author

koliber31 commented Jul 22, 2023

It looks like this in my case

class _Simplex(Constraint):
    """
    Constrain to the unit simplex in the innermost (rightmost) dimension.
    Specifically: `x >= 0` and `x.sum(-1) == 1`.
    """
    event_dim = 1

    def check(self, value):
        # return torch.all(value >= 0, dim=-1) & ((value.sum(-1) - 1).abs() < 1e-6)
        tol = torch.finfo(value.dtype).eps * 10 * value.size(-1) ** 0.5
        return torch.all(value >= 0, dim=-1) & ((value.sum(-1) - 1).abs() < tol)

And my mask doesn't work after this change. The agent keeps making invalid moves.
Edit:
It keeps making invalid moves even without this change, so I must have made some mistake in between :(. Does this constraint look like this in your case, and does it learn?

@yiptsangkin

It looks like this in my case [...]

This is the source code of torch; you can see the answer in pytorch/pytorch#87468 (comment). Edit the source code and it works. My env learns without errors and the performance is the same as before this change.
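
Roughly, the difference between the old fixed 1e-6 tolerance and the dtype-aware one can be checked like this (purely illustrative; the action count and mask pattern below are made up):

    import torch

    n_actions = 500
    logits = torch.randn(n_actions, dtype=torch.float32)
    mask = torch.zeros(n_actions, dtype=torch.bool)
    mask[:10] = True                                  # only 10 actions are valid
    masked_logits = torch.where(mask, logits, torch.tensor(-1e8))
    probs = torch.softmax(masked_logits, dim=-1)

    # How far the probabilities are from summing to exactly 1
    err = (probs.sum(-1) - 1).abs()
    old_tol = 1e-6
    new_tol = torch.finfo(probs.dtype).eps * 10 * probs.size(-1) ** 0.5
    print(f"sum error = {err.item():.2e}, old tol = {old_tol:.1e}, new tol = {new_tol:.1e}")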

@yiptsangkin

It looks like this in my case [...]

If the agent takes an invalid action, maybe you should check your action mask.

@koliber31
Author

I did change the source code and this is what it looks like after the change. You are right, my action mask had an error. Now it seems to be learning, but the reward is increasing slowly. And I still have the problem that the mean episode length is increasing, even though I know it can win fast in some cases. I'll try to give feedback on whether it managed to learn.
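
For reference, a minimal sketch of how the mask can be wired up, assuming sb3-contrib's MaskablePPO and ActionMasker API (the toy env and mask_fn here are made up for illustration, not my actual env):

    # Illustrative toy setup, assuming sb3-contrib's MaskablePPO and ActionMasker;
    # the environment and mask function are invented for the example.
    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces
    from sb3_contrib import MaskablePPO
    from sb3_contrib.common.wrappers import ActionMasker


    class ToyEnv(gym.Env):
        def __init__(self):
            self.action_space = spaces.Discrete(9)
            self.observation_space = spaces.Box(0.0, 1.0, shape=(9,), dtype=np.float32)
            self.board = np.zeros(9, dtype=np.float32)

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.board[:] = 0.0
            return self.board.copy(), {}

        def step(self, action):
            self.board[action] = 1.0
            terminated = bool(self.board.all())
            reward = 1.0 if terminated else 0.0  # sparse terminal reward
            return self.board.copy(), reward, terminated, False, {}


    def mask_fn(env) -> np.ndarray:
        # True for actions that are still legal; MaskablePPO only samples from these
        return env.board == 0.0


    env = ActionMasker(ToyEnv(), mask_fn)
    model = MaskablePPO("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=1_000)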

@yiptsangkin

I did change the source code and this is what it looks like after the change. [...]

If the reward is increasing slowly, maybe the problem is in the reward function.

@koliber31
Author

koliber31 commented Jul 22, 2023

I did change the source code and this is what it looks like after the change. [...]

If the reward is increasing slowly, maybe the problem is in the reward function.

My rewards are just:
win = 1
lose = -1
draw = 0.25
Does your changed source code look like mine? And did you change the class _Symmetric too, or just _Simplex?
Edit:
It still doesn't learn. The learning curves look exactly like the ones with validate_args=False.

[Screenshot: learning curves with the patched _Simplex check, matching the validate_args=False curves.]

@yiptsangkin

Does your changed source code look like mine? And did you change the class _Symmetric too, or just _Simplex? [...]

Just _Simplex. My learning curves look fine.
