Optimize dataset #199

Open
cboulanger opened this issue Oct 24, 2022 · 0 comments

Comments

@cboulanger
Contributor

In your experience, performance does not improve when adding more than a certain number of sequences (around 1500-2000, if I remember correctly). I have also experienced segmentation faults from Wapiti when I nonetheless tried training with more, although that might have been related to structural issues in the data (such as leading or trailing <note> fields).

This means that just throwing more and more training data at the algorithm is not the smart approach. Instead, the task would be to select those sequences that carry the highest information entropy and to toss out those that merely repeat what has already been learned.

I am not a CS person, so I am not in a good position to figure out how to do this, but maybe someone has an idea of how this could be implemented with AnyStyle.
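To make the idea concrete, here is a minimal sketch (not AnyStyle code; all names are hypothetical) of one possible selection heuristic in Python: rank sequences by the Shannon entropy of their token distribution, then greedily keep only sequences that contribute tokens not already covered by the selection, so exact or near repeats are dropped before training.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in one sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_informative(sequences, budget):
    """Greedily pick up to `budget` sequences, preferring diverse ones.

    Sequences are considered in order of decreasing entropy; a sequence
    is kept only if it contributes at least one token not yet covered,
    so sequences that merely repeat known material are tossed out.
    """
    selected, covered = [], set()
    for seq in sorted(sequences, key=token_entropy, reverse=True):
        novel = set(seq) - covered
        if novel:
            selected.append(seq)
            covered.update(novel)
        if len(selected) >= budget:
            break
    return selected

# Tiny illustration: the exact duplicate adds no new tokens and is skipped.
corpus = [
    ["Smith", ",", "J.", "(", "2001", ")"],
    ["Smith", ",", "J.", "(", "2001", ")"],
    ["Doe", "and", "Roe", "2019", "Nature"],
]
print(len(select_informative(corpus, budget=3)))  # 2
```

A real implementation would likely score label uncertainty under the current model (e.g. Wapiti's posterior probabilities) rather than raw token entropy, but the greedy "keep only what adds information" loop would look much the same.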
