Optimize dataset #199

Open
cboulanger opened this issue Oct 24, 2022 · 0 comments

Comments

@cboulanger
Contributor

In your experience, performance does not improve when adding more than a certain number of sequences (around 1500-2000, if I remember correctly). I have also experienced segmentation faults from Wapiti when I nonetheless tried training with more, although that might have been related to structural issues in the data (such as leading or trailing <note> fields).

This means that just throwing more and more training data at the algorithm is not the smart approach. Instead, the task would be to select those sequences that carry the highest information entropy and to toss out those that merely repeat what has already been learned.

I am not a CS person, so I am not in a good position to figure out how to do this, but maybe someone has an idea of how this could be implemented with AnyStyle.
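To make the idea concrete, here is a minimal sketch (not AnyStyle code; all names are hypothetical) of one possible selection heuristic in Python: rank sequences by the Shannon entropy of their token distribution, then greedily keep only sequences that contribute tokens not already covered by the selection, so exact or near repeats are dropped before training.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in one sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_informative(sequences, budget):
    """Greedily pick up to `budget` sequences, preferring diverse ones.

    Sequences are considered in order of decreasing entropy; a sequence
    is kept only if it contributes at least one token not yet covered,
    so sequences that merely repeat known material are tossed out.
    """
    selected, covered = [], set()
    for seq in sorted(sequences, key=token_entropy, reverse=True):
        novel = set(seq) - covered
        if novel:
            selected.append(seq)
            covered.update(novel)
        if len(selected) >= budget:
            break
    return selected

# Tiny illustration: the exact duplicate adds no new tokens and is skipped.
corpus = [
    ["Smith", ",", "J.", "(", "2001", ")"],
    ["Smith", ",", "J.", "(", "2001", ")"],
    ["Doe", "and", "Roe", "2019", "Nature"],
]
print(len(select_informative(corpus, budget=3)))  # 2
```

A real implementation would likely score label uncertainty under the current model (e.g. Wapiti's posterior probabilities) rather than raw token entropy, but the greedy "keep only what adds information" loop would look much the same.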
