
Correct data format for fine-tuning RUGPT3 models #109

Open
Futyn-Maker opened this issue Mar 19, 2023 · 0 comments

Hello!

I'm learning how to fine-tune RuGPT3 models on my own dataset so that they generate similar texts. Is there any documentation that describes the correct dataset format and lists the special tokens?

My specific questions are as follows:

  1. My dataset contains both single-line and multi-line samples. How should I separate samples from one another, given that a newline seems to be treated as the separator by default?
  2. All the texts in my corpus are of the same type (say, jokes, which cannot be grouped under broader topics), and I want to generate new texts without a specific input, i.e. without supplying the beginning of a text. Should I prefix each sample with a keyword such as "Анекдот" ("Joke"), e.g. "Анекдот: <text_1>", and then use that keyword as the prompt? If so, does that keyword need a special token?
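
To make both questions concrete, here is a minimal sketch of the layout I have in mind. It is only an illustration: the `<s>` and `</s>` markers, the `Анекдот: ` prefix, and the helper names `build_corpus` and `split_corpus` are my own assumptions, and whether the RuGPT3 tokenizer treats any of these strings as special tokens is exactly what I'm unsure about.

```python
# Hypothetical markers: it is an assumption that these strings are (or should be)
# special tokens for RuGPT3; they are used here only to show the intended layout.
BOS = "<s>"
EOS = "</s>"
PREFIX = "Анекдот: "  # hypothetical keyword that would later serve as the prompt


def build_corpus(samples, prefix=PREFIX, bos=BOS, eos=EOS):
    """Join samples into one training string, wrapping each sample in markers."""
    blocks = [f"{bos}{prefix}{text.strip()}{eos}" for text in samples]
    return "\n".join(blocks)


def split_corpus(corpus, prefix=PREFIX, bos=BOS, eos=EOS):
    """Recover the samples by splitting on the end marker, not on newlines."""
    samples = []
    for chunk in corpus.split(eos):
        chunk = chunk.strip()
        if not chunk:
            continue
        if chunk.startswith(bos):
            chunk = chunk[len(bos):]
        if chunk.startswith(prefix):
            chunk = chunk[len(prefix):]
        samples.append(chunk)
    return samples


samples = ["one-line joke", "first line\nsecond line"]
corpus = build_corpus(samples)
# The multi-line sample survives the round trip because newlines are not separators.
assert split_corpus(corpus) == samples
```

With this layout, sample boundaries are carried by the end marker rather than by newlines, and the prompt at generation time would presumably be the start marker followed by the keyword.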

I would be grateful for any information on the data format.
