Correct data format for fine-tuning RUGPT3 models #109

Futyn-Maker · 2023-03-19T20:08:59Z

Hello!

I'm learning how to fine-tune RuGPT3 models on my own dataset to generate similar texts. Is there any documentation that describes the correct dataset format and lists the special tokens?

My specific questions are as follows:

1. My dataset contains both single-line and multiline samples. How should I separate samples from each other, given that a newline seems to be treated as the separator by default?
2. All the texts in my corpus are of the same type (say, jokes, though they cannot be grouped under broader topics), and I want to generate new text without any specific input, i.e. I don't intend to supply the beginning of a text. Should I prefix each sample in the dataset with a keyword such as "Анекдот" (Russian for "joke"), e.g. "Анекдот: <text_1>", and then use that keyword as the prompt? If so, do I need a special token for that word?
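
To make the two questions concrete, here is a minimal sketch of the layout I have in mind. It is only an illustration under assumptions: `build_corpus` is a hypothetical helper, and the `</s>` separator is a placeholder I picked for the example, not a confirmed RuGPT3 special token. Each sample gets the keyword prefix and an explicit end marker, so newlines inside a multiline sample no longer have to act as the boundary.

```python
from typing import Iterable

PREFIX = "Анекдот: "   # keyword from the question, later reused as the prompt
SEPARATOR = "</s>"     # ASSUMPTION: placeholder end-of-sample marker, not a confirmed token


def build_corpus(samples: Iterable[str],
                 prefix: str = PREFIX,
                 separator: str = SEPARATOR) -> str:
    """Join samples into one training text, keeping internal newlines intact."""
    blocks = []
    for sample in samples:
        text = sample.strip()
        if not text:
            continue  # skip empty samples
        # Prefix marks the text type; separator marks where the sample ends.
        blocks.append(f"{prefix}{text}{separator}")
    return "\n".join(blocks)


samples = ["One-line joke.", "First line.\nSecond line."]
corpus = build_corpus(samples)
print(corpus)
```

With this layout, generation would start from the prompt "Анекдот: " and stop at the first separator. Whether the fine-tuning script expects exactly this format is the thing I'm asking about.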

I would be grateful for any information on the data format.
