WanJuan 2.0 (WanJuan-CC)

English🌎|简体中文🀄

WanJuan 2.0 (WanJuan-CC)

WanJuan2.0 (WanJuan-CC)(It will be hereafter referred to as "WanJuan-CC") is a high-quality English web text dataset efficiently extracted from CommonCrawl, consisting of 1.0T tokens. The results show that compared with various open-source English CC corpora evaluated on different dimensions of the Perspective API, WanJuan-CC exhibits higher safety. Moreover, its practicality is demonstrated through perplexity (PPL) on four validation sets and accuracy on six downstream tasks. WanJuan-CC shows competitive PPL performance on various validation sets, especially on sets like tiny-storys that require higher language fluency. By comparing with similar datasets in 1B model training, using the perplexity of the validation dataset and the accuracy of downstream tasks as evaluation indicators, it has been experimentally proven that WanJuan-CC significantly improves the performance of English text completion and general English ability tasks.

Data Processing Method

Brief Description of the Process

In the preparation phase of the WanJuan-CC dataset, the OpenDataLab team built a high-performance distributed data processing framework for data processing. The specific data processing flow is shown in the figure:

WanJuan-CC data processing flow

Extract text from the WARC format data of Common Crawl to obtain "Raw data".
Filter the raw data through heuristic rules to generate "Clean data".
Process the clean data using an LSH-based deduplication method to get "Dedup data".
Filter the dedup data using a filtering method based on keyword and domain lists, as well as a harmful content classifier and obscene content classifier based on Bert, to produce "Safe data".
Further filter the safe data using a Bert-based ad classifier and fluency classifier to get "High-Quality data".

Data Retention Rate at Each Step

We have counted the number of documents at each step and plotted the relative removal rate and absolute retention rate of each data processing step, as shown in the figure:

WanJuan-CC data processing flow relative removal rate and absolute retention rate

Data Information

Basic Information

Data modality: Pure text data;
Main language: English;
Data volume: Approximately 100B Tokens;
Data format: Corpus text and additional information stored in Jsonlines format.

Data Example

{
    "id": "Bk3aTI85qrqB0ZWjQ26l",
    "content": "This year I have been overwhelmed by moving house. The planning and organising has dominated my life: packing and unpacking boxes, trying to keep the family fed and watered, being there 100% for my children to settle in. My life - my aims and goals - have taken second place to the family chaos.\nYesterday, for the first time in ages, I sat down and thought: I want to write my book. I want to get this back into my time schedule. I want to make this space for me, for expressing myself, for being creative.\nSo what happens as of 3.15pm tomorrow? School holidays. Am I really likely to get a moment's peace?\nNow I'm planning: take the laptop on holiday. Work out how to blog from mobile (have to confess this is highly unlikely to work!) Consider booking children into holiday camps for entire 5 weeks. (Would Grannie like them?) Insist on a couple of hours a day, uninterrupted, simply to write.\nOr ... accept the reality. Enjoy the children while they are still prepared to tolerate me. Plan for a solid routine when term starts in September.\nWhich do you think will win?",
    "title": "Withenay Wanders: July 2009",
    "language": "en",
    "date": "2018-06-18T18:54:52Z",
    "token_num": 244,
    "cbytes_num": 1073,
    "line_num": 6,
    "char_num": 1073,
    "toxic_score": 0.001661,
    "porn_score": 0.003416,
    "fluency_score": 0.998535,
    "not_ad_score": 0.981934
}

Data Field Format

The following table records the field names, meanings, data types, and value descriptions of each field:

Field Name	Meaning	Data Type	Value Description
id	Unique Identifier	string	Globally unique identifier
content	Text Content	string	This field is used for pre-training
title	Title	string	Web page title
language	Language	string	"en"
date	Data Date	string	"2018-06-18T18:54:52Z"
token_num	Token size of content	int	Positive integer
cbytes_num	Byte size of content	int	Positive integer
line_num	Number of lines in content, separated by \n	int	Positive integer
char_num	Character size of content	int	Positive integer
toxic_score	Toxicity Score	float	[0.0~1.0], higher indicates higher likelihood of toxicity
porn_score	Pornography Score	float	[0.0~1.0], higher indicates higher likelihood of pornography
fluency_score	Fluency Score	float	[0.0~1.0], higher indicates better fluency
not_ad_score	Non-ad Score	float	[0.0~1.0], higher indicates higher likelihood of non-ad

Yearly Distribution

We have counted the yearly distribution of document numbers, storage space, and token numbers in the WanJuan-CC dataset, as shown in the figure:

WanJuan-CC dataset yearly distribution

Data Statistics Information Distribution

We used some statistical indicators to describe some basic information of the WanJuan-CC dataset and plotted a percentage histogram as shown below:

WanJuan-CC dataset statistics information distribution

Download and Decompression

Download data through the user interface:
- Click on dataset files to view all file lists in this dataset, select the required files, and then click the "Download" button to complete the download.
Download data through CLI, PythonSDK:
- Click on CLI/SDK download to view the download code, use CLI or Python SDK to perform data download operations.

License

WanJuan-CC as a whole adopts the CC BY 4.0 license agreement. You can freely share and adapt this dataset, but you must comply with the following conditions:

Attribution: You must appropriately acknowledge the author, provide a link to this license, and indicate whether changes were made to the original dataset. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

For the full license content, please visit CC BY 4.0.

Special Notes

Please note that some subsets of this dataset may be subject to other agreement provisions. Before using a specific subset, please read the relevant agreement carefully to ensure compliant use. For more detailed agreement information, please check the relevant documents or metadata of the specific subset.

OpenDataLab, as a non-profit organization, advocates a harmonious and friendly open-source exchange environment. If you find content in the open-source dataset that infringes your legal rights, you can send an email to OpenDataLab@pjlab.org.cn. Please provide a detailed description of the infringement-related facts in the email and provide us with relevant ownership proof materials. We will initiate an investigation and handling mechanism within 3 working days and take necessary measures (such as delisting related data). However, you should ensure the authenticity of your complaint, otherwise, you should bear the adverse consequences caused by taking measures independently.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
readme_pic		readme_pic
README-zh.md		README-zh.md
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

readme_pic

readme_pic

README-zh.md

README-zh.md

README.md

README.md

Repository files navigation

WanJuan 2.0 (WanJuan-CC)