---
license: odc-by
viewer: false
task_categories:
- text-generation
language:
- en
tags:
- language-modeling
- casual-lm
- llm
pretty_name: Dolma
size_categories:
- n>1T
---

# Dolma

<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://raw.githubusercontent.com/allenai/dolma/main/docs/assets/AI2_Blog_1400x685_2x.webp" width="100%">

Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.

More information:

- Read the Dolma **manuscript** and its **Data Sheet** [on arXiv](https://arxiv.org/abs/2402.00159).
- Explore the [**open source tools**](https://github.com/allenai/dolma) we created to curate Dolma.
- Want to request removal of personal data? Use [this form](https://forms.gle/q4BNUUxUxKwKkfdT6) to notify us of documents containing PII about a specific user.

To learn more about the toolkit used to create Dolma, including how to replicate this dataset, head over to our [GitHub project page](https://github.com/allenai/dolma/tree/main/docs)!

**2024-04-17: Dolma v1.7 Release.** We have released an updated version of Dolma that we used to train our latest [OLMo 7B-v1.7](https://huggingface.co/allenai/OLMo-7b-v1.7) model.

**2024-04-15: License Change.** We have updated the license of Dolma to [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). Please see this [blog post](https://blog.allenai.org/making-a-switch-dolma-moves-to-odc-by-8f0e73852f44) for more information.

## Versions

At the moment, there are six versions of Dolma available:

| **Version** | **Default?** | **Release Date** | **Size** (gzip) | **Description** |
|--|:--:|--|--|--|
| `v1_7` | ✅ | 2024-04-15 | 4.5 TB | Used to train [OLMo-7B-v1.7](https://huggingface.co/allenai/OLMo-7b-v1.7). New sources, more quality filtering, fuzzy deduplication. |
| `v1_6` | | 2024-01-31 | 5.4 TB | An update to v1.5 with some deduplication of documents with too few tokens or too many repeated n-grams. |
| `v1_6-sample` | | 2024-01-31 | 16.4 GB | A smaller sample of Dolma, with roughly 10 billion tokens. Useful for data exploration. |
| `v1_5` | | 2023-10-31 | 6.4 TB | Used to train [OLMo-1B](https://huggingface.co/allenai/OLMo-1B). Roughly 3 trillion tokens. |
| `v1_5-sample` | | 2023-10-31 | 2.9 TB | A sample of roughly 1.9 trillion tokens used to train [OLMo-7B](https://huggingface.co/allenai/OLMo-7B). |
| `v1` | | 2023-08-18 | 6.0 TB | The first version of Dolma. |

## Summary Statistics (v1.7)

| **Source** | **Provenance** | **New?** | **Documents** (millions) | **OLMo tokens** (billions) | **Sample Proportion** | **Cutoff Date** | **Processing** |
|--|--|--|--|--|--|--|--|
| Dolma's CC | [Common Crawl](https://commoncrawl.org/) via Dolma v1.6 | Updated | 875.2 | 1,195.5 | 50% | Mar 2023 | Extracted using the Dolma pipeline; new quality filtering and deduplication steps. |
| Refined Web | [Refined Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) | Yes | 664.0 | 456.4 | 100% | Feb 2023 | Filtered using the Dolma pipeline; new quality filtering and deduplication steps. |
| StarCoder | [StarCoder](https://huggingface.co/blog/starcoder) | Yes | 206.6 | 263.8 | 100% | May 2023 | No further processing. |
| C4 | [C4](https://huggingface.co/datasets/c4) via Dolma v1.6 | Updated | 249.9 | 138.4 | 50% | Apr 2019 | Filtered using the Dolma pipeline; new quality filtering and deduplication steps. |
| Reddit | [PushShift API](https://github.com/pushshift/api) | Updated | 377.4 | 79.9 | 100% | Mar 2023 | Extracted using the Dolma pipeline; new quality filtering and deduplication steps. |
| Semantic Scholar ([S2ORC](https://aclanthology.org/2020.acl-main.447/) & [S2AG](https://www.semanticscholar.org/product/api)) | [peS2o](https://huggingface.co/datasets/allenai/peS2o) via Dolma v1.6 | No | 38.8 | 57.2 | 100% | Mar 2023 | Same as Dolma v1.6. |
| arXiv | [RedPajama v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) | Yes | 1.5 | 28.0 | 100% | Mar 2023 | No further processing. |
| StackExchange | [RedPajama v1](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T) | Yes | 29.3 | 19.6 | 100% | Mar 2023 | No further processing. |
| Flan | [Flan Collection](https://arxiv.org/abs/2301.13688), reproduced following the [original code](https://github.com/google-research/FLAN/tree/main/flan/v2), as performed by [Dettmers et al. (2023)](https://openreview.net/forum?id=OUIFPHEgJU) | Yes | 52.1 | 16.5 | 100% | Feb 2023 | After reproducing Flan, sampled to balance different Flan subsets. Reformatted for pretraining with newlines separating instruction and demonstration. |
| CC News | [Common Crawl](https://commoncrawl.org/blog/news-dataset-available) | Yes | 22.0 | 14.3 | 100% | Mar 2023 | Extracted using the Dolma pipeline; new quality filtering and deduplication steps. |
| OpenWebMath | [OpenWebMath](https://huggingface.co/datasets/open-web-math/open-web-math) via [Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | Yes | 2.9 | 12.6 | 100% | May 2023 | Training subset; no further processing. |
| Algebraic Stack | [Proof Pile II](https://huggingface.co/datasets/EleutherAI/proof-pile-2) | Yes | 2.8 | 12.6 | 100% | Oct 2023 | Training subset; no further processing. |
| Project Gutenberg | [Project Gutenberg](https://www.gutenberg.org) via Dolma v1.6 | No | 0.0556 | 5.3 | 100% | Mar 2023 | Same as Dolma v1.6. |
| MegaWika | [MegaWika](https://huggingface.co/datasets/hltcoe/megawika) | Yes | 3.2 | 4.6 | 100% | Jul 2023 | English web pages cited from Wikipedia; curated using the full Dolma pipeline. |
| Wikipedia & Wikibooks | [Wikimedia](https://dumps.wikimedia.org) via Dolma v1.6 | No | 6.2 | 3.7 | 200% | Mar 2023 | Same as Dolma v1.6. |
| **Total** | | | **2,532.0** | **2,308.5** | **1,715.1** (billions of tokens after sampling) | **Oct 2023** | |

(Only a subset of the data was used to train OLMo 7B-v1.7. The token counts above are for the full dataset; applying each source's sampling proportion gives the number of tokens actually used for training: 1.715 trillion.)

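The relationship between the full token counts and the sampled count can be sketched as a weighted sum. The snippet below is only an illustration: the per-source values are copied from the rounded v1.7 table entries, and the names `SOURCES`, `full_tokens`, and `sampled_tokens` are hypothetical helpers, not part of the Dolma toolkit.

```python
# (OLMo tokens in billions, sampling proportion) per source,
# copied from the rounded entries of the v1.7 table.
SOURCES = {
    "Dolma's CC": (1195.5, 0.5),
    "Refined Web": (456.4, 1.0),
    "StarCoder": (263.8, 1.0),
    "C4": (138.4, 0.5),
    "Reddit": (79.9, 1.0),
    "Semantic Scholar": (57.2, 1.0),
    "arXiv": (28.0, 1.0),
    "StackExchange": (19.6, 1.0),
    "Flan": (16.5, 1.0),
    "CC News": (14.3, 1.0),
    "OpenWebMath": (12.6, 1.0),
    "Algebraic Stack": (12.6, 1.0),
    "Project Gutenberg": (5.3, 1.0),
    "MegaWika": (4.6, 1.0),
    "Wikipedia & Wikibooks": (3.7, 2.0),
}


def full_tokens(sources):
    """Sum of token counts over the full dataset (billions)."""
    return sum(tokens for tokens, _ in sources.values())


def sampled_tokens(sources):
    """Sum of token counts weighted by each source's sampling proportion (billions)."""
    return sum(tokens * proportion for tokens, proportion in sources.values())


print(f"full dataset:   {full_tokens(SOURCES):,.0f}B tokens")
print(f"after sampling: {sampled_tokens(SOURCES):,.0f}B tokens")
```

With these rounded entries the full total comes to about 2,308 billion tokens and the weighted total to roughly 1,645 billion, which is somewhat below the reported 1,715.1 billion. Treat the sketch as an illustration of the method rather than a reproduction of the official figure.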
## Summary Statistics (v1.6)

| **Source** | **Doc Type** | **UTF-8 bytes** (GB) | **Documents** (millions) | **Unicode words** (billions) | **Llama tokens** (billions) |
|--|--|--|--|--|--|
| Common Crawl | web pages | 9,022 | 3,370 | 1,775 | 2,281 |
| The Stack | code | 1,043 | 210 | 260 | 411 |
| C4 | web pages | 790 | 364 | 153 | 198 |
| Reddit | social media | 339 | 377 | 72 | 89 |
| peS2o | STEM papers | 268 | 38.8 | 50 | 70 |
| Project Gutenberg | books | 20.4 | 0.056 | 4.0 | 6.0 |
| Wikipedia, Wikibooks | encyclopedic | 16.2 | 6.2 | 3.7 | 4.3 |
| **Total** | | **11,519** | **4,367** | **2,318** | **3,059** |

## Download

The fastest way to download Dolma is to clone this repository and use the files in the `urls` directory.
We recommend running `wget` in parallel to download the files. For example:

```bash
DATA_DIR="<path_to_your_data_directory>"
PARALLEL_DOWNLOADS="<number_of_parallel_downloads>"
DOLMA_VERSION="<version_of_dolma_to_download>"

# Clone this repository to get the URL lists, then create the target directory.
git clone https://huggingface.co/datasets/allenai/dolma
mkdir -p "${DATA_DIR}"

# Download one URL per wget process, running up to PARALLEL_DOWNLOADS processes at a time.
cat "dolma/urls/${DOLMA_VERSION}.txt" | xargs -n 1 -P "${PARALLEL_DOWNLOADS}" wget -q -P "${DATA_DIR}"
```

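If only part of a version is needed, the URL list can be filtered before it is handed to `wget`. The following is a minimal sketch, not part of the Dolma tooling: `select_urls` and `write_urls` are hypothetical helpers, and it assumes that the URL list contains one URL per line and that the source name of interest (for example `reddit`) appears somewhere in the matching URLs.

```python
from pathlib import Path


def select_urls(url_file, keyword=None):
    """Return the non-empty lines of url_file, optionally keeping only URLs containing keyword."""
    lines = Path(url_file).read_text().splitlines()
    urls = [line.strip() for line in lines if line.strip()]
    if keyword is not None:
        urls = [url for url in urls if keyword in url]
    return urls


def write_urls(urls, out_file):
    """Write one URL per line to out_file and return how many were written."""
    Path(out_file).write_text("".join(f"{url}\n" for url in urls))
    return len(urls)
```

The filtered file can then be passed to the same `xargs ... wget` command in place of the full `${DOLMA_VERSION}.txt` list.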
Then, to load this data using Hugging Face's `datasets` library, you can use the following code:

```python
import os
from datasets import load_dataset

# Point DATA_DIR at the directory that holds the downloaded files.
os.environ["DATA_DIR"] = "<path_to_your_data_directory>"
dataset = load_dataset("allenai/dolma", split="train")
```

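For a quick look at the downloaded files without going through `datasets`, they can also be read directly. This is a sketch under stated assumptions rather than a documented interface: it assumes each downloaded file is gzip-compressed JSON Lines (one JSON object per line) with a name ending in `.json.gz`, and `iter_documents` is a hypothetical helper name.

```python
import gzip
import json
from pathlib import Path


def iter_documents(data_dir, pattern="*.json.gz"):
    """Yield one dict per non-empty JSON line from every matching gzip file under data_dir."""
    for path in sorted(Path(data_dir).rglob(pattern)):
        # Text mode lets us iterate line by line without loading the whole file.
        with gzip.open(path, mode="rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)
```

Because the function is a generator, documents are decompressed and parsed lazily, which matters for files of this size.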
## Licensing Information

We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).
By using this dataset, you are also bound by any license agreements and terms of use of the original data sources.

## BibTeX

If you use our dataset or tooling, please cite us:

```bibtex
@article{dolma,
  title = {{Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author = {
    Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and
    Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and
    Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and
    Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and
    Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and
    Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and
    Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo
  },
  year = {2024},
  journal = {arXiv preprint},
}
```