---
license: apache-2.0
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: int32
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: string
  - name: detoxify
    struct:
    - name: toxicity
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: obscene
      dtype: float64
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: threat
      dtype: float64
    - name: sexual_explicit
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    sequence:
    - name: name
      dtype: string
    - name: count
      dtype: int32
  - name: labels
    sequence:
    - name: name
      dtype: string
    - name: value
      dtype: float64
    - name: count
      dtype: int32
  splits:
  - name: train
    num_bytes: 158850455
    num_examples: 128575
  - name: validation
    num_bytes: 7963122
    num_examples: 6599
  download_size: 66674129
  dataset_size: 166813577
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
tags:
- human-feedback
size_categories:
- 100K<n<1M
pretty_name: OpenAssistant Conversations Release 2
---

# Open Assistant Conversations Dataset Release 2 (OASST2)

## Dataset Description

- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** https://arxiv.org/abs/2304.07327

### Dataset Structure

This dataset contains message trees. Each message tree has an initial prompt message as the root node,
which can have multiple child messages as replies, and these child messages can in turn have multiple replies.

Every message has a `role` property, which is either "assistant" or "prompter". Along any conversation
thread from the prompt to a leaf node, the roles strictly alternate between "prompter" and "assistant".

This version of the dataset contains data collected on the [open-assistant.io](https://open-assistant.io/) website up to November 5, 2023.
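
The alternation rule above can be checked with a few lines of Python. This is a minimal sketch, not part of the official tooling: the function name `roles_alternate` is hypothetical, a thread is assumed to be a list of message dicts ordered from the root prompt to a leaf, and the root is assumed to have the role "prompter".

```python
def roles_alternate(thread):
    """Return True if roles alternate along a root-to-leaf thread.

    Assumption: the thread starts with a "prompter" message (the initial prompt).
    """
    expected = "prompter"
    for message in thread:
        if message["role"] != expected:
            return False
        # Flip the expected role for the next message in the thread.
        expected = "assistant" if expected == "prompter" else "prompter"
    return True
```

A thread that fails this check would indicate either a broken export or a thread that was not assembled along a single parent-child path.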

### JSON Example: Message

For readability, the following JSON examples are shown formatted with indentation on multiple lines.
Objects are stored without indentation (on single lines) in the actual jsonl files.

```json
{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
```

### JSON Example: Conversation Tree

For readability, only a subset of the message properties is shown here.

```json
{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          // ...
        ]
      },
      {
        "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
        "text": "The reason that the result of a division by zero is (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
            "text": "Math is confusing. Like those weird Irrational (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                "text": "Irrational numbers are simply numbers (..)",
                "role": "assistant",
                "lang": "en",
                "replies": []
              },
              // ...
            ]
          }
        ]
      }
    ]
  }
}
```

Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and Python code to read and write jsonl files containing oasst data objects.
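
If installing the oasst-data package is not an option, the tree files can also be read with the Python standard library alone. The following is a minimal sketch, assuming each line of a `.trees.jsonl.gz` file is one JSON object shaped like the conversation tree example above (a `prompt` object with nested `replies`). The helper names `read_jsonl_gz` and `iter_threads` are hypothetical and are not part of oasst-data.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed jsonl file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def iter_threads(message, prefix=()):
    """Yield every path from `message` down to a leaf as a list of message dicts."""
    path = prefix + (message,)
    replies = message.get("replies") or []
    if not replies:
        # A message without replies is a leaf, so the path is a complete thread.
        yield list(path)
        return
    for reply in replies:
        yield from iter_threads(reply, path)
```

For example, iterating `read_jsonl_gz("2023-11-05_oasst2_ready.trees.jsonl.gz")` and calling `iter_threads(tree["prompt"])` on each tree would enumerate all conversation threads, one per leaf message.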


## Main Dataset Files

Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
or as a flat list (table) of messages (extension `.messages.jsonl.gz`).


### Ready For Export Trees

```
2023-11-05_oasst2_ready.trees.jsonl.gz      13,854 trees with 135,174 total messages
2023-11-05_oasst2_ready.messages.jsonl.gz   135,174 messages
```

#### 2023-11-05_oasst2_ready.trees.jsonl.gz Stats

```
Trees            : 13,854
Messages         : 135,174
Oldest message   : 2023-01-16 20:24:26.211711+00:00
Youngest message : 2023-11-04 15:23:03.239343+00:00
Detoxify ratings : 111,448
Accepted messages: 129,517
Deleted messages : 4,376
Tree counts by state:
  - ready_for_export: 13,854
Message counts by language:
  - en: 64,513
  - es: 28,199
  - ru: 13,935
  - zh: 8,615
  - de: 6,145
  - fr: 3,880
  - pt-BR: 2,699
  - th: 1,560
  - ca: 1,283
  - it: 943
  - uk-UA: 845
  - ja: 788
  - pl: 435
  - eo: 295
  - eu: 274
  - vi: 207
  - fi: 138
  - hu: 113
  - ar: 80
  - nl: 72
  - da: 44
  - tr: 37
  - ko: 24
  - he: 24
  - id: 12
  - cs: 12
  - bn: 1
  - sv: 1
```

These are the trees in `ready_for_export` state, without spam and deleted messages and including message labels. The ready trees file is usually sufficient for supervised fine-tuning (SFT) and reward model (RM) training.
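
For reward model training, one common approach is to turn the sibling replies of a message into preference pairs. The sketch below is illustrative only and is not the training pipeline used by the project. It assumes that a lower `rank` means a more preferred reply (with `rank` 0 being the best) and that replies without a rank should be skipped. The function name `preference_pairs` is hypothetical.

```python
from itertools import combinations


def preference_pairs(parent):
    """Return (preferred_text, rejected_text) pairs for the ranked replies of `parent`.

    Assumption: a lower `rank` value marks the more preferred reply.
    """
    ranked = [r for r in parent.get("replies", []) if r.get("rank") is not None]
    ranked.sort(key=lambda r: r["rank"])
    # combinations() keeps the sorted order, so the first element of each pair ranks better.
    return [(a["text"], b["text"]) for a, b in combinations(ranked, 2)]
```

Applied to every message that has at least two ranked replies, this yields one pair per ordered combination of siblings.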

### All Trees

```
2023-11-05_oasst2_all.trees.jsonl.gz      70,642 trees with 208,584 total messages
2023-11-05_oasst2_all.messages.jsonl.gz   208,584 messages
```

All trees, including those in the states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt), `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.

#### 2023-11-05_oasst2_all.trees.jsonl.gz Stats

```
Trees            : 70,642
Messages         : 208,584
Oldest message   : 2023-01-16 20:24:26.211711+00:00
Youngest message : 2023-11-05 10:24:44.484910+00:00
Detoxify ratings : 156,570
Accepted messages: 189,288
Deleted messages : 5,414
Tree counts by state:
  - ready_for_export: 13,854
  - prompt_lottery_waiting: 44,550
  - halted_by_moderator: 3,089
  - initial_prompt_review: 4,319
  - growing: 3,102
  - aborted_low_grade: 1,708
  - ranking: 20
Message counts by language:
  - en: 85,115
  - es: 47,513
  - ru: 15,990
  - zh: 11,205
  - de: 8,398
  - fr: 5,841
  - pt-BR: 4,540
  - th: 3,236
  - ca: 2,586
  - it: 2,144
  - ja: 1,904
  - uk-UA: 1,889
  - ko: 1,635
  - pl: 1,510
  - eo: 1,405
  - nl: 1,354
  - ar: 1,274
  - vi: 1,137
  - fi: 1,098
  - eu: 995
  - hu: 961
  - tr: 803
  - sv: 763
  - id: 669
  - gl: 574
  - da: 502
  - he: 498
  - cs: 476
  - ro: 434
  - sk: 410
  - fa: 394
  - el: 388
  - bar: 217
  - nb-NO: 196
  - bg: 176
  - bn: 128
  - sl: 119
  - sr: 63
  - swg: 23
  - hi: 14
  - lt: 7
```

### Supplemental Exports: Spam & Prompts

```
2023-11-05_oasst2_spam.messages.jsonl.gz    19,296 matching messages
```

These are messages that were deleted or have a negative review result (`"review_result": false`). Besides low quality, a frequent reason for message deletion is a wrong language tag.
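
The selection rule for the spam export can be written as a small predicate over a flat message dict. This is a sketch of the criterion as described above, not the actual export code. The function name `is_spam` is hypothetical, and a missing or null `review_result` is assumed to mean "not yet reviewed" rather than "rejected".

```python
def is_spam(message):
    """Return True if the message was deleted or has a negative review result."""
    # `is False` keeps messages with a missing or null review_result out of the spam set.
    return bool(message.get("deleted")) or message.get("review_result") is False
```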

```
2023-11-05_oasst2_prompts.messages.jsonl.gz 64,592 matching messages
```

These are all kept initial prompt messages with a positive review result (no spam) from trees in `ready_for_export` or `prompt_lottery_waiting` state.
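
The prompt selection can be sketched in the same way. The code below is an illustration of the stated criteria, not the actual export code. It assumes that an initial prompt is a message whose `parent_id` is null, that "kept" means the message is not deleted, and that the flat message carries a `tree_state` property. The names `PROMPT_STATES` and `is_kept_prompt` are hypothetical.

```python
# Tree states whose initial prompts are included in the prompts export.
PROMPT_STATES = {"ready_for_export", "prompt_lottery_waiting"}


def is_kept_prompt(message):
    """Return True for a non-deleted, positively reviewed root message of a matching tree."""
    return (
        message.get("parent_id") is None  # assumption: root prompts have no parent
        and not message.get("deleted")
        and message.get("review_result") is True
        and message.get("tree_state") in PROMPT_STATES
    )
```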

### Using the Hugging Face Datasets

While HF Datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, all messages that can also be found in the file `2023-11-05_oasst2_ready.messages.jsonl.gz` are available in parquet format as train/validation splits.
These are directly loadable by [Hugging Face Datasets](https://pypi.org/project/datasets/).

To load the oasst2 train and validation splits, use:

```python
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst2")
train = ds['train']      # len(train)=128575 (95%)
val = ds['validation']   # len(val)=6599 (5%)
```

The messages appear in depth-first order of the message trees.

Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
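
The reconstruction described above can be sketched as follows. This is a minimal illustration, assuming each flat message is a dict with `message_id`, `parent_id` and `message_tree_id` keys and that root prompts have a null `parent_id`. The function name `build_trees` is hypothetical. Messages whose parent is not present in the input (for example because it was filtered out) are silently dropped in this sketch.

```python
def build_trees(messages):
    """Rebuild nested trees from flat messages.

    Returns a dict mapping message_tree_id to the root message, where every
    message is a copy of the input dict with an added "replies" list.
    """
    # First pass: copy every message and index it by its id.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots = {}
    # Second pass: attach each message to its parent, or record it as a root.
    for node in nodes.values():
        parent = nodes.get(node["parent_id"])
        if parent is not None:
            parent["replies"].append(node)
        elif node["parent_id"] is None:
            roots[node["message_tree_id"]] = node
    return roots
```

Because dicts preserve insertion order, replies are attached in the order in which they appear in the input. Iterating a loaded split row by row should yield dicts that can be passed directly, for example `build_trees(ds["train"])`.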

### Data Visualisation

Explore the content of the prompts from the English subset using [Bunka](https://github.com/charlesdedampierre/BunkaTopics) open-source visualization technology.
The interactive map [available on a HF space](https://huggingface.co/spaces/bunkalab/visualisation-oasst2) lets you explore each data point to get a more precise overview of the contents.

<a href="https://i.imgur.com/B2H8LR3.png">
  <img src="https://i.imgur.com/B2H8LR3.png" alt="Bunka oasst2 Map" width="35%"/>
</a>

## Contact

- Discord: [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistant@laion.ai](mailto:open-assistant@laion.ai)