---
license: apache-2.0
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: int32
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: string
  - name: detoxify
    struct:
    - name: toxicity
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: obscene
      dtype: float64
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: threat
      dtype: float64
    - name: sexual_explicit
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    sequence:
    - name: name
      dtype: string
    - name: count
      dtype: int32
  - name: labels
    sequence:
    - name: name
      dtype: string
    - name: value
      dtype: float64
    - name: count
      dtype: int32
  splits:
  - name: train
    num_bytes: 158850455
    num_examples: 128575
  - name: validation
    num_bytes: 7963122
    num_examples: 6599
  download_size: 66674129
  dataset_size: 166813577
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
tags:
- human-feedback
size_categories:
- 100K<n<1M
pretty_name: OpenAssistant Conversations Release 2
---

# Open Assistant Conversations Dataset Release 2 (OASST2)

## Dataset Description

- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** https://arxiv.org/abs/2304.07327

### Dataset Structure

This dataset contains message trees. Each message tree has an initial prompt message as the root node,
which can have multiple child messages as replies, and these child messages can in turn have multiple replies.

All messages have a role property, which is either "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".

This version of the dataset contains data collected on the [open-assistant.io](https://open-assistant.io/) website until Nov 5, 2023.

### JSON Example: Message

For readability, the following JSON examples are shown formatted with indentation on multiple lines.
Objects are stored without indentation (on single lines) in the actual jsonl files.

```json
{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
```

### JSON Example: Conversation Tree

For readability, only a subset of the message properties is shown here.

```json
{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          // ...
        ]
      },
      {
        "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
        "text": "The reason that the result of a division by zero is (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
            "text": "Math is confusing. Like those weird Irrational (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                "text": "Irrational numbers are simply numbers (..)",
                "role": "assistant",
                "lang": "en",
                "replies": []
              },
              // ...
            ]
          }
        ]
      }
    ]
  }
}
```

Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and Python code to read and write jsonl files containing oasst data objects.
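
As a rough illustration of the nested layout above, the following sketch walks a tree and lists every conversation thread from the prompt to a leaf. It is not the `oasst-data` implementation: `iter_threads`, `roles_alternate`, and the hand-written `sample_tree` are hypothetical names introduced here, and the sample only mimics the field names shown in the example.

```python
from typing import Iterator


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[list]:
    """Yield every path from the given node down to a leaf, as a list of messages."""
    path = prefix + (node,)
    replies = node.get("replies") or []
    if not replies:
        yield list(path)
        return
    for reply in replies:
        yield from iter_threads(reply, path)


def roles_alternate(thread: list) -> bool:
    """Check that roles go prompter, assistant, prompter, ... along a thread."""
    expected = ("prompter", "assistant")
    return all(msg["role"] == expected[i % 2] for i, msg in enumerate(thread))


# Hand-written stand-in for one tree object (texts and ids are made up).
sample_tree = {
    "message_tree_id": "t1",
    "tree_state": "ready_for_export",
    "prompt": {
        "message_id": "t1", "role": "prompter", "text": "Q",
        "replies": [
            {"message_id": "a1", "role": "assistant", "text": "A1", "replies": []},
            {"message_id": "a2", "role": "assistant", "text": "A2", "replies": [
                {"message_id": "p2", "role": "prompter", "text": "Q2", "replies": []},
            ]},
        ],
    },
}

threads = list(iter_threads(sample_tree["prompt"]))
for thread in threads:
    print(" -> ".join(msg["role"] for msg in thread), roles_alternate(thread))
```

Each yielded thread is one linear conversation, which is usually the unit needed when turning a tree into training samples.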


## Main Dataset Files

Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
or as a flat list (table) of messages (extension `.messages.jsonl.gz`).
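
Both file types are gzip-compressed JSON Lines, so they can be read with the Python standard library alone. The sketch below assumes one JSON object per line, as stated earlier; `read_jsonl_gz` is a hypothetical helper, and the demo writes a tiny made-up file to a temporary directory instead of touching the real exports.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def read_jsonl_gz(path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up rows standing in for a *.messages.jsonl.gz file.
demo_rows = [
    {"message_id": "m1", "parent_id": None, "role": "prompter", "text": "Hi"},
    {"message_id": "m2", "parent_id": "m1", "role": "assistant", "text": "Hello"},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.messages.jsonl.gz"
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    loaded = list(read_jsonl_gz(demo_path))

print(len(loaded), "messages read")
```

The same function should work for a `.trees.jsonl.gz` file, where each line would hold one nested tree object rather than one message.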


### Ready For Export Trees

```
2023-11-05_oasst2_ready.trees.jsonl.gz      13,854 trees with 135,174 total messages
2023-11-05_oasst2_ready.messages.jsonl.gz  135,174 messages
```

#### 2023-11-05_oasst2_ready.trees.jsonl.gz Stats
```
Trees            : 13,854
Messages         : 135,174
Oldest message   : 2023-01-16 20:24:26.211711+00:00
Youngest message : 2023-11-04 15:23:03.239343+00:00
Detoxify ratings : 111,448
Accepted messages: 129,517
Deleted messages : 4,376
Tree counts by state:
  - ready_for_export: 13,854
Message counts by language:
  - en: 64,513
  - es: 28,199
  - ru: 13,935
  - zh: 8,615
  - de: 6,145
  - fr: 3,880
  - pt-BR: 2,699
  - th: 1,560
  - ca: 1,283
  - it: 943
  - uk-UA: 845
  - ja: 788
  - pl: 435
  - eo: 295
  - eu: 274
  - vi: 207
  - fi: 138
  - hu: 113
  - ar: 80
  - nl: 72
  - da: 44
  - tr: 37
  - ko: 24
  - he: 24
  - id: 12
  - cs: 12
  - bn: 1
  - sv: 1
```

Trees in the `ready_for_export` state, without spam and deleted messages, and including message labels. The `2023-11-05_oasst2_ready.trees.jsonl.gz` file is usually sufficient for supervised fine-tuning (SFT) and reward model (RM) training.
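
For reward model training, one common approach is to turn sibling replies into preference pairs using the `rank` field. The sketch below is only an illustration under a loud assumption: this README does not state the direction of `rank`, and the code assumes that a lower value (such as the `"rank": 0` in the message example) marks the more preferred reply. `ranked_pairs` and the sample replies are hypothetical.

```python
from itertools import combinations


def ranked_pairs(replies: list) -> list:
    """Return (preferred_text, other_text) pairs for sibling replies.

    ASSUMPTION: a lower `rank` means a more preferred reply; replies
    without a rank (None or missing) are skipped.
    """
    ranked = sorted(
        (r for r in replies if r.get("rank") is not None),
        key=lambda r: r["rank"],
    )
    # combinations() keeps the sorted order, so the first item of each pair
    # always has the lower rank value.
    return [(better["text"], worse["text"]) for better, worse in combinations(ranked, 2)]


# Made-up sibling replies to the same parent message.
siblings = [
    {"text": "reply A", "rank": 1},
    {"text": "reply B", "rank": 0},
    {"text": "reply C", "rank": None},
]
pairs = ranked_pairs(siblings)
print(pairs)
```

If the ranking direction turns out to be the opposite, swapping the two elements of each pair is enough.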

### All Trees

```
2023-11-05_oasst2_all.trees.jsonl.gz      70,642 trees with 208,584 total messages
2023-11-05_oasst2_all.messages.jsonl.gz  208,584 messages
```

All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt), `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.

#### 2023-11-05_oasst2_all.trees.jsonl.gz Stats

```
Trees            : 70,642
Messages         : 208,584
Oldest message   : 2023-01-16 20:24:26.211711+00:00
Youngest message : 2023-11-05 10:24:44.484910+00:00
Detoxify ratings : 156,570
Accepted messages: 189,288
Deleted messages : 5,414
Tree counts by state:
  - ready_for_export: 13,854
  - prompt_lottery_waiting: 44,550
  - halted_by_moderator: 3,089
  - initial_prompt_review: 4,319
  - growing: 3,102
  - aborted_low_grade: 1,708
  - ranking: 20
Message counts by language:
  - en: 85,115
  - es: 47,513
  - ru: 15,990
  - zh: 11,205
  - de: 8,398
  - fr: 5,841
  - pt-BR: 4,540
  - th: 3,236
  - ca: 2,586
  - it: 2,144
  - ja: 1,904
  - uk-UA: 1,889
  - ko: 1,635
  - pl: 1,510
  - eo: 1,405
  - nl: 1,354
  - ar: 1,274
  - vi: 1,137
  - fi: 1,098
  - eu: 995
  - hu: 961
  - tr: 803
  - sv: 763
  - id: 669
  - gl: 574
  - da: 502
  - he: 498
  - cs: 476
  - ro: 434
  - sk: 410
  - fa: 394
  - el: 388
  - bar: 217
  - nb-NO: 196
  - bg: 176
  - bn: 128
  - sl: 119
  - sr: 63
  - swg: 23
  - hi: 14
  - lt: 7
```

### Supplemental Exports: Spam & Prompts

```
2023-11-05_oasst2_spam.messages.jsonl.gz   19,296 matching messages
```

These are messages which were deleted or have a negative review result (`"review_result": false`). Besides low quality, a frequent reason for message deletion is a wrong language tag.

```
2023-11-05_oasst2_prompts.messages.jsonl.gz   64,592 matching messages
```

These are all the kept initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.
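
The two selection rules above can be written as simple predicates over flat message rows. This is a sketch of the described criteria, not the export script itself: `is_spam` and `is_kept_prompt` are hypothetical names, and it assumes that an initial prompt is a message whose `parent_id` is null and that "kept" means not deleted.

```python
PROMPT_STATES = {"ready_for_export", "prompt_lottery_waiting"}


def is_spam(msg: dict) -> bool:
    """Deleted, or explicitly rejected in review."""
    return bool(msg.get("deleted")) or msg.get("review_result") is False


def is_kept_prompt(msg: dict) -> bool:
    """Initial prompt with a positive review, in one of the two listed tree states."""
    return (
        msg.get("parent_id") is None          # ASSUMPTION: roots have no parent
        and not msg.get("deleted")            # ASSUMPTION: "kept" means not deleted
        and msg.get("review_result") is True
        and msg.get("tree_state") in PROMPT_STATES
    )


# Made-up rows for illustration.
rows = [
    {"message_id": "p1", "parent_id": None, "deleted": False,
     "review_result": True, "tree_state": "ready_for_export"},
    {"message_id": "p2", "parent_id": None, "deleted": False,
     "review_result": False, "tree_state": "aborted_low_grade"},
    {"message_id": "a1", "parent_id": "p1", "deleted": True,
     "review_result": True, "tree_state": "ready_for_export"},
]
spam_ids = [r["message_id"] for r in rows if is_spam(r)]
prompt_ids = [r["message_id"] for r in rows if is_kept_prompt(r)]
print(spam_ids, prompt_ids)
```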

### Using the Huggingface Datasets

While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, all messages in the file `2023-11-05_oasst2_ready.messages.jsonl.gz` are also available in Parquet format as train/validation splits.
These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).

To load the oasst2 train & validation splits use:

```python
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst2")
train = ds['train']      # len(train)=128575 (95%)
val = ds['validation']   # len(val)=6599 (5%)
```

The messages appear in depth-first order of the message trees.

Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
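
The reconstruction described above could look roughly like the following sketch. It works on plain dictionaries, so rows from a `.messages.jsonl.gz` file or from a loaded split would both fit, provided they carry the four properties named above. `build_trees` and `row` are hypothetical helpers, the sample ids are made up, and the code assumes that a root prompt has a null `parent_id`.

```python
def build_trees(messages: list) -> dict:
    """Group flat messages into nested trees keyed by message_tree_id."""
    # Copy each message and give it an empty list of replies.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    trees = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            # ASSUMPTION: root prompts have no parent_id.
            trees[node["message_tree_id"]] = {
                "message_tree_id": node["message_tree_id"],
                "tree_state": node.get("tree_state"),
                "prompt": node,
            }
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is missing from the input are skipped.
    return trees


def row(message_id, parent_id, role):
    """Build a minimal made-up message row for the demo."""
    return {"message_id": message_id, "parent_id": parent_id, "role": role,
            "message_tree_id": "p1", "tree_state": "ready_for_export"}


# Input order does not matter: children may appear before their parents.
flat = [
    row("a2", "p1", "assistant"),
    row("p1", None, "prompter"),
    row("p2", "a2", "prompter"),
    row("a1", "p1", "assistant"),
]
trees = build_trees(flat)
root = trees["p1"]["prompt"]
print(sorted(r["message_id"] for r in root["replies"]))
```

Building the id-to-node map first and attaching children in a second pass avoids any dependence on the order of the rows.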

### Data Visualisation

Explore the content of the prompts from the English subset using [Bunka](https://github.com/charlesdedampierre/BunkaTopics) open-source visualization technology.
The interactive map [available on a HF space](https://huggingface.co/spaces/bunkalab/visualisation-oasst2) lets you explore each data point to get a more precise overview of the contents.

<a href="https://i.imgur.com/B2H8LR3.png">
    <img src="https://i.imgur.com/B2H8LR3.png" alt="Bunka oasst2 Map" width="35%"/>
</a>

## Contact

- Discord: [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistant@laion.ai](mailto:open-assistant@laion.ai)