---
datasets:
- PKU-Alignment/PKU-SafeRLHF
language:
- en
tags:
- reinforcement-learning-from-human-feedback
- reinforcement-learning
- beaver
- safety
- llama
- ai-safety
- deepspeed
- rlhf
- alpaca
library_name: safe-rlhf
---

# 🦫 Beaver's Reward Model

## Model Details

The Beaver reward model is a preference model trained on the [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) dataset.
It is used in the Safe RLHF algorithm to steer the Beaver model toward more helpful responses.

- **Developed by:** the [PKU-Alignment](https://github.com/PKU-Alignment) Team.
- **Model Type:** An auto-regressive language model based on the transformer architecture.
- **License:** Non-commercial license.
- **Fine-tuned from model:** [LLaMA](https://arxiv.org/abs/2302.13971), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca).

## Model Sources

- **Repository:** <https://github.com/PKU-Alignment/safe-rlhf>
- **Beaver:** <https://huggingface.co/PKU-Alignment/beaver-7b-v1.0>
- **Dataset:** <https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF>
- **Reward Model:** <https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward>
- **Cost Model:** <https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-cost>
- **Dataset Paper:** <https://arxiv.org/abs/2307.04657>
- **Paper:** <https://arxiv.org/abs/2310.12773>

## How to Use the Reward Model

```python
import torch
from transformers import AutoTokenizer
from safe_rlhf.models import AutoModelForScore

# Load the reward model and its tokenizer from the Hugging Face Hub.
model = AutoModelForScore.from_pretrained('PKU-Alignment/beaver-7b-v1.0-reward', torch_dtype=torch.bfloat16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('PKU-Alignment/beaver-7b-v1.0-reward')

# The full conversation to score: the user prompt followed by the assistant response.
# Named `text` rather than `input` to avoid shadowing the Python builtin.
text = 'BEGINNING OF CONVERSATION: USER: hello ASSISTANT:Hello! How can I help you today?'

# The tokenizer returns a dict-like batch (input_ids, attention_mask), unpacked into the model call.
inputs = tokenizer(text, return_tensors='pt')
output = model(**inputs)
print(output)

# `scores` holds one score per token; `end_scores` is the score at the last
# token (position `end_index`), which is -11.4375 in the output below.

# ScoreModelOutput(
#     scores=tensor([[[-19.7500],
#                     [-19.3750],
#                     [-20.1250],
#                     [-18.0000],
#                     [-20.0000],
#                     [-23.8750],
#                     [-23.5000],
#                     [-22.0000],
#                     [-21.0000],
#                     [-20.1250],
#                     [-23.7500],
#                     [-21.6250],
#                     [-21.7500],
#                     [-12.9375],
#                     [ -6.4375],
#                     [ -8.1250],
#                     [ -7.3438],
#                     [ -9.1875],
#                     [-13.6250],
#                     [-10.5625],
#                     [ -9.9375],
#                     [ -6.4375],
#                     [ -6.0938],
#                     [ -5.8438],
#                     [ -6.6562],
#                     [ -5.9688],
#                     [ -9.1875],
#                     [-11.4375]]], grad_fn=<ToCopyBackward0>),
#     end_scores=tensor([[-11.4375]], grad_fn=<ToCopyBackward0>),
#     last_hidden_state=tensor([[[ 0.7461, -0.6055, -0.4980,  ...,  0.1670,  0.7812, -0.3242],
#                                [ 0.7383, -0.5391, -0.1836,  ..., -0.1396,  0.5273, -0.2256],
#                                [ 0.6836, -0.7031, -0.3730,  ...,  0.2100,  0.5000, -0.6328],
#                                ...,
#                                [-1.7969,  1.0234,  1.0234,  ..., -0.8047,  0.2500, -0.8398],
#                                [ 2.0469, -1.3203,  0.8984,  ..., -0.7734, -1.4141, -1.6797],
#                                [ 4.3438, -0.6953,  0.9648,  ..., -0.1787,  0.6680, -3.0000]]],
#                              dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>),
#     end_last_hidden_state=tensor([[ 4.3438, -0.6953,  0.9648,  ..., -0.1787,  0.6680, -3.0000]],
#                                  dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>),
#     end_index=tensor([27])
# )
```
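
A reward model is usually used to compare candidate responses to the same prompt. The sketch below shows one way to do that once you have a scalar `end_scores` value per response. Both helper names (`format_conversation`, `preference_probability`) are hypothetical and not part of `safe_rlhf`. The prompt template is inferred only from the single example string above, and reading the score difference as a Bradley-Terry preference probability is an assumption about how the scores are meant to be interpreted, not something this card states.

```python
import math


def format_conversation(prompt: str, response: str) -> str:
    """Build a single-turn input string in the layout of the example above.

    Assumption: the template is inferred from that one example, including
    the absence of a space after 'ASSISTANT:'.
    """
    return f'BEGINNING OF CONVERSATION: USER: {prompt} ASSISTANT:{response}'


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred over response B.

    Assumes a Bradley-Terry reading of the scores: sigmoid(score_a - score_b),
    computed in a numerically stable way for large score gaps.
    """
    diff = score_a - score_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    exp_diff = math.exp(diff)
    return exp_diff / (1.0 + exp_diff)


# Rebuild the example input from its two parts.
text = format_conversation('hello', 'Hello! How can I help you today?')
print(text)

# Hypothetical scores standing in for `output.end_scores.item()` of two responses.
print(preference_probability(-11.4375, -13.0))
```

In practice, each score would come from a separate call to the model with the string returned by `format_conversation`. Only the difference between two scores for the same prompt is meaningful under this reading, so absolute values such as -11.4375 should not be compared across prompts.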