README.md
| 1 | --- |
| 2 | datasets: |
| 3 | - PKU-Alignment/PKU-SafeRLHF |
| 4 | language: |
| 5 | - en |
| 6 | tags: |
| 7 | - reinforcement-learning-from-human-feedback |
| 8 | - reinforcement-learning |
| 9 | - beaver |
| 10 | - safety |
| 11 | - llama |
| 12 | - ai-safety |
| 13 | - deepspeed |
| 14 | - rlhf |
| 15 | - alpaca |
| 16 | library_name: safe-rlhf |
| 17 | --- |
| 18 | |
| 19 | # 🦫 Beaver's Cost Model |
| 20 | |
| 21 | ## Model Details |
| 22 | |
| 23 | The Beaver cost model is a preference model trained using the [PKU-SafeRLHF](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF) dataset. |
| 24 | It can play a role in the safe RLHF algorithm, helping the Beaver model become more safe and harmless. |
| 25 | |
| 26 | - **Developed by:** the [PKU-Alignment](https://github.com/PKU-Alignment) Team. |
| 27 | - **Model Type:** An auto-regressive language model based on the transformer architecture. |
| 28 | - **License:** Non-commercial license. |
| 29 | - **Fine-tuned from model:** [LLaMA](https://arxiv.org/abs/2302.13971), [Alpaca](https://github.com/tatsu-lab/stanford_alpaca). |
| 30 | |
| 31 | ## Model Sources |
| 32 | |
| 33 | - **Repository:** <https://github.com/PKU-Alignment/safe-rlhf> |
| 34 | - **Beaver:** <https://huggingface.co/PKU-Alignment/beaver-7b-v1.0> |
| 35 | - **Dataset:** <https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF> |
| 36 | - **Reward Model:** <https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-reward> |
| 37 | - **Cost Model:** <https://huggingface.co/PKU-Alignment/beaver-7b-v1.0-cost> |
| 38 | - **Dataset Paper:** <https://arxiv.org/abs/2307.04657> |
| 39 | - **Paper:** <https://arxiv.org/abs/2310.12773> |
| 40 | |
| 41 | ## How to Use the Cost Model |
| 42 | |
| 43 | ```python |
| 44 | import torch |
| 45 | from transformers import AutoTokenizer |
| 46 | from safe_rlhf.models import AutoModelForScore |
| 47 | |
| 48 | model = AutoModelForScore.from_pretrained('PKU-Alignment/beaver-7b-v1.0-cost', torch_dtype=torch.bfloat16, device_map='auto') |
| 49 | tokenizer = AutoTokenizer.from_pretrained('PKU-Alignment/beaver-7b-v1.0-cost') |
| 50 | |
| 51 | input = 'BEGINNING OF CONVERSATION: USER: hello ASSISTANT:Hello! How can I help you today?' |
| 52 | |
| 53 | input_ids = tokenizer(input, return_tensors='pt') |
| 54 | output = model(**input_ids) |
| 55 | print(output) |
| 56 | |
| 57 | # ScoreModelOutput( |
| 58 | # scores=tensor([[[ -9.4375], |
| 59 | # [ -2.5156], |
| 60 | # [ -2.6562], |
| 61 | # [ -2.3594], |
| 62 | # [ -1.9375], |
| 63 | # [ -2.5781], |
| 64 | # [ -1.4766], |
| 65 | # [ -1.9922], |
| 66 | # [ -2.6562], |
| 67 | # [ -3.8125], |
| 68 | # [ -2.9844], |
| 69 | # [ -4.1875], |
| 70 | # [ -3.5938], |
| 71 | # [ -4.6562], |
| 72 | # [ -4.0000], |
| 73 | # [ -3.3438], |
| 74 | # [ -4.5625], |
| 75 | # [ -4.8438], |
| 76 | # [ -5.1875], |
| 77 | # [ -8.0000], |
| 78 | # [ -8.4375], |
| 79 | # [-10.5000], |
| 80 | # [-10.5000], |
| 81 | # [ -8.8750], |
| 82 | # [-10.1250], |
| 83 | # [-10.2500], |
| 84 | # [-11.5625], |
| 85 | # [-10.7500]]], grad_fn=<ToCopyBackward0>), |
| 86 | # end_scores=tensor([[-10.7500]], grad_fn=<ToCopyBackward0>), |
| 87 | # last_hidden_state=tensor([[[ 2.2812, -0.4219, -0.2832, ..., 0.2715, 0.4277, 1.1875], |
| 88 | # [-0.3730, -0.2158, 1.2891, ..., -1.3281, 0.6016, 0.7773], |
| 89 | # [ 0.2285, -1.2422, 1.0625, ..., -1.3438, 1.1875, 1.1016], |
| 90 | # ..., |
| 91 | # [-0.8828, -2.6250, 0.9180, ..., -0.2773, 1.7500, 0.7695], |
| 92 | # [ 2.0781, -4.1250, -0.1069, ..., -0.8008, 0.4844, 0.4102], |
| 93 | # [ 2.9688, -1.6250, 1.1250, ..., 0.3223, 0.0439, -2.3281]]], |
| 94 | # dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>), |
| 95 | # end_last_hidden_state=tensor([[ 2.9688, -1.6250, 1.1250, ..., 0.3223, 0.0439, -2.3281]], |
| 96 | # dtype=torch.bfloat16, grad_fn=<ToCopyBackward0>), |
| 97 | # end_index=tensor([27]) |
| 98 | # ) |
| 99 | ``` |
| 100 | |