---
license: afl-3.0
language:
- ar
- de
- en
- es
- fr
- it
- lv
- nl
- pt
- zh
- multilingual
---
# xlm-roberta-large-ner-hrl
## Model description
**xlm-roberta-large-ner-hrl** is a **Named Entity Recognition** model for 10 high-resource languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese), based on a fine-tuned XLM-RoBERTa large model. It has been trained to recognize three types of entities: locations (LOC), organizations (ORG), and persons (PER).
Specifically, this model is an *xlm-roberta-large* model that was fine-tuned on an aggregation of NER datasets from 10 high-resource languages.
## Intended uses & limitations
#### How to use
You can use this model with the Transformers *pipeline* for NER.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-large-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-large-ner-hrl")

# Build a NER pipeline; by default it returns one prediction per subword token
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
ner_results = nlp(example)
print(ner_results)
```
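Because XLM-RoBERTa tokenizes text into subwords, the raw pipeline output contains one prediction per subword piece. Recent versions of Transformers can merge these into whole entity spans via the `aggregation_strategy` argument; a minimal sketch, continuing from the snippet above (assuming a Transformers version that supports this argument):

```python
# Continues from the snippet above; groups subword predictions into whole
# entities, each with an entity type, a word string, and start/end offsets.
grouped_nlp = pipeline("ner", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
print(grouped_nlp(example))
```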
#### Limitations and bias
This model is limited by its training dataset of entity-annotated news articles from a specific span of time. It may not generalize well to all use cases in other domains.
## Training data
The training data for the 10 languages are from:

Language|Dataset
-|-
Arabic | [ANERcorp](https://camel.abudhabi.nyu.edu/anercorp/)
German | [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
English | [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
Spanish | [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/)
French | [Europeana Newspapers](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_FR.bnf.bio)
Italian | [Italian I-CAB](https://ontotext.fbk.eu/icab.html)
Latvian | [Latvian NER](https://github.com/LUMII-AILab/FullStack/tree/master/NamedEntities)
Dutch | [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/)
Portuguese | [Paramopama + Second HAREM](https://github.com/davidsbatista/NER-datasets/tree/master/Portuguese)
Chinese | [MSRA](https://huggingface.co/datasets/msra_ner)

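Of these corpora, the Chinese MSRA set is hosted on the Hugging Face Hub, so it can be inspected directly with the `datasets` library; the others are distributed through the links above. A minimal sketch (field names follow the dataset card):

```python
from datasets import load_dataset

# MSRA NER corpus (linked in the table above)
msra = load_dataset("msra_ner")
print(msra["train"][0]["tokens"][:10])    # token strings
print(msra["train"][0]["ner_tags"][:10])  # integer-encoded entity tags
```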
The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes (a decoding sketch follows the table below):
Abbreviation|Description
-|-
O|Outside of a named entity
B-PER |Beginning of a person’s name right after another person’s name
I-PER |Person’s name
B-ORG |Beginning of an organization right after another organization
I-ORG |Organization
B-LOC |Beginning of a location right after another location
I-LOC |Location
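Under this convention (often called IOB1), an entity normally begins with an I- tag; a B- tag appears only when an entity directly follows another entity of the same type. A minimal decoding sketch (the helper name and the illustrative tag sequence below are not from the model card):

```python
def iob1_to_spans(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) spans.

    A B- tag or a change of entity type starts a new span; an I- tag of
    the same type continues the current one.
    """
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        prefix, etype = tag.split("-", 1)
        if current is None or current[0] != etype or prefix == "B":
            current = (etype, [token])
            spans.append(current)
        else:
            current[1].append(token)
    return [(etype, " ".join(words)) for etype, words in spans]

# Illustrative tags for the example sentence used earlier
tokens = ["Nader", "Jokhadar", "had", "given", "Syria", "the", "lead"]
tags   = ["I-PER", "I-PER", "O", "O", "I-LOC", "O", "O"]
print(iob1_to_spans(tokens, tags))  # [('PER', 'Nader Jokhadar'), ('LOC', 'Syria')]
```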
## Training procedure
This model was trained on an NVIDIA V100 GPU with the recommended hyperparameters from the HuggingFace code.
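The card does not pin down the exact training script, but a setup consistent with the HuggingFace token-classification examples looks roughly like the sketch below. All hyperparameter values shown are common example defaults, not confirmed settings for this model:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Label set from the table above (IOB1 scheme)
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

# Assumed example-default hyperparameters; the values actually used for
# this model are not documented in the card.
args = TrainingArguments(
    output_dir="xlm-roberta-large-ner-hrl",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# train_dataset would be the aggregated, tokenized 10-language corpus:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```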