---
license: afl-3.0
language:
- ar
- de
- en
- es
- fr
- it
- lv
- nl
- pt
- zh
- multilingual
---
# xlm-roberta-large-ner-hrl
## Model description
**xlm-roberta-large-ner-hrl** is a **Named Entity Recognition** model for 10 high-resource languages (Arabic, German, English, Spanish, French, Italian, Latvian, Dutch, Portuguese and Chinese), based on a fine-tuned XLM-RoBERTa large model. It has been trained to recognize three types of entities: locations (LOC), organizations (ORG), and persons (PER).
Specifically, this model is an *xlm-roberta-large* model that was fine-tuned on an aggregation of NER datasets from 10 high-resource languages.
## Intended uses & limitations
#### How to use
You can use this model with the Transformers *pipeline* for NER.
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-large-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/xlm-roberta-large-ner-hrl")

# Build a NER pipeline; by default it returns one prediction per subword token
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "Nader Jokhadar had given Syria the lead with a well-struck header in the seventh minute."
ner_results = nlp(example)
print(ner_results)
```
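Because XLM-RoBERTa tokenizes text into subwords, the raw pipeline output contains one prediction per subword piece. Recent versions of Transformers can merge these into whole entity spans via the `aggregation_strategy` argument; a minimal sketch, continuing from the snippet above (assuming a Transformers version that supports this argument):

```python
# Continues from the snippet above; groups subword predictions into whole
# entities, each with an entity type, a word string, and start/end offsets.
grouped_nlp = pipeline("ner", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
print(grouped_nlp(example))
```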
#### Limitations and bias
This model is limited by its training dataset of entity-annotated news articles from a specific span of time. It may not generalize well to all use cases in other domains.
## Training data
The training data for the 10 languages are from:

Language|Dataset
-|-
Arabic | [ANERcorp](https://camel.abudhabi.nyu.edu/anercorp/)
German | [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
English | [CoNLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
Spanish | [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/)
French | [Europeana Newspapers](https://github.com/EuropeanaNewspapers/ner-corpora/tree/master/enp_FR.bnf.bio)
Italian | [Italian I-CAB](https://ontotext.fbk.eu/icab.html)
Latvian | [Latvian NER](https://github.com/LUMII-AILab/FullStack/tree/master/NamedEntities)
Dutch | [CoNLL 2002](https://www.clips.uantwerpen.be/conll2002/ner/)
Portuguese | [Paramopama + Second HAREM](https://github.com/davidsbatista/NER-datasets/tree/master/Portuguese)
Chinese | [MSRA](https://huggingface.co/datasets/msra_ner)

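Of these corpora, the Chinese MSRA set is hosted on the Hugging Face Hub, so it can be inspected directly with the `datasets` library; the others are distributed through the links above. A minimal sketch (field names follow the dataset card):

```python
from datasets import load_dataset

# MSRA NER corpus (linked in the table above)
msra = load_dataset("msra_ner")
print(msra["train"][0]["tokens"][:10])    # token strings
print(msra["train"][0]["ner_tags"][:10])  # integer-encoded entity tags
```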
The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes (a decoding sketch follows the table below):
Abbreviation|Description
-|-
O|Outside of a named entity
B-PER |Beginning of a person’s name right after another person’s name
I-PER |Person’s name
B-ORG |Beginning of an organization right after another organization
I-ORG |Organization
B-LOC |Beginning of a location right after another location
I-LOC |Location
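Under this convention (often called IOB1), an entity normally begins with an I- tag; a B- tag appears only when an entity directly follows another entity of the same type. A minimal decoding sketch (the helper name and the illustrative tag sequence below are not from the model card):

```python
def iob1_to_spans(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) spans.

    A B- tag or a change of entity type starts a new span; an I- tag of
    the same type continues the current one.
    """
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        prefix, etype = tag.split("-", 1)
        if current is None or current[0] != etype or prefix == "B":
            current = (etype, [token])
            spans.append(current)
        else:
            current[1].append(token)
    return [(etype, " ".join(words)) for etype, words in spans]

# Illustrative tags for the example sentence used earlier
tokens = ["Nader", "Jokhadar", "had", "given", "Syria", "the", "lead"]
tags   = ["I-PER", "I-PER", "O", "O", "I-LOC", "O", "O"]
print(iob1_to_spans(tokens, tags))  # [('PER', 'Nader Jokhadar'), ('LOC', 'Syria')]
```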
## Training procedure
This model was trained on an NVIDIA V100 GPU with the recommended hyperparameters from the HuggingFace code.
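The card does not pin down the exact training script, but a setup consistent with the HuggingFace token-classification examples looks roughly like the sketch below. All hyperparameter values shown are common example defaults, not confirmed settings for this model:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Label set from the table above (IOB1 scheme)
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

# Assumed example-default hyperparameters; the values actually used for
# this model are not documented in the card.
args = TrainingArguments(
    output_dir="xlm-roberta-large-ner-hrl",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# train_dataset would be the aggregated, tokenized 10-language corpus:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```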