---
id: CardioNER.nl_128xtokenWindow
name: CardioNER.nl_128xtokenWindow
description: >-
  CardioBERTa.nl_clinical finetuned for multilabel NER task with tokenwindow of
  128
license: gpl-3.0
language: nl
tags:
- lexical semantic
- span classification
- science
- biology
- clinical ner
- biomedical
- ner,medical
- bionlp
base_model: UMCU/CardioBERTa.nl_clinical
pipeline_tag: token-classification
datasets:
- DT4H/CardioCCC
- UMCU/cardioccc_dutch
---

# Model Card for CardioNER.nl 128

This is a UMCU/CardioBERTa.nl_clinical base model fine-tuned for span classification. For this model
we used IOB tagging, which facilitates the aggregation of predictions
over sequences. This specific model was trained on a batch of about 500 span-labeled documents.
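
As an illustration of why IOB tags make aggregation straightforward, the sketch below merges per-token tags into spans. It assumes a single tag per token and label names such as `B-disease`; both are illustrative assumptions, and `iob_to_spans` is a hypothetical helper, not part of the CardioNER code base.

```python
def iob_to_spans(tokens, tags):
    """Merge per-token IOB tags into (label, start_idx, end_idx, text) spans.

    end_idx is exclusive. A stray I- tag without a matching open span
    is treated as the start of a new span.
    """
    spans = []
    current = None  # [label, start_idx, end_idx] of the span being built
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[2] = i + 1  # extend the open span by one token
            continue
        if current is not None:
            spans.append(current)  # close the previous span
            current = None
        if prefix in ("B", "I"):
            current = [label, i, i + 1]
    if current is not None:
        spans.append(current)
    return [(lab, s, e, " ".join(tokens[s:e])) for lab, s, e in spans]


# Illustrative tokens and tags (not taken from the training data)
tokens = ["Patient", "heeft", "acuut", "hartfalen", "en", "gebruikt", "metoprolol"]
tags = ["O", "O", "B-disease", "I-disease", "O", "O", "B-medication"]
spans = iob_to_spans(tokens, tags)
```

In a multilabel setup, the same decoding could be applied once per class.
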

This version was trained with context windows of 128 tokens. For the chunking we used a paragraph-based splitter.
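
This card does not describe the splitter in detail, so the following is only a sketch of one way a paragraph-based splitter could pack text into windows of at most 128 tokens. Whitespace-separated words stand in for model tokens, and `chunk_paragraphs` and its parameters are hypothetical names.

```python
import re


def chunk_paragraphs(text, max_tokens=128, count_tokens=lambda s: len(s.split())):
    """Greedily pack consecutive paragraphs into chunks of at most max_tokens.

    A single paragraph longer than max_tokens is kept as its own chunk
    (a real splitter would need to break it up further).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "a b c\n\nd e\n\nf g h i"
chunks = chunk_paragraphs(doc, max_tokens=5)
```

To count model tokens instead of words, a tokenizer-based counter such as `lambda s: len(tokenizer.tokenize(s))` could be passed as `count_tokens`, assuming a loaded `tokenizer`.
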

The training was performed with 10-fold cross-validation (CV), with weight averaging of the best epochs per fold.
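
This card does not spell out the averaging procedure. As a rough sketch of the general idea, uniform weight averaging takes one checkpoint per fold (here assumed to be the best epoch) and averages every parameter element-wise. Plain Python lists stand in for tensors, and `average_state_dicts` is a hypothetical helper.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of parameters across checkpoints with identical keys and shapes."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    n = len(state_dicts)
    averaged = {}
    for key in state_dicts[0]:
        # Walk the same parameter across all checkpoints, position by position
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(col) / n for col in columns]
    return averaged


# Toy checkpoints standing in for the best epoch of two folds
fold_best = [
    {"classifier.weight": [0.0, 2.0], "classifier.bias": [1.0]},
    {"classifier.weight": [2.0, 4.0], "classifier.bias": [3.0]},
]
avg = average_state_dicts(fold_best)
```
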

### Expected input and output
The input should be a string with **Dutch** clinical text related to **cardiology**.

CardioNER.nl_128 is a multiclass span classification model.
The classes that can be predicted are
* **procedure**,
* **medication**,
* **disease**,
* **symptom**.
|
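#### Extracting span classification from CardioNER.nl_128xtokenWindow

The following script converts a string of fewer than 128 tokens to a list of span predictions.
```python
from transformers import pipeline

# Model id as given in this card; use the full Hub repo id
# (including the hosting organization) or a local path.
model = "CardioNER.nl_128xtokenWindow"

le_pipe = pipeline('ner',
                   model=model,
                   tokenizer=model, aggregation_strategy="simple",
                   device=-1)  # device=-1 runs on CPU

# Placeholder input: Dutch cardiology text of fewer than 128 tokens
SOME_TEXT = "Patient heeft hartfalen en gebruikt metoprolol."
named_ents = le_pipe(SOME_TEXT)
```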
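
With `aggregation_strategy="simple"`, the transformers token-classification pipeline returns one dictionary per span with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below groups such predictions by class. The example predictions are made up for illustration, `group_by_class` is a hypothetical helper, and the label strings are assumed to match the class names listed in this card (the model's actual label names may differ).

```python
from collections import defaultdict


def group_by_class(named_ents, min_score=0.0):
    """Group pipeline span predictions by their entity_group label."""
    grouped = defaultdict(list)
    for ent in named_ents:
        # Drop low-confidence spans before grouping
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append((ent["start"], ent["end"], ent["word"]))
    return dict(grouped)


# Made-up predictions in the pipeline's output format
example_ents = [
    {"entity_group": "disease", "score": 0.98, "word": "hartfalen", "start": 14, "end": 23},
    {"entity_group": "medication", "score": 0.42, "word": "metoprolol", "start": 36, "end": 46},
]
by_class = group_by_class(example_ents, min_score=0.5)
```
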

To process a string of *arbitrary length*, you can split it into sentences or paragraphs
using e.g. pysbd or spaCy (sentencizer) and iteratively pass the resulting list through the span-classification pipe.
You can also use the stride option built into the transformers pipeline, although this is limited to non-overlapping strides, requires a fast tokenizer, and does not work with `aggregation_strategy=None`:
```python
named_ents = le_pipe(SOME_TEXT, stride=256)
```
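
When you split a long document yourself, the character offsets returned for each chunk are relative to that chunk. The sketch below splits on blank lines, runs a span predictor on each paragraph, and shifts `start` and `end` back to document coordinates. `predict_long_text` is a hypothetical helper, and `ner_fn` stands for any callable with the pipeline's input and output format, such as `le_pipe`. A dummy predictor is used here so the example runs without downloading a model.

```python
import re


def predict_long_text(text, ner_fn):
    """Apply ner_fn per paragraph and map span offsets back to the full text."""
    results = []
    # Each match is a run of non-empty lines, i.e. one paragraph
    for match in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        offset = match.start()
        for ent in ner_fn(match.group()):
            shifted = dict(ent)  # copy so the predictor's output is not mutated
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            results.append(shifted)
    return results


def dummy_ner(chunk):
    """Stand-in predictor: tags every occurrence of 'metoprolol' as medication."""
    return [
        {"entity_group": "medication", "score": 1.0, "word": m.group(),
         "start": m.start(), "end": m.end()}
        for m in re.finditer("metoprolol", chunk)
    ]


long_text = "Eerste alinea.\n\nStart metoprolol 50 mg."
ents = predict_long_text(long_text, dummy_ner)
```

Slicing the original text with the shifted offsets, for example `long_text[ents[0]["start"]:ents[0]["end"]]`, is a quick way to check that the mapping is correct.
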
|
# Data description

CardioCCC: manually labeled cardiology discharge letters, annotated with the classes procedure, medication, disease, and symptom.
|
# Acknowledgement

This is part of the [DT4H project](https://www.datatools4heart.eu/).
|
# DOI and reference

For more details about training, evaluation, and other scripts, see the CardioNER [GitHub repo](https://github.com/DataTools4Heart/CardioNER).
For more information on the background, see DataTools4Heart on [Hugging Face](https://huggingface.co/DT4H) or the project [website](https://www.datatools4heart.eu/).