	---
	id: CardioNER.nl_128xtokenWindow
	name: CardioNER.nl_128xtokenWindow
	description: >-
	CardioBERTa.nl_clinical finetuned for multilabel NER task with tokenwindow of
	128
	license: gpl-3.0
	language: nl
	tags:
	- lexical semantic
	- span classification
	- science
	- biology
	- clinical ner
	- biomedical
	- ner,medical
	- bionlp
	base_model: UMCU/CardioBERTa.nl_clinical
	pipeline_tag: token-classification
	datasets:
	- DT4H/CardioCCC
	- UMCU/cardioccc_dutch
	---

	# Model Card for CardioNER.nl_128

	This is a UMCU/CardioBERTa.nl_clinical base model fine-tuned for span classification. For this model
	we used IOB tagging. Using the IOB tagging schema facilitates the aggregation of predictions
	over sequences. This specific model was trained on a batch of about 500 span-labeled documents.
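
To illustrate why IOB tags make aggregation straightforward, here is a minimal sketch that merges token-level tags into spans. The tag strings (`B-disease`, `I-disease`, `O`) are an assumption for illustration; the exact label names stored in the model config may differ, and this is not the aggregation code used by the `transformers` pipeline.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Merge token-level IOB tags into (label, start_token, end_token) spans.

    end_token is exclusive. Tags are assumed to look like "B-disease",
    "I-disease" or "O".
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # An I- tag continues the open span only if the label matches.
        if prefix == "I" and name == label:
            continue
        # Anything else closes the span that is currently open.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # B- opens a new span; a stray I- is treated as a span start too.
        if prefix in ("B", "I"):
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["O", "B-disease", "I-disease", "O", "B-medication"]
spans = iob_to_spans(tags)
```

Because a `B-` tag always marks a span start, two adjacent entities of the same class stay separate, which a plain per-token class label could not express.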

	This version was trained with context windows of 128 tokens. For the chunking we used a paragraph-based splitter.

	Training was performed with 10-fold cross-validation (CV), with weight averaging of the best epochs per fold.
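
As a rough sketch of what weight averaging across folds can look like, the snippet below takes the element-wise mean of parameters that share a name. It assumes a uniform mean over one checkpoint per fold; the card does not state how the checkpoints were weighted, and real checkpoints hold tensors rather than Python lists. The parameter names and values are toy placeholders.

```python
from typing import Dict, List


def average_weights(checkpoints: List[Dict[str, List[float]]]) -> Dict[str, List[float]]:
    """Element-wise mean of parameter vectors that share the same names."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # zip(*...) lines up the i-th value of this parameter across folds.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Toy stand-ins for the best checkpoint of two folds.
folds = [
    {"classifier.weight": [1.0, 2.0], "classifier.bias": [0.0]},
    {"classifier.weight": [3.0, 4.0], "classifier.bias": [1.0]},
]
merged = average_weights(folds)
```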


	### Expected input and output
	The input should be a string with Dutch clinical text related to cardiology.

	CardioNER.nl_128 is a multiclass span classification model.
	The classes that can be predicted are
	* procedure,
	* medication,
	* disease,
	* symptom.

	#### Extracting span classification from CardioNER.nl_128xtokenWindow

	The following script converts a string of <128 tokens to a list of span predictions.
	```python
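	from transformers import pipeline

	# Hub ID of this model; weights and tokenizer are both loaded from it.
	model = "UMCU/CardioNER.nl_128"

	le_pipe = pipeline('ner',
	                   model=model,
	                   tokenizer=model,
	                   aggregation_strategy="simple",
	                   device=-1)  # -1 runs on CPU

	# Illustrative input; replace with your own Dutch cardiology text.
	SOME_TEXT = "Patiënt heeft pijn op de borst en gebruikt metoprolol."
	named_ents = le_pipe(SOME_TEXT)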
	```
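
With an aggregation strategy set, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start` and `end`. The sketch below groups such predictions per class and drops low-confidence spans. The `min_score` threshold is a hypothetical parameter, the lower-case label names are an assumption, and the example predictions are hand-written rather than real model output.

```python
from collections import defaultdict


def group_by_class(named_ents, text, min_score=0.0):
    """Collect predicted span texts per class, keeping character offsets."""
    grouped = defaultdict(list)
    for ent in named_ents:
        if ent["score"] < min_score:
            continue
        start, end = ent["start"], ent["end"]
        # Slice the original text so casing and spacing are preserved.
        grouped[ent["entity_group"]].append((text[start:end], start, end))
    return dict(grouped)


# Hand-written example predictions, not real model output.
text = "Pijn op de borst, start metoprolol."
example_ents = [
    {"entity_group": "symptom", "score": 0.97, "word": "Pijn op de borst", "start": 0, "end": 16},
    {"entity_group": "medication", "score": 0.41, "word": "metoprolol", "start": 24, "end": 34},
]
confident = group_by_class(example_ents, text, min_score=0.5)
```

Slicing `text[start:end]` instead of reading `word` avoids artifacts that the tokenizer can introduce when it reassembles subword pieces.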

	To process a string of arbitrary length, you can split it into sentences or paragraphs,
	using e.g. pysbd or the spaCy sentencizer, and pass the resulting chunks one by one to the span-classification pipeline.
	You can also use the `stride` option built into the transformers pipeline, although this is limited to non-overlapping strides, requires a fast tokenizer, and does not work with aggregation_strategy=None:
	```python
	named_ents = le_pipe(SOME_TEXT, stride=256)
	```
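
The chunk-by-chunk route can be sketched as follows: split the text on blank lines, run the pipeline per paragraph, and shift the character offsets back to positions in the full document. This is a minimal sketch under assumptions: the blank-line splitter is a stand-in for pysbd or spaCy, `predict_long_text` is a hypothetical helper, and a single paragraph can still exceed the 128-token window.

```python
import re
from typing import Callable, Dict, Iterator, List, Tuple


def split_paragraphs(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (offset, paragraph) pairs, splitting on blank lines."""
    start = 0
    for sep in re.finditer(r"\n\s*\n", text):
        if sep.start() > start:
            yield start, text[start:sep.start()]
        start = sep.end()
    if start < len(text):
        yield start, text[start:]


def predict_long_text(text: str, pipe: Callable[[str], List[Dict]]) -> List[Dict]:
    """Run `pipe` on each paragraph and map offsets back to the full text."""
    results = []
    for offset, chunk in split_paragraphs(text):
        if not chunk.strip():
            continue  # skip whitespace-only chunks
        for ent in pipe(chunk):
            shifted = dict(ent)
            # Pipeline offsets are relative to the chunk, not the document.
            shifted["start"] = ent["start"] + offset
            shifted["end"] = ent["end"] + offset
            results.append(shifted)
    return results
```

Usage would be `predict_long_text(SOME_TEXT, le_pipe)`, after which `SOME_TEXT[ent["start"]:ent["end"]]` recovers each predicted span from the original string.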



	# Data description

	CardioCCC: manually labeled cardiology discharge letters, annotated with four span classes: procedure, medication, disease, and symptom.


	# Acknowledgement

	This is part of the [DT4H project](https://www.datatools4heart.eu/).

	# DOI and reference



	For more details about training/eval and other scripts, see the CardioNER [GitHub repo](https://github.com/DataTools4Heart/CardioNER).
	For more information on the background, see DataTools4Heart on [Hugging Face](https://huggingface.co/DT4H) or the project [website](https://www.datatools4heart.eu/).