Luca Di Liello
lucadiliello
7 followers · 5 following
https://lucadiliello.github.io
lucadiliello
AI & ML interests
Applied Scientist II in Amazon AGI
Recent Activity
reacted to lysandre's post with 🚀 2 days ago
We're kick-starting the process of Transformers v5, with @ArthurZ and @cyrilvallez! v5 should be significant: we're using it as a milestone for performance optimizations, saner defaults, and a much cleaner code base worthy of 2025. Fun fact: v4.0.0-rc-1 came out on Nov 19, 2020, nearly five years ago!
reacted to Norod78's post with 🔥 20 days ago
Multilingual Tokenization Showdown: Analyzing 12 LLM Tokenizers Across 204 Languages. First, I created a dataset with the text of Wikipedia's "Cat" article in 272 languages: https://huggingface.co/datasets/Norod78/WikiCat-Multilingual For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "characters per token" and "words per token" ratios. The higher these ratios are, the more information each token represents on average for that language (perhaps allowing an LLM to learn more per parameter if trained on a dataset in that language). You can see a slideshow summary of the results here: https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html I hope I interpreted the results correctly. The code is available on GitHub, so you can re-create the raw results JSONL with this repo: https://github.com/Norod/wikicat-tokenizer-eval Post on X: https://x.com/Norod78/status/1984366900550266999
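The two ratios described in the post can be illustrated with a small sketch. This is not the code from the linked repository. The function names `token_ratios` and `toy_tokenize` are hypothetical, the toy tokenizer only stands in for a real LLM tokenizer, and counting whitespace as characters and splitting words on whitespace are assumptions the post does not confirm. The post's 100-word minimum per language entry is left out for brevity.

```python
from typing import Callable, List, Tuple


def token_ratios(text: str, tokenize: Callable[[str], List[str]]) -> Tuple[float, float]:
    """Return (characters per token, words per token) for one text.

    Assumptions (not confirmed by the post): characters include whitespace,
    and words are whitespace-separated chunks.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    chars_per_token = len(text) / len(tokens)
    words_per_token = len(text.split()) / len(tokens)
    return chars_per_token, words_per_token


def toy_tokenize(text: str, piece_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into pieces of at most piece_len characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


sample = "The cat is a small domesticated carnivorous mammal"
cpt, wpt = token_ratios(sample, toy_tokenize)
print(f"characters per token: {cpt:.3f}, words per token: {wpt:.3f}")
```

To compare real tokenizers, one would pass each tokenizer's own tokenize function in place of `toy_tokenize` and run the same text through all of them; a higher ratio means fewer tokens were needed for that text.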
reacted to Norod78's post with 👍 20 days ago
Organizations
None yet
Papers (6)
arxiv:2309.08272
arxiv:2305.15358
arxiv:2205.10455
arxiv:2205.01228
Models (15), sorted by recently updated
lucadiliello/bart-small • 70.5M • Updated Oct 6, 2023 • 1.24k • 5
lucadiliello/opt-30b-deepspeed-inference-fp16-shard-4 • Text Generation • Updated Mar 22, 2023 • 1
lucadiliello/opt-30b-deepspeed-inference-fp16-shard-2 • Text Generation • Updated Mar 22, 2023
lucadiliello/opt-30b-deepspeed-inference-fp16-shard-8 • Text Generation • Updated Mar 22, 2023 • 8
lucadiliello/deberta-small • Fill-Mask • Updated Feb 27, 2023 • 1
lucadiliello/bleurt-tiny-512 • Text Classification • Updated Jan 19, 2023 • 126
lucadiliello/bleurt-tiny-128 • Text Classification • Updated Jan 19, 2023 • 265
lucadiliello/bleurt-large-512 • Text Classification • Updated Jan 19, 2023
lucadiliello/bleurt-large-128 • Text Classification • Updated Jan 19, 2023 • 1
lucadiliello/bleurt-base-512 • Text Classification • Updated Jan 19, 2023 • 1
Datasets (28), sorted by recently updated
lucadiliello/STORIES • Viewer • Updated Jul 18, 2023 • 947k • 808 • 11
lucadiliello/fever • Viewer • Updated Jul 17, 2023 • 185k • 22
lucadiliello/cc_news • Viewer • Updated Jun 20, 2023 • 150M • 243 • 2
lucadiliello/hotpotqa • Viewer • Updated Jun 6, 2023 • 78.8k • 66 • 2
lucadiliello/newsqa • Viewer • Updated Jun 6, 2023 • 78.4k • 250 • 9
lucadiliello/bioasqqa • Viewer • Updated Jun 6, 2023 • 1.5k • 30
lucadiliello/duorc.paraphrasercqa • Viewer • Updated Jun 6, 2023 • 1.5k • 16
lucadiliello/naturalquestionsshortqa • Viewer • Updated Jun 6, 2023 • 117k • 70 • 3
lucadiliello/dropqa • Viewer • Updated Jun 6, 2023 • 1.5k • 20 • 3
lucadiliello/searchqa • Viewer • Updated Jun 6, 2023 • 134k • 128 • 1