
Carlo Moro

cnmoro

AI & ML interests

None yet

Recent Activity

liked a model 3 days ago
katanemo/Arch-Router-1.5B
liked a model 7 days ago
amd/Nitro-E
reacted to nouamanetazi's post with šŸ‘ 7 days ago
After training š’š¦šØš„š‹šŒšŸ‘ on šŸ‘šŸ–šŸ’ š‡šŸšŸŽšŸŽš¬ for nearly a month, I've come to realize something most people overlook: š¢š§šŸš«ššš¬š­š«š®šœš­š®š«šž š¢š¬ š­š”šž š¦ššš¤šž-šØš«-š›š«šžššš¤ šŸšššœš­šØš« š¢š§ š‹š‹šŒ š­š«ššš¢š§š¢š§š . šŸ”„ Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious šš‚š‚š‹ šžš«š«šØš«š¬, or when your expensive GPU cluster is running at šŸ”šŸŽ% šžšŸšŸš¢šœš¢šžš§šœš², the problem isn't your model. It's most probably a š¦š¢š¬š®š¬šž šØšŸ š­š”šž š”ššš«šš°ššš«šž. šŸ› ļø Questions that seemed simple but had no clear answers: Why is šŒšØš„ š­š«ššš¢š§š¢š§š  š¬š„šØš°šžš« š­š”ššš§ ššžš§š¬šž š¦šØššžš„š¬? Which šš‚š‚š‹ šŸš„ššš š¬ should we actually set? How often should we checkpoint without killing throughput? That's why we built š“š”šž š’š¦šØš„ š“š«ššš¢š§š¢š§š  šš„ššš²š›šØšØš¤ šŸ“–: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the š¢š§šŸš«ššš¬š­š«š®šœš­š®š«šž š„ššš²šžš« that most teams get wrong. We validated real vs theoretical bandwidth across the entire stack: š‡ššŒšŸ‘ š”š¢š­š­š¢š§š  šŸ‘ š“š/š¬, šš•š‹š¢š§š¤ šŸ’.šŸŽ š«šžšššœš”š¢š§š  šŸ•šŸ–šŸ” š†š/š¬, šš‚šˆšž š†šžš§šŸ’ ššš­ šŸšŸ’.šŸ š†š/š¬. Then we ran collective operations across šŸšŸšŸ– š†šš”š¬ (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from šŸ’šŸ–šŸŽ š†š/š¬ on a single node to šŸ‘šŸšŸŽ-šŸ‘šŸ“šŸŽ š†š/š¬ across 16 nodes. If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging. š“š”šž š’š¦šØš„ š“š«ššš¢š§š¢š§š  šš„ššš²š›šØšØš¤: https://lnkd.in/e5MKXUHS Shared with ā¤ļø by the HuggingFace team

Organizations

Wise Intelligence, Smol Community