Speech Data
Selected open-source speech data
MLCommons/peoples_speech • Viewer • Updated Nov 20, 2024 • 8.05M • 58k • 164
speechcolab/gigaspeech • Viewer • Updated Nov 23, 2023 • 364k • 23.1k • 135
keithito/lj_speech • Updated Aug 14, 2024 • 814 • 57
legacy-datasets/common_voice • Updated Aug 22, 2024 • 3.6k • 141

text-instruction-data
Text instruction data
Open-Orca/OpenOrca • Viewer • Updated Feb 19 • 2.94M • 9.24k • 1.46k
Open-Orca/SlimOrca-Dedup • Viewer • Updated May 19 • 363k • 562 • 87
Open-Orca/SlimOrca • Viewer • Updated Oct 12, 2023 • 518k • 2.09k • 287
argilla/distilabel-intel-orca-dpo-pairs • Viewer • Updated Aug 7 • 12.9k • 2.78k • 181

text-pretrain-data
Some pretraining datasets for LLMs
allenai/MADLAD-400 • Updated Sep 9, 2024 • 26.9k • 152
CASIA-LM/ChineseWebText • Viewer • Updated Nov 13, 2023 • 1k • 2.17k • 42
allenai/dolma • Updated Apr 17, 2024 • 1.35k • 958
allenai/peS2o • Updated Oct 13, 2024 • 2.16k • 183