# Long Sequence Performance
- The tables below show the pre-training performance of the LLAMA2-7B model on H100 GPUs and the LLAMA3-8B model on B200 GPUs with CP (context parallelism),
and compare it against the results without CP at various input sequence lengths.
The model-parallel configuration (TP, PP, DP, CP) and the achieved performance of each CP run are listed under the "With CP" columns.
For the non-CP runs, we use the most performant model- and data-parallel configuration that fits within the memory capacity of each GPU system.
## LLAMA3-8B (FP8) - B200
- Container: [NeMo25.04.rc2](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
- System: DGX-B200

| SeqLen (K) | # of GPUs | Batch Size | Without CP: TFLOPS / GPU | With CP: TP | With CP: PP | With CP: DP | With CP: CP | With CP: TFLOPS / GPU | Speedup (with CP / without CP) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 8 | 8 | 512 | 1,671 | 1 | 1 | 2 | 1 | 1,671 | 1.00 |
| 16 | 16 | 256 | 1,717 | 1 | 1 | 4 | 1 | 1,717 | 1.00 |
| 32 | 32 | 128 | 1,549 | 1 | 1 | 4 | 2 | 1,624 | 1.05 |
| 64 | 64 | 64 | 1,481 | 1 | 1 | 4 | 4 | 1,600 | 1.08 |
| 128 | 128 | 32 | 1,438 | 2 | 1 | 4 | 4 | 1,588 | 1.10 |
| 256 | 256 | 16 | 1,162 | 4 | 1 | 4 | 4 | 1,590 | 1.37 |
| 512 | 512 | 8 | 607 | 4 | 1 | 4 | 8 | 1,619 | 2.67 |
| 1024 | 1024 | 4 | - 1) | 4 | 1 | 4 | 16 | 1,608 | - |

1) Since the maximum TP size is limited by the number of query groups (8 in LLAMA3-8B),
even with full activation recomputation it is impossible to run the LLAMA3-8B model on a 1024K token sequence without CP due to the GPU memory constraints.
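
The speedup column is the ratio of the two TFLOPS / GPU columns. As a minimal sketch, the snippet below recomputes it for the B200 rows that have a non-CP baseline; the dictionary and the `cp_speedup` helper are illustrative names, not part of NeMo.

```python
# (without_cp, with_cp) TFLOPS per GPU, keyed by sequence length in K tokens.
# Values are copied from the LLAMA3-8B (FP8) B200 table; the 1024K row is
# omitted because it has no non-CP baseline.
b200_tflops = {
    8: (1671, 1671),
    16: (1717, 1717),
    32: (1549, 1624),
    64: (1481, 1600),
    128: (1438, 1588),
    256: (1162, 1590),
    512: (607, 1619),
}


def cp_speedup(without_cp, with_cp):
    """Ratio of per-GPU throughput with CP to throughput without CP."""
    return with_cp / without_cp


# Round to two decimals, matching the precision used in the table.
speedups = {k: round(cp_speedup(*v), 2) for k, v in b200_tflops.items()}

for seq_len_k, speedup in speedups.items():
    print(f"{seq_len_k:>4}K: {speedup:.2f}x")
```

The same calculation applies to the H100 table.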
## LLAMA2-7B (FP8) - H100
- Container: [NeMo24.03.01.framework](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo/tags)
- System: DGX-H100

| SeqLen (K) | # of GPUs | Batch Size | Without CP: TFLOPS / GPU | With CP: TP | With CP: PP | With CP: DP | With CP: CP | With CP: TFLOPS / GPU | Speedup (with CP / without CP) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 4 | 4 | 1024 | 768 | 1 | 1 | 4 | 1 | 768 | 1.00 |
| 8 | 8 | 512 | 730 | 1 | 2 | 4 | 1 | 730 | 1.00 |
| 16 | 16 | 256 | 660 | 2 | 1 | 8 | 1 | 660 | 1.00 |
| 32 | 32 | 128 | 595 | 2 | 1 | 8 | 2 | 610 | 1.03 |
| 64 | 64 | 64 | 534 | 4 | 1 | 8 | 2 | 574 | 1.07 |
| 128 | 128 | 32 | 424 | 4 | 1 | 8 | 4 | 555 | 1.31 |
| 256 | 256 | 16 | 392 | 4 | 1 | 8 | 8 | 549 | 1.40 |
| 512 | 512 | 8 | 104 | 8 | 1 | 4 | 16 | 549 | 5.28 |
| 1024 | 1024 | 4 | 26.5 | 8 | 1 | 4 | 32 | 536 | 20.23 |
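
One way to read the H100 configurations is sketched below. It rests on two assumptions that the table itself does not state: the GPU count is the product TP × PP × DP × CP, and CP splits each sequence evenly across its ranks. The helper names are hypothetical, and the check covers only the H100 rows listed here.

```python
from math import prod

# (seq_len_k, num_gpus, tp, pp, dp, cp), copied from the
# LLAMA2-7B (FP8) H100 table.
h100_rows = [
    (4, 4, 1, 1, 4, 1),
    (8, 8, 1, 2, 4, 1),
    (16, 16, 2, 1, 8, 1),
    (32, 32, 2, 1, 8, 2),
    (64, 64, 4, 1, 8, 2),
    (128, 128, 4, 1, 8, 4),
    (256, 256, 4, 1, 8, 8),
    (512, 512, 8, 1, 4, 16),
    (1024, 1024, 8, 1, 4, 32),
]


def world_size(tp, pp, dp, cp):
    """Assumed GPU count: the product of all four parallel sizes."""
    return prod((tp, pp, dp, cp))


def tokens_per_cp_rank_k(seq_len_k, cp):
    """Assumed per-rank sequence shard in K tokens (even split across CP)."""
    return seq_len_k // cp


shards_k = []
for seq_len_k, num_gpus, tp, pp, dp, cp in h100_rows:
    # Each listed configuration should account for every GPU in the run.
    assert world_size(tp, pp, dp, cp) == num_gpus
    shard_k = tokens_per_cp_rank_k(seq_len_k, cp)
    shards_k.append(shard_k)
    print(f"{seq_len_k:>4}K on {num_gpus:>4} GPUs: {shard_k}K tokens per CP rank")
```

Under these assumptions, the per-rank sequence shard never exceeds 32K tokens in any H100 row, which is consistent with the "With CP" throughput staying in a narrow band (536 to 574 TFLOPS / GPU) from 64K to 1024K tokens.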