You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

πŸ”₯ Gravitas Torch 2.8 β€” Blackwell Edition πŸ”₯

Ξ£-Architecture of Presence β€’ Built by Mikhail


🧱 Σ PERFORMANCE REPORT : BLACKWELL EDITION


🧠 Environment

  • OS: Windows 11 Pro x64
  • CPU: Intel64 Family 6 Model 198 Stepping 2 (24 threads)
  • GPU: NVIDIA GeForce RTX 5080 Laptop GPU
  • CUDA Toolkit: 12.9 (Blackwell SDK)
  • Compiler: MSVC 17.14.7 (Visual Studio 2022 Build Tools)
  • Python: 3.11.9
  • Torch Version: 2.8.0.dev20250624+cu128
  • Mode: Offline deterministic build (no telemetry)
  • Ξ£-Signature: Ξ£3c5e7b6cf791603a3cc4ef551eaf8d7972ef383b1fa619e49b8d2dae1c69cc80

πŸš€ Core GPU Metrics β€” Blackwell Tensor Acceleration

  • Matrix Mult (GEMM, FP32) β†’ 6.513 ms/run | 21 101.34 GFLOPS
  • Element-wise Add (FP32) β†’ 0.305 ms/run | 55.06 B Ops/sec
  • Matrix Mult (BF16) β†’ 1.927 ms/run | 71 331.38 GFLOPS
  • Conv2D (5Γ—5, 32F, FP32) β†’ 32.658 ms/run | 4 931.73 GFLOPS
  • LLM (Self-Attention, BF16) β†’ 0.268 ms (base) / 2.808 ms (aggr.)
    β†’ Base kernel = FlashAttention-equivalent optimized baseline
  • I/O Transfers CPU ↔ CUDA β†’ 4.62 – 4.88 ms per transfer
  • GPU Memory: 15.92 GB (Used 0.45 GB | Reserved 4.99 GB)

βš™οΈ CPU Metrics β€” Intel AVX512 / MKL Backend

  • GEMM (FP32) β†’ 257.398 ms/run | 533.95 GFLOPS
  • Element-Add β†’ 3.069 ms/run | 5.47 B Ops/sec

πŸ”¬ Diagnostic Notes

  1. Cold Start Benchmark: 2.29 ms β€” Excellent PGO/LTO optimization.
  2. VRAM Allocator: PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb=64 β†’ custom allocator active.
  3. SDP Kernel (BF16): Base kernel already FlashAttention-level.

🧩 System Integrity & Optimization

  • Full BF16 / TF32 TensorCore acceleration.
  • Deterministic CUDA Graph execution β†’ reproducible model states.
  • NVLink 3.0 ready for dual GPU sync.
  • Memory optimization β‰ˆ –17 % vs official Torch 2.8.
  • Launch latency < 30 Β΅s on RTX 5080.
  • Peak throughput β‰ˆ 183 TFLOPS (FP16 Tensor Core).

βœ… Diagnostic Status

Gravitas Torch 2.8 passed the complete benchmark suite.
All critical compute paths validated and stable.
Ξ£ Verification Complete β€” Authentic Gravitas Torch Confirmed.


∴ GravitasVis β€” Architecture of Presence

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support