
GLM-ASR with vLLM

An audio speech recognition (ASR) project that integrates the GLM-ASR model with vLLM for high-performance inference. This project provides both local inference capabilities and a scalable API server using Docker.

You can find the full source code in the GitHub project.

This project is an extension/supplement to the original GLM-ASR project, adding vLLM integration for production-ready deployment and OpenAI-compatible API support.

Features

  • Audio Transcription: Transcribe audio files using GLM-ASR model
  • Audio Description: Generate textual descriptions of audio content
  • OpenAI-Compatible API: vLLM server provides OpenAI-compatible API endpoints
  • Docker Support: Easy deployment with Docker and Docker Compose
  • High Performance: Leverages vLLM for efficient GPU-accelerated inference
  • Flexible Audio Input: Supports various audio formats and input methods

Project Structure

glm_asr_vllm/
├── model/                  # Model configuration and implementation
│   ├── configuration_glmasr.py    # GLM-ASR configuration
│   ├── modeling_glmasr.py         # GLM-ASR model implementation
│   ├── modeling_audio.py          # Audio encoding/decoding
│   └── processing_glmasr.py       # Audio processing utilities
├── server/                # vLLM integration files
│   ├── glmasr_audio.py     # Audio processing for vLLM
│   ├── glm_asr.py          # GLM-ASR vLLM model wrapper
│   ├── registry.py         # Model registry (vLLM)
│   └── server_ws.py        # WebSocket server
├── wavs/                  # Sample audio files
├── docker-compose.yaml     # Docker Compose configuration
├── dockerfile             # Docker image build configuration
├── hf_demo.py             # HuggingFace Transformers demo
└── test_vllm_api.py       # OpenAI API client test script

Prerequisites

  • Python 3.12+
  • CUDA-capable GPU (recommended)
  • Docker (for containerized deployment)
  • Docker Compose (optional)

Installation

Option 1: Local Installation

  1. Clone the repository:
git clone <repository-url>
cd glm_asr_vllm
  2. Install dependencies:
pip install torch transformers soundfile librosa openai
  3. Download the model from HuggingFace and place it in the ./model/ directory:
# Download using huggingface-cli (recommended)
huggingface-cli download bupalinyu/glm-asr-eligant --local-dir ./model

# Or use git lfs
git lfs install
git clone https://huggingface.co/bupalinyu/glm-asr-eligant ./model

Note: The model ID on HuggingFace is bupalinyu/glm-asr-eligant. After downloading, ensure all model files are in the ./model/ directory.

Option 2: Docker Deployment

  1. Build the Docker image:
docker build -t vllm-glmasr:latest .
  2. Deploy using Docker Compose:
docker-compose up -d

Usage

HuggingFace Transformers Demo

Run the local demo script to transcribe audio files:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./model/",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda")

processor = AutoProcessor.from_pretrained("./model/", trust_remote_code=True)

# Define conversations
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "audio", "path": "./wavs/dufu.wav"},
                {"type": "text", "text": "Please transcribe this audio."},
            ],
        }
    ],
]

# Process and generate
inputs = processor.apply_chat_template(
    conversations,
    return_tensors="pt",
    sampling_rate=16000,
    audio_padding="longest",
).to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

print(processor.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))

Run the demo:

python hf_demo.py

vLLM API Server

Start the Server

Using Docker Compose:

docker-compose up -d

Or manually with Docker:

docker run -d \
  --name vllm-glmasr \
  --gpus all \
  --ipc host \
  --shm-size 8gb \
  -p 8300:8300 \
  -e CUDA_VISIBLE_DEVICES=2 \
  vllm-glmasr:latest

The server will be available at http://localhost:8300.
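To confirm the server is up before sending audio, you can poll vLLM's OpenAI-compatible /v1/models endpoint. This is a hypothetical stdlib-only helper (not part of the repository); it sends the bearer token matching the `--api-key EMPTY` default from docker-compose.yaml.

```python
import json
import urllib.error
import urllib.request

def served_models(base_url: str = "http://localhost:8300"):
    """Return the list of served model IDs, or None if the server is unreachable."""
    req = urllib.request.Request(
        f"{base_url}/v1/models",
        headers={"Authorization": "Bearer EMPTY"},  # matches --api-key EMPTY
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return [m["id"] for m in json.load(resp)["data"]]
    except (urllib.error.URLError, OSError):
        return None

print(served_models())  # expected to contain "glm-asr-eligant" once the server is up
```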

API Client Example

Use the OpenAI-compatible API to transcribe audio:

import base64
import io
import soundfile as sf
import librosa
import numpy as np
from openai import OpenAI

# Configure client
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8300/v1"
)

# Load and prepare audio (downmix to mono, resample to 16 kHz)
def load_wav_16k(path: str):
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float32)
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000).astype(np.float32)
        sr = 16000  # the returned rate must reflect the resampled audio
    return audio, sr

# Convert to base64
def wav_to_base64(wav: np.ndarray, sr: int) -> str:
    buf = io.BytesIO()
    sf.write(buf, wav, sr, format="WAV", subtype="PCM_16")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# Transcribe
pcm, sr = load_wav_16k("path/to/audio.wav")
audio_b64 = wav_to_base64(pcm, sr)

resp = client.chat.completions.create(
    model="glm-asr-eligant",
    max_completion_tokens=256,
    temperature=0.0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please transcribe this audio.<|audio|>"},
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
            ],
        }
    ],
)

print(resp.choices[0].message.content)

Run the test script:

python test_vllm_api.py

Configuration

Docker Compose Settings

Modify docker-compose.yaml to adjust:

  • GPU Selection: CUDA_VISIBLE_DEVICES environment variable
  • Port: ports mapping (default: 8300:8300)
  • GPU Memory: gpu-memory-utilization parameter (default: 0.1)
  • Model Length: max-model-len parameter (default: 4096)

vLLM Server Parameters

Key parameters configured in docker-compose.yaml:

  • --host: Server host address (default: 0.0.0.0)
  • --port: Server port (default: 8300)
  • --served-model-name: Model name for API calls (default: glm-asr-eligant)
  • --dtype: Data type (default: auto)
  • --tensor-parallel-size: Tensor parallelism size (default: 1)
  • --max-model-len: Maximum model sequence length (default: 4096)
  • --trust-remote-code: Allow remote code execution
  • --gpu-memory-utilization: GPU memory utilization 0-1 (default: 0.1)
  • --api-key: API key for authentication (default: EMPTY)
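Putting the settings above together, a docker-compose.yaml for this service might look like the following sketch. The service name and GPU reservation block are assumptions (the repository ships its own docker-compose.yaml); the flag values mirror the documented defaults.

```yaml
# Hypothetical sketch; see the repository's docker-compose.yaml for the real file.
services:
  vllm-glmasr:
    image: vllm-glmasr:latest
    ipc: host
    shm_size: 8gb
    ports:
      - "8300:8300"
    environment:
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --host 0.0.0.0 --port 8300
      --served-model-name glm-asr-eligant
      --dtype auto --tensor-parallel-size 1
      --max-model-len 4096 --trust-remote-code
      --gpu-memory-utilization 0.1 --api-key EMPTY
```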

Model Architecture

GLM-ASR combines:

  • Whisper Encoder: Audio feature extraction
  • LLM Backbone: Text generation (based on GLM architecture)
  • Multimodal Adapter: Bridges audio and text representations

Key configurations from configuration_glmasr.py:

  • Adapter Type: MLP (default) with merge factor of 4
  • RoPE: Rotary Position Embeddings enabled
  • Spec Aug: Spectral augmentation (disabled by default)
  • Max Whisper Length: 1500 tokens
  • MLP Activation: GELU

Audio Input Requirements

  • Sampling Rate: 16 kHz (audio will be resampled if needed)
  • Channels: Mono (stereo will be downmixed to mono)
  • Formats: WAV, FLAC, OGG (via base64 encoding)
  • Duration: Limited by max_model_len parameter
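A rough back-of-the-envelope for the duration limit, assuming Whisper's usual 50 encoder frames per second and the MLP adapter's merge factor of 4 from configuration_glmasr.py (both are assumptions about how the adapter maps encoder frames to LLM tokens):

```python
# Rough audio-token budget estimate; the exact mapping depends on the model.
FRAMES_PER_SEC = 50   # Whisper encoder output rate (assumption)
MERGE_FACTOR = 4      # adapter merges 4 encoder frames per LLM token
MAX_MODEL_LEN = 4096  # default --max-model-len

def audio_tokens(seconds: float) -> int:
    """Approximate LLM tokens consumed by a clip of the given length."""
    return int(seconds * FRAMES_PER_SEC / MERGE_FACTOR)

print(audio_tokens(30))  # 30 s of audio ≈ 375 tokens, well under MAX_MODEL_LEN
```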

The processor (processing_glmasr.py) supports:

  • Audio file paths
  • NumPy arrays
  • Base64 encoded audio
  • Batch processing with padding

API Reference

Chat Completions Endpoint

POST /v1/chat/completions

Request body:

{
  "model": "glm-asr-eligant",
  "max_completion_tokens": 256,
  "temperature": 0.0,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Please transcribe this audio.<|audio|>"
        },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64_encoded_audio>",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
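The request body above can be built entirely with the standard library. This sketch generates a one-second sine-wave WAV in memory (a stand-in for real speech), base64-encodes it, and assembles the same payload shape; the helper names are illustrative, not part of the project.

```python
import base64
import io
import json
import math
import struct
import wave

def sine_wav_b64(seconds: float = 1.0, sr: int = 16000, freq: float = 440.0) -> str:
    """Generate a mono 16-bit PCM WAV in memory and base64-encode it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)      # mono, per the audio input requirements
        w.setsampwidth(2)      # 16-bit PCM
        w.setframerate(sr)     # 16 kHz
        w.writeframes(b"".join(
            struct.pack("<h", int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / sr)))
            for i in range(int(seconds * sr))
        ))
    return base64.b64encode(buf.getvalue()).decode("utf-8")

payload = {
    "model": "glm-asr-eligant",
    "max_completion_tokens": 256,
    "temperature": 0.0,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe this audio.<|audio|>"},
            {"type": "input_audio",
             "input_audio": {"data": sine_wav_b64(), "format": "wav"}},
        ],
    }],
}
print(json.dumps(payload)[:60])  # POST this JSON to /v1/chat/completions
```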

vLLM Integration

The project integrates GLM-ASR with vLLM through the files in server/:

  • glmasr_audio.py: audio preprocessing for vLLM
  • glm_asr.py: the GLM-ASR model wrapper for vLLM
  • registry.py: registers the GLM-ASR architecture with vLLM's model registry
  • server_ws.py: an optional WebSocket server

Troubleshooting

GPU Memory Issues

Reduce gpu-memory-utilization or decrease max-model-len in docker-compose.yaml.

Slow Inference

  • Enable tensor parallelism with --tensor-parallel-size
  • Ensure proper GPU selection via CUDA_VISIBLE_DEVICES
  • Check GPU utilization with nvidia-smi

Connection Refused

  • Verify the Docker container is running: docker ps
  • Check port mapping is correct
  • Ensure firewall allows traffic on port 8300

Model Loading Issues

  • Verify model weights are in the correct directory (./model/)
  • Check trust_remote_code is enabled
  • Ensure sufficient disk space for model files

License

This project uses the Apache 2.0 license (see server/registry.py).

Acknowledgments

  • GLM-ASR Model: Original model authors
  • vLLM: High-performance LLM inference engine
  • Transformers: HuggingFace model utilities
  • Whisper: OpenAI audio encoder

Related Projects

GPA - ASR, TTS and Voice Conversion in One

GPA MODEL

A unified audio model that combines ASR (Automatic Speech Recognition), TTS (Text-to-Speech), and voice conversion in just 0.3B parameters. This model is specifically designed for:

  • Edge deployment: Lightweight model suitable for mobile devices and edge computing
  • Commercial services: Optimized for large-scale production deployment
  • All-in-one solution: Single model for speech recognition, synthesis, and voice conversion

If you need a more compact, multi-functional audio solution for edge or commercial scenarios, consider exploring the GPA project.
