---
language:
- ja
- en
license: apache-2.0
library_name: transformers
base_model: abeja/Qwen2.5-7B-Japanese
tags:
- qwen2.5
- japanese
- text-generation
- pytorch
- quantized
- onnx
- qnn
- qualcomm
pipeline_tag: text-generation
---

# ABEJA Qwen 2.5 7B Japanese - QNN Optimized

This repository contains the ABEJA Qwen 2.5 7B Japanese model optimized for Qualcomm Neural Network (QNN) deployment.

## Model Details

- **Base Model**: abeja/Qwen2.5-7B-Japanese
- **Architecture**: Qwen2ForCausalLM
- **Parameters**: ~7.6B
- **Languages**: Japanese (primary), English (secondary)
- **Quantization**: 4-bit NF4
- **Target Hardware**: Snapdragon 8cx Gen 2+ (SM8350)

## Available Formats

### 1. Quantized PyTorch Model

- **Path**: `quantized_simple/`
- **Format**: 4-bit NF4 quantized
- **Size**: ~4.5GB (reduced from ~15GB)
- **Usage**: Direct inference with transformers

### 2. ONNX Models

- **Path**: `onnx/`
- **Models**:
  - `prefill/model.onnx` - Context prefill
  - `token_gen/model.onnx` - Token generation
- **Usage**: Cross-platform inference

### 3. Quantized ONNX Models

- **Path**: `quantized_onnx/`
- **Format**: Dynamic quantization (INT8)
- **Usage**: Optimized ONNX inference

### 4. QNN Compiled Models

- **Path**: `qnn_compiled/`
- **Format**: Qualcomm Neural Network format
- **Target**: Snapdragon devices
- **Usage**: Native ARM64 deployment

## Usage

### Quantized PyTorch Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "marcusmi4n/abeja-qwen2.5-7b-japanese-qnn",
    subfolder="quantized_simple",
)
tokenizer = AutoTokenizer.from_pretrained(
    "marcusmi4n/abeja-qwen2.5-7b-japanese-qnn",
    subfolder="quantized_simple",
)

# Japanese text generation ("Hello, I am ...")
inputs = tokenizer("こんにちは、私は", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### ONNX Inference

```python
import onnxruntime as ort

# Load the prefill ONNX model (the file must be local; ONNX Runtime does not
# resolve Hub repo IDs, so download the repository first)
session = ort.InferenceSession("onnx/prefill/model.onnx")

# Inspect the expected inputs, then run inference with matching tensors...
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```
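### Quantized ONNX Inference

The INT8 models under `quantized_onnx/` load exactly like the FP32 models above via `ort.InferenceSession`. If you want to reproduce the dynamic quantization step yourself, the following is a minimal sketch using onnxruntime's `quantize_dynamic`; the exact options used to produce the published files are not documented here, so treat them as assumptions.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (weight-only) INT8 quantization of the FP32 prefill export.
# Paths assume the repository has been downloaded locally.
quantize_dynamic(
    "onnx/prefill/model.onnx",
    "quantized_onnx/prefill/model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```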
### QNN Deployment

```bash
# Deploy the compiled models to a Snapdragon device.
# The path refers to the locally downloaded repository
# (see the download sketch at the end of this card).
adb push marcusmi4n/abeja-qwen2.5-7b-japanese-qnn/qnn_compiled/ /data/local/tmp/qnn_model/

# Use the QNN runtime on the device for inference
```

## Performance

- **Quantization**: ~75% size reduction
- **Speed**: roughly 2-3x faster inference than the unquantized model
- **Memory**: ~4.5GB RAM usage
- **Throughput**: 8-15 tokens/sec on Snapdragon 8cx Gen 2+

## Hardware Compatibility

- ✅ Snapdragon 8cx Gen 2+
- ✅ Snapdragon 8cx Gen 3
- ✅ Snapdragon 8 Gen 1+
- ✅ Windows on ARM devices
- ✅ Microsoft Surface Pro X
- ✅ Dell Latitude 7420

## File Structure

```
marcusmi4n/abeja-qwen2.5-7b-japanese-qnn/
├── quantized_simple/          # 4-bit quantized PyTorch model
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer.json
│   └── model_info.json
├── onnx/                      # ONNX models
│   ├── prefill/model.onnx
│   └── token_gen/model.onnx
├── quantized_onnx/            # Quantized ONNX models
│   ├── prefill/model_quantized.onnx
│   └── token_gen/model_quantized.onnx
├── qnn_compiled/              # QNN compiled models
│   ├── prefill/
│   ├── token_gen/
│   └── deployment_info.json
└── README.md                  # This file
```

## License

Apache 2.0, the same license as the base ABEJA Qwen 2.5 model.

## Citation

```bibtex
@misc{abeja-qwen25-qnn,
  title={ABEJA Qwen 2.5 7B Japanese - QNN Optimized},
  author={QNN Conversion Pipeline},
  year={2025},
  url={https://huggingface.co/marcusmi4n/abeja-qwen2.5-7b-japanese-qnn}
}
```

## Base Model Citation

Please also cite the original ABEJA Qwen 2.5 model:

```bibtex
@article{abeja-qwen2.5,
  title={ABEJA Qwen 2.5: Japanese Language Model},
  author={ABEJA Inc.},
  journal={arXiv preprint},
  year={2024}
}
```
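## Downloading the Model Files

The ONNX and QNN deployment steps above expect the repository contents on your local machine. Below is a minimal sketch using `huggingface_hub`; the `local_dir` value is an arbitrary example, not a required path.

```python
from huggingface_hub import snapshot_download

# Fetch the full file tree shown in "File Structure" above
local_path = snapshot_download(
    repo_id="marcusmi4n/abeja-qwen2.5-7b-japanese-qnn",
    local_dir="abeja-qwen2.5-7b-japanese-qnn",  # hypothetical local directory
)
print(local_path)
```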