---
language:
- ja
- en
license: apache-2.0
library_name: transformers
base_model: abeja/Qwen2.5-7B-Japanese
tags:
- qwen2.5
- japanese
- text-generation
- pytorch
- quantized
- onnx
- qnn
- qualcomm
pipeline_tag: text-generation
---

# ABEJA Qwen 2.5 7B Japanese - QNN Optimized

This repository contains the ABEJA Qwen 2.5 7B Japanese model optimized for Qualcomm Neural Network (QNN) deployment.

## Model Details

- **Base Model**: abeja/Qwen2.5-7B-Japanese
- **Architecture**: Qwen2ForCausalLM
- **Parameters**: ~7.6B
- **Languages**: Japanese (primary), English (secondary)
- **Quantization**: 4-bit NF4
- **Target Hardware**: Snapdragon 8cx Gen 2+ (SM8350)

## Available Formats

### 1. Quantized PyTorch Model

- **Path**: `quantized_simple/`
- **Format**: 4-bit NF4 quantized
- **Size**: ~4.5GB (reduced from ~15GB)
- **Usage**: Direct inference with transformers

### 2. ONNX Models

- **Path**: `onnx/`
- **Models**:
  - `prefill/model.onnx` - Context prefill
  - `token_gen/model.onnx` - Token generation
- **Usage**: Cross-platform inference

### 3. Quantized ONNX Models

- **Path**: `quantized_onnx/`
- **Format**: Dynamic quantization (INT8)
- **Usage**: Optimized ONNX inference

### 4. QNN Compiled Models

- **Path**: `qnn_compiled/`
- **Format**: Qualcomm Neural Network format
- **Target**: Snapdragon devices
- **Usage**: Native ARM64 deployment

## Usage

### Quantized PyTorch Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "marcusmi4n/abeja-qwen2.5-7b-japanese-qnn",
    subfolder="quantized_simple",
)
tokenizer = AutoTokenizer.from_pretrained(
    "marcusmi4n/abeja-qwen2.5-7b-japanese-qnn",
    subfolder="quantized_simple",
)

# Japanese text generation ("Hello, I am ...")
inputs = tokenizer("こんにちは、私は", return_tensors="pt")
outputs = model.generate(**inputs, max_length=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### ONNX Inference

```python
import onnxruntime as ort

# Load the prefill ONNX model (the file must be local; ONNX Runtime does not
# resolve Hub repo IDs, so download the repository first)
session = ort.InferenceSession("onnx/prefill/model.onnx")

# Inspect the expected inputs, then run inference with matching tensors...
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```
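### Quantized ONNX Inference

The INT8 models under `quantized_onnx/` load exactly like the FP32 models above via `ort.InferenceSession`. If you want to reproduce the dynamic quantization step yourself, the following is a minimal sketch using onnxruntime's `quantize_dynamic`; the exact options used to produce the published files are not documented here, so treat them as assumptions.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (weight-only) INT8 quantization of the FP32 prefill export.
# Paths assume the repository has been downloaded locally.
quantize_dynamic(
    "onnx/prefill/model.onnx",
    "quantized_onnx/prefill/model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```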
### QNN Deployment

```bash
# Deploy the compiled models to a Snapdragon device.
# The path refers to the locally downloaded repository
# (see the download sketch at the end of this card).
adb push marcusmi4n/abeja-qwen2.5-7b-japanese-qnn/qnn_compiled/ /data/local/tmp/qnn_model/

# Use the QNN runtime on the device for inference
```

## Performance

- **Quantization**: ~75% size reduction
- **Speed**: roughly 2-3x faster inference than the unquantized model
- **Memory**: ~4.5GB RAM usage
- **Throughput**: 8-15 tokens/sec on Snapdragon 8cx Gen 2+

## Hardware Compatibility

- ✅ Snapdragon 8cx Gen 2+
- ✅ Snapdragon 8cx Gen 3
- ✅ Snapdragon 8 Gen 1+
- ✅ Windows on ARM devices
- ✅ Microsoft Surface Pro X
- ✅ Dell Latitude 7420

## File Structure

```
marcusmi4n/abeja-qwen2.5-7b-japanese-qnn/
├── quantized_simple/          # 4-bit quantized PyTorch model
│   ├── model.safetensors
│   ├── config.json
│   ├── tokenizer.json
│   └── model_info.json
├── onnx/                      # ONNX models
│   ├── prefill/model.onnx
│   └── token_gen/model.onnx
├── quantized_onnx/            # Quantized ONNX models
│   ├── prefill/model_quantized.onnx
│   └── token_gen/model_quantized.onnx
├── qnn_compiled/              # QNN compiled models
│   ├── prefill/
│   ├── token_gen/
│   └── deployment_info.json
└── README.md                  # This file
```

## License

Apache 2.0, the same license as the base ABEJA Qwen 2.5 model.

## Citation

```bibtex
@misc{abeja-qwen25-qnn,
  title={ABEJA Qwen 2.5 7B Japanese - QNN Optimized},
  author={QNN Conversion Pipeline},
  year={2025},
  url={https://huggingface.co/marcusmi4n/abeja-qwen2.5-7b-japanese-qnn}
}
```

## Base Model Citation

Please also cite the original ABEJA Qwen 2.5 model:

```bibtex
@article{abeja-qwen2.5,
  title={ABEJA Qwen 2.5: Japanese Language Model},
  author={ABEJA Inc.},
  journal={arXiv preprint},
  year={2024}
}
```
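## Downloading the Model Files

The ONNX and QNN deployment steps above expect the repository contents on your local machine. Below is a minimal sketch using `huggingface_hub`; the `local_dir` value is an arbitrary example, not a required path.

```python
from huggingface_hub import snapshot_download

# Fetch the full file tree shown in "File Structure" above
local_path = snapshot_download(
    repo_id="marcusmi4n/abeja-qwen2.5-7b-japanese-qnn",
    local_dir="abeja-qwen2.5-7b-japanese-qnn",  # hypothetical local directory
)
print(local_path)
```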