ABEJA Qwen 2.5 7B Japanese - ONNX Models

English

Model Overview

This repository contains ONNX exports of the ABEJA Qwen 2.5 7B Japanese model for cross-platform inference with ONNX Runtime. The export is split into two graphs: a prefill model that processes the initial prompt, and a token generation model that produces output tokens one at a time.

Model Details

Base Model: abeja/Qwen2.5-7B-Japanese
Architecture: Qwen2ForCausalLM
Parameters: ~7.6B
Language: Japanese (primary), English (secondary)
Format: ONNX
Models: Prefill + Token Generation

Available Models

1. Prefill Model

File: prefill/model.onnx
Purpose: Context prefill for initial prompt processing
Size: ~28.7MB
Input: Token sequences
Output: Hidden states

2. Token Generation Model

File: token_gen/model.onnx
Purpose: Token-by-token generation
Size: ~28.7MB
Input: Hidden states
Output: Next token probabilities
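
The input and output descriptions above do not give exact tensor names, shapes, or dtypes. A minimal sketch for reading them from the model files is shown below. It assumes the two paths listed above and uses the standard `get_inputs()` / `get_outputs()` methods of an ONNX Runtime session; `describe_io` and `main` are hypothetical helper names, not part of this repository.

```python
def describe_io(session):
    """Return (inputs, outputs) as lists of (name, shape, type) tuples.

    `session` is expected to behave like onnxruntime.InferenceSession,
    i.e. expose get_inputs() and get_outputs().
    """
    def summarize(nodes):
        return [(node.name, node.shape, node.type) for node in nodes]

    return summarize(session.get_inputs()), summarize(session.get_outputs())


def main():
    # Imported lazily so describe_io can be used without onnxruntime installed.
    import onnxruntime as ort

    for path in ("prefill/model.onnx", "token_gen/model.onnx"):
        inputs, outputs = describe_io(ort.InferenceSession(path))
        print(path)
        print("  inputs: ", inputs)
        print("  outputs:", outputs)
```

Calling `main()` from the repository root prints the tensor names each graph actually expects, which is worth checking before relying on the names used in the usage examples.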

System Requirements

Minimum Requirements

CPU: Intel i5-8400 / AMD Ryzen 5 2600 or better
RAM: 8GB system memory
Storage: 2GB free space
OS: Windows 10/11, macOS 10.15+, Ubuntu 18.04+

Recommended Requirements

CPU: Intel i7-10700K / AMD Ryzen 7 3700X or better
RAM: 16GB system memory
GPU: NVIDIA RTX 3060 (8GB VRAM) or better
Storage: 5GB free SSD space
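
As a rough aid for sizing hardware, the sketch below estimates the memory needed for the weights alone from the ~7.6B parameter count in Model Details. The numeric precision of these ONNX exports is not stated here, so the three precisions are assumptions for illustration, and activations and runtime overhead are not included.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 7.6e9  # "~7.6B" from Model Details

# Assumed precisions; the actual export precision is not documented here.
for label, nbytes in (("float32", 4), ("float16", 2), ("int8", 1)):
    print(f"{label}: ~{weight_memory_gib(NUM_PARAMS, nbytes):.1f} GiB")
```

This prints roughly 28.3 GiB for float32, 14.2 GiB for float16, and 7.1 GiB for int8.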

Supported Devices

Desktop: Windows, macOS, Linux
Cloud: AWS, Google Cloud, Azure
Edge: NVIDIA Jetson Nano, Raspberry Pi 4 (8GB)
Mobile: iOS, Android
Embedded: ARM Cortex-A78, Intel Atom

Usage

Python with ONNX Runtime

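import numpy as np
import onnxruntime as ort

# Load both graphs (paths are relative to the repository root)
prefill_session = ort.InferenceSession('prefill/model.onnx')
token_gen_session = ort.InferenceSession('token_gen/model.onnx')

# Prefill: pass the prompt token IDs as an int64 array of shape [batch, sequence]
# (placeholder IDs here; in practice use the base model's tokenizer output)
input_ids = np.array([[1, 2, 3, 4, 5]], dtype=np.int64)
prefill_outputs = prefill_session.run(None, {"input_ids": input_ids})

# Token generation: feed the prefill hidden states to get next-token probabilities
token_outputs = token_gen_session.run(None, {"hidden_states": prefill_outputs[0]})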
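
The example above runs each model once. To produce more than one token, the two steps are repeated in a loop. The sketch below shows plain greedy decoding under several assumptions that this README does not confirm: the full sequence is re-run through the prefill model at every step (no cache), the token generation output has shape [batch, sequence, vocab], and the tensor names match the example above. `greedy_generate` and `make_onnx_fns` are hypothetical helper names.

```python
def greedy_generate(prefill_fn, token_gen_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy decoding: repeatedly append the most probable next token.

    prefill_fn(ids)      -> hidden states for the whole sequence
    token_gen_fn(hidden) -> indexable sequence of next-token probabilities
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        hidden = prefill_fn(ids)      # re-encode the full sequence (assumes no cache)
        probs = token_gen_fn(hidden)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax over the vocab
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def make_onnx_fns(prefill_session, token_gen_session):
    """Wrap two ONNX Runtime sessions as the callables greedy_generate expects."""
    import numpy as np  # imported lazily; only needed with real sessions

    def prefill_fn(ids):
        feed = {"input_ids": np.array([ids], dtype=np.int64)}
        return prefill_session.run(None, feed)[0]

    def token_gen_fn(hidden):
        out = token_gen_session.run(None, {"hidden_states": hidden})[0]
        # Assumption: out is [batch, sequence, vocab]; use the last position.
        return out[0, -1]

    return prefill_fn, token_gen_fn
```

With the sessions loaded as above, `greedy_generate(*make_onnx_fns(prefill_session, token_gen_session), prompt_ids, max_new_tokens=32)` would return the prompt IDs followed by the generated IDs, which can then be decoded with the tokenizer.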

C++ with ONNX Runtime

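#include <onnxruntime_cxx_api.h>
#include <vector>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::SessionOptions session_options;  // default session options
    Ort::Session session(env, "prefill/model.onnx", session_options);

    // Build a [1, 5] int64 input tensor that views input_data (no copy is made)
    std::vector<int64_t> input_shape = {1, 5};
    std::vector<int64_t> input_data = {1, 2, 3, 4, 5};
    auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value input_tensor = Ort::Value::CreateTensor<int64_t>(
        memory_info, input_data.data(), input_data.size(), input_shape.data(), input_shape.size());

    // Run inference; "input_ids" follows the Python example, the output name is read from the model
    Ort::AllocatorWithDefaultOptions allocator;
    auto output_name = session.GetOutputNameAllocated(0, allocator);
    const char* input_names[] = {"input_ids"};
    const char* output_names[] = {output_name.get()};
    auto outputs = session.Run(Ort::RunOptions{nullptr}, input_names, &input_tensor, 1, output_names, 1);
}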

Installation

# CPU version
pip install onnxruntime

# GPU version (NVIDIA)
pip install onnxruntime-gpu

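# Mobile deployment: ONNX Runtime for mobile is not installed with pip.
# Use the native packages instead, for example the onnxruntime-android
# package from Maven Central or the onnxruntime-c pod from CocoaPods.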
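
Which package is installed determines which execution providers are available at runtime. The sketch below picks providers in a preferred order and falls back to the CPU. `ort.get_available_providers()` and the `providers=` argument of `InferenceSession` are standard ONNX Runtime APIs; the preference order and the helper names `pick_providers` and `load_session` are assumptions for illustration.

```python
# Assumed preference order: NVIDIA GPU, then Apple CoreML, then CPU.
PREFERRED = ("CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider")


def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that are available, in priority order."""
    chosen = [name for name in preferred if name in available]
    return chosen or ["CPUExecutionProvider"]


def load_session(path):
    """Create a session on the best available provider."""
    import onnxruntime as ort  # imported lazily; requires one of the pip packages above

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(path, providers=providers)
```

For example, `load_session('prefill/model.onnx')` would use CUDA when `onnxruntime-gpu` is installed and a compatible GPU is present, and the CPU otherwise.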

Performance

Cross-platform: Runs on any platform supported by ONNX Runtime
Optimized: Exported for inference speed
Memory efficient: Lower memory usage than running the model in PyTorch
Production ready: Suitable for production deployments
Latency: <100ms for prefill, <50ms per generated token
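
Latency depends heavily on hardware, execution provider, and prompt length, none of which are specified for the figures above. A minimal sketch for measuring it on your own machine follows. It uses only the Python standard library; `measure_latency_ms` is a hypothetical helper, and the warmup and run counts are arbitrary defaults.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=3, runs=20):
    """Time repeated calls to fn() and report latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        "runs": runs,
    }
```

With a session and inputs prepared as in the Usage section, something like `measure_latency_ms(lambda: prefill_session.run(None, {"input_ids": input_ids}))` gives a prefill figure to compare against the numbers listed above.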

Author: Mukwaya Mark
