Qolda

Introduction

Built on top of InternVL3.5 and Qwen3, Qolda is a small vision-language model designed to operate in Kazakh, Russian, and English. The model has 4.3B parameters and comprises the InternViT-300M vision encoder and MLP Projector components from InternVL3.5-4B, along with the Qwen3-4B language model. Model training was performed using the InternVL framework 💙

The name "Qolda" reflects both its design and purpose in Kazakh: "in hand" (қолда) for its compact accessibility, and "to support" (қолдау) for its assistive nature.

Evaluation Results

Evaluation was conducted separately for text-only and vision-language modalities. Qolda demonstrates significant performance improvements for Kazakh while maintaining comparable performance on Russian and English.

Text Benchmarks

Performance comparison on language tasks including MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP.

Note: The comparison below presents Qolda's performance against Qwen3-4B on Kazakh language benchmarks only. Evaluation results for additional models and performance on Russian and English will be added later.

Model	Mode	Avg	MMLU	Winogrande	HellaSwag	ARC	GSM8K	DROP
Qwen3-4B	Direct	52.00	42.43	56.88	42.04	64.77	73.62	32.27
Qwen3-4B	Think	57.73	52.98	51.27	41.86	79.65	64.82	55.81
Qolda	Direct	58.77	46.55	56.37	55.75	73.62	63.50	56.84
Qolda	Think	71.64	64.56	70.54	57.70	89.99	79.47	67.59

Vision Benchmarks

Performance comparison on vision-language tasks including AI2D, MMStar, RealWorldQA, and KazakhOCR.

Note: The comparison below presents Qolda's performance against InternVL3.5-4B on Kazakh vision-language benchmarks only. Evaluation results for additional models and performance on Russian and English will be added later.

Model	Mode	Avg	AI2D	MMStar	RealWorldQA	KazakhOCR
InternVL3.5-4B	Direct	42.23	52.33	47.47	38.32	30.81
InternVL3.5-4B	Think	42.58	51.42	49.33	38.74	30.81
Qolda	Direct	59.39	66.06	55.47	54.97	61.06
Qolda	Think	60.44	67.62	56.53	57.07	60.54

Model Usage

To run inference with Transformers, please follow the guidelines from InternVL.

Alternatively, to run the model via an OpenAI-compatible server, you can use lmdeploy:

pip install lmdeploy>=0.9.1

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch

Note: Unlike the original InternVL3.5, this model requires the enable_thinking parameter to be explicitly set in the extra_body of your API calls. However, depending on the task complexity, an empty thinking response might be generated.

Then, make a standard API call:

import base64
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)

License

This model is licensed under the Apache License 2.0.

Кіріспе

InternVL3.5 және Qwen3 негізінде жасалған Qolda — қазақ, орыс және ағылшын тілдерінде жұмыс істеуге арналған шағын көру-тілдік моделі (vision-language model). Модель 4,3 млрд параметрге ие және InternVL3.5-4B моделінің InternViT-300M көру энкодері мен MLP проектор компоненттерін, сондай-ақ Qwen3-4B тілдік моделін қамтиды. Модельді оқыту InternVL фреймворкі көмегімен жүзеге асырылды 💙

"Qolda" атауы модельдің дизайны мен мақсатын қазақ тіліндегі қолда сөзінің қос мағынасы арқылы көрсетеді. Біріншісі, шағын әрі қолжетімді болуы үшін "қолда" cөзі арқылы және екіншісі, көмекші табиғаты үшін, "қолдау" мағынасы арқылы.

Бағалау нәтижелері

Мәтіндік және көру-тілдік модальділіктер үшін бағалау бөлек жүргізілді. Qolda орыс және ағылшын тілдеріндегі өзінің бастапқы деңгейін сақтай отырып, қазақ тіліндегі өнімділігін айтарлықтай жақсартты.

Мәтіндік бенчмарктар

MMLU, Winogrande, HellaSwag, ARC, GSM8K және DROP сияқты тілдік тапсырмалардағы өнімділікті салыстыру.

Ескерту: Төмендегі кестедегі Qolda және Qwen3-4B модельдерінің салыстырылуы тек қазақ тіліндегі бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

Model	Mode	Avg	MMLU	Winogrande	HellaSwag	ARC	GSM8K	DROP
Qwen3-4B	Direct	52.00	42.43	56.88	42.04	64.77	73.62	32.27
Qwen3-4B	Think	57.73	52.98	51.27	41.86	79.65	64.82	55.81
Qolda	Direct	58.77	46.55	56.37	55.75	73.62	63.50	56.84
Qolda	Think	71.64	64.56	70.54	57.70	89.99	79.47	67.59

Көру бенчмарктары

AI2D, MMStar, RealWorldQA және KazakhOCR сияқты көру-тілдік тапсырмаларындағы өнімділікті салыстыру.

Ескерту: Төмендегі кестедегі Qolda және InternVL3.5-4B модельдерінің салыстырылуы тек қазақ тіліндегі көру-тілдік бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

Model	Mode	Avg	AI2D	MMStar	RealWorldQA	KazakhOCR
InternVL3.5-4B	Direct	42.23	52.33	47.47	38.32	30.81
InternVL3.5-4B	Think	42.58	51.42	49.33	38.74	30.81
Qolda	Direct	59.39	66.06	55.47	54.97	61.06
Qolda	Think	60.44	67.62	56.53	57.07	60.54

Модельді қолдану

Transformers арқылы инференсті іске қосу үшін InternVL ұсынған нұсқаулықтарды орындаңыз.

Немесе, модельді OpenAI-үйлесімді сервер арқылы іске қосу үшін lmdeploy құралын пайдалануға болады:

pip install lmdeploy>=0.9.1

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch

Ескерту: Qolda-ның түпнұсқалық InternVL3.5-тен айырмашылығы, бұл модель API call жасаған кезде extra_body бөлігінде enable_thinking параметрінің нақты орнатылуын талап етеді. Тапсырманың күрделілігіне байланысты бос thinking жауабы қайтарылуы мүмкін.

Содан соң, стандартты API call жасаңыз:

import base64
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)