
Model Card for STiFLeR7/Phi-3-mini-4k-gguf

Model Details

Model Description: This is a Phi-3 Mini 4K Instruct variant quantized to 4-bit GGUF format (~2.3 GB) for efficient inference with llama.cpp, llama-cpp-python, or Ollama.

  • Developed by: Microsoft (base model), quantized and shared by STiFLeR7
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: STiFLeR7
  • Model type: Causal Language Model (decoder-only)
  • Language(s): English (primary)
  • License: Responsible AI License (from Microsoft Phi-3)
  • Finetuned from: microsoft/Phi-3-mini-4k-instruct

Model Sources

  • Base model: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct

Uses

Direct Use

  • Text generation in resource-constrained environments.
  • Running instruction-tuned Phi-3 Mini on CPU or GPU via GGUF.
  • Edge inference and deployment with llama.cpp, llama-cpp-python, or Ollama.

Downstream Use [optional]

  • Can serve as a lightweight assistant for experimentation.
  • Basis for further evaluation and benchmarking of quantized models.

Out-of-Scope Use

  • High-stakes decision making (e.g., medical, legal, financial).
  • Safety-critical applications.
  • Applications requiring full precision or fine-tuning.

Bias, Risks, and Limitations

  • Model inherits biases and limitations from the Phi-3 Mini 4K base.
  • Quantization may reduce reasoning and factual accuracy.
  • Not optimized for fine-tuning.
  • Context length limited to 4K tokens.
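Because the 4K window covers both the prompt and the generated tokens, callers should budget the two together. A minimal sketch of that budgeting, assuming a rough heuristic of ~4 characters per English token (exact counts require the model's tokenizer):

```python
CONTEXT_LIMIT = 4096     # Phi-3 Mini 4K context window, in tokens
CHARS_PER_TOKEN = 4      # rough heuristic for English text; not the real tokenizer

def fits_context(prompt: str, max_new_tokens: int) -> bool:
    # Estimate prompt tokens from character count, then check the total budget.
    est_prompt_tokens = len(prompt) / CHARS_PER_TOKEN
    return est_prompt_tokens + max_new_tokens <= CONTEXT_LIMIT

print(fits_context("a" * 1_000, max_new_tokens=256))    # short prompt fits: True
print(fits_context("a" * 20_000, max_new_tokens=256))   # long prompt does not: False
```

For precise accounting, tokenize the prompt with the loaded model (llama-cpp-python exposes a tokenizer on the `Llama` object) rather than relying on this heuristic.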

Recommendations: Users should be aware of the base model's biases and of the additional accuracy loss introduced by 4-bit quantization, and should validate outputs before relying on them.


How to Get Started with the Model

With llama.cpp

./main -m Phi-3-mini-4k-gguf/model.gguf -p "Write a short poem about AI." -n 128

(In recent llama.cpp builds the CLI binary is named llama-cli rather than main.)

With llama-cpp-python

from llama_cpp import Llama

llm = Llama(model_path="Phi-3-mini-4k-gguf/model.gguf")
out = llm("Hello world", max_tokens=128)
print(out["choices"][0]["text"])  # the call returns a dict; the text lives under "choices"

With Ollama

# ollama create expects a Modelfile, not the GGUF file directly:
echo 'FROM ./Phi-3-mini-4k-gguf/model.gguf' > Modelfile
ollama create phi3-mini-4k-gguf -f Modelfile
ollama run phi3-mini-4k-gguf

Training Details

Training Data

  • Inherits training data from the base model, microsoft/Phi-3-mini-4k-instruct.

Training Procedure

  • Quantization performed using llama.cpp GGUF conversion tools.

Training Hyperparameters

  • Quantization: 4-bit (e.g., q4_K_M or similar; specify if known).
  • File size: ~2.3 GB.
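The ~2.3 GB figure is consistent with simple arithmetic: file size ≈ parameter count × average bits per weight. A back-of-the-envelope check, assuming ~3.8B parameters and the ~4.85 bits/weight average commonly cited for q4_K_M (both figures are approximations, not taken from this repository):

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    # Ignores file metadata and the few tensors typically kept at higher precision.
    return n_params * bits_per_weight / 8 / 1e9

size = gguf_size_gb(3.8e9, 4.85)
print(round(size, 2))  # ~2.3, matching the published file size
```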

Speeds, Sizes, Times [optional]

  • Optimized for CPU/GPU inference in edge scenarios.

Evaluation

Testing Data

  • Not separately evaluated post-quantization. Inherits evaluation from base model.

Factors

  • Precision loss due to quantization.
  • Edge vs. GPU deployment performance may vary.

Metrics

  • Not benchmarked beyond functional testing.

Results

  • Produces valid completions at reduced memory cost.

Summary

  • Model Type: Quantized Phi-3 Mini 4K Instruct (GGUF 4-bit).
  • Use Case: Efficient inference on CPUs and GPUs.
  • Limitations: Reduced reasoning accuracy, inherits base model biases.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator from Lacoste et al. (2019).

  • Hardware Type: CPU/GPU (for inference).
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications

Model Architecture and Objective

  • Transformer decoder-only architecture (phi3), ~3.8B parameters.
  • Instruction-tuned variant of Phi-3 Mini.

Compute Infrastructure

  • Quantization performed locally (details omitted).

Hardware

  • Tested on CPU + consumer GPU.

Software

  • llama.cpp, llama-cpp-python, Hugging Face Hub.

Citation [optional]

If you use this model, please cite the base model:

@misc{abdin2024phi3,
  title={Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone},
  author={Abdin, Marah and others},
  year={2024},
  eprint={2404.14219},
  archivePrefix={arXiv},
  url={https://huggingface.co/microsoft/Phi-3-mini-4k-instruct}
}

Glossary [optional]

  • GGUF: a binary file format for storing quantized model weights, introduced by llama.cpp as the successor to GGML.
  • Quantization: reducing the numerical precision of model weights (e.g., from 16-bit floats to 4-bit integers) to shrink model size and speed up inference.
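To illustrate the glossary entry, here is a toy symmetric 4-bit quantizer (a didactic sketch only; llama.cpp's k-quants use block-wise scales and are considerably more sophisticated):

```python
def quantize_4bit(values):
    # Map floats onto the 16 signed integer levels [-8, 7] with one shared scale.
    scale = max(abs(v) for v in values) / 7
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats; precision lost in rounding stays lost.
    return [x * scale for x in q]

weights = [0.12, -0.53, 0.88, -0.07]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
# Each restored value lands within one quantization step of the original.
print(q, [round(w, 3) for w in restored])
```

Each 4-bit code needs only half a byte instead of the two or four bytes of a float, which is where the size reduction described above comes from.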

More Information [optional]

For detailed usage of GGUF format, see llama.cpp GGUF documentation.


Model Card Authors [optional]

  • STiFLeR7

Model Card Contact

[More Information Needed]
