Running Local GGUF Models with llamafile
This guide provides a step-by-step approach to running GGUF models like Llama 2, Phi-3, and Gemma 3 locally using llamafile, covering installation and usage across different operating systems.
Run a GGUF Model with llamafile (Windows / Linux)
Downloading llamafile
Visit the llamafile releases page (https://github.com/Mozilla-Ocho/llamafile/releases) and download the appropriate executable for your operating system (Windows or Linux).
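For example, on Linux the release can be fetched and made executable from the terminal. This is a sketch that assumes the release assets follow the usual llamafile-<version> naming and uses the 0.9.1 version referenced later in this guide; if the link does not match, copy the current one from the releases page:
curl -L -o llamafile https://github.com/Mozilla-Ocho/llamafile/releases/download/0.9.1/llamafile-0.9.1
chmod +x llamafile
On Windows, rename the downloaded file so it ends in .exe (for example llamafile-0.9.1.exe) before running it.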
Downloading GGUF Models
Llama 2 (7B Q8_0):
https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
Phi-3 mini 4k instruct (Full Precision 16-bit floating point):
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf
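Both files can also be fetched from the command line with wget (or curl -L -O on systems without wget), using the same URLs as above:
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q8_0.gguf
wget https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/resolve/main/Phi-3-mini-4k-instruct-fp16.gguf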
Running Models
Windows
Llama 2:
.\llamafile-0.9.1.exe --server --v2 -m llama-2-7b.Q8_0.gguf
Phi-3:
.\llamafile-0.9.1.exe --server --v2 -m Phi-3-mini-4k-instruct-fp16.gguf
Linux
Llama 2:
./llamafile --server --v2 -m llama-2-7b.Q8_0.gguf
Phi-3:
./llamafile --server --v2 -m Phi-3-mini-4k-instruct-fp16.gguf
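Once the server is up, it can be queried over HTTP. The request below is a sketch that assumes the default listen address of localhost:8080 and an OpenAI-compatible /v1/chat/completions endpoint; the model field is a placeholder that a single-model server generally ignores. Adjust the host and port if you changed them:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'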
Optional Parameters
-c N, --ctx-size N: Sets the maximum context size in tokens for chat mode. The default is 8192 tokens, but it can be adjusted. If set to 0, it will use the model's maximum context size.
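For example, to serve Llama 2 on Linux using the model's full trained context window:
./llamafile --server --v2 -m llama-2-7b.Q8_0.gguf -c 0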
Building llamafile from Source (Linux)
Install Dependencies
Red Hat (Fedora, CentOS):
sudo dnf install make unzip
Debian/Ubuntu:
sudo apt-get update && sudo apt-get install build-essential unzip
Clone Repository
git clone https://github.com/Mozilla-Ocho/llamafile.git
Compile and Install
cd llamafile
sudo make install PREFIX=/usr/local
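After installation, you can confirm the binary is available (assuming /usr/local/bin is on your PATH):
llamafile --version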
Important Note: By using PREFIX=/usr/local, you are instructing the make install command to place the executable file (llamafile) in the /usr/local/bin directory.
Download and Store GGUF in models Folder
As an example, to download the Gemma 3 model into a models folder:
mkdir -p models
wget https://huggingface.co/unsloth/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-BF16.gguf -O models/gemma-3-1b-it-BF16.gguf
Converting Models
Use llamafile-convert to turn a GGUF weights file into a self-contained llamafile executable:
llamafile-convert models/gemma-3-1b-it-BF16.gguf
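The conversion should produce a standalone executable in the current directory; assuming it is named gemma-3-1b-it-BF16.llamafile after the input file, it can then be run directly, without the -m flag:
./gemma-3-1b-it-BF16.llamafile --server --v2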