Getting Started with llamaR

llamaR provides R bindings to llama.cpp for running large language models locally, with optional Vulkan GPU acceleration via ggmlR. This vignette walks through the core workflow: get a model, load it, generate text, tokenize, and extract embeddings. For chat and serving, see vignette("chat-and-agents").

1. Getting a model

llamaR works with GGUF files. Download one from the Hugging Face Hub (cached under ~/.cache/llamaR/ by default):

# List the GGUF files in a repo
llama_hf_list("TheBloke/Mistral-7B-Instruct-v0.2-GGUF")

# Download one (by filename or by quantization pattern)
path <- llama_hf_download(
  "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
  pattern = "Q4_K_M"
)

Or point at any GGUF file you already have on disk.
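
Before loading a local file, it can be worth confirming that it really is a GGUF file. The sketch below uses base R only and relies on the fact that GGUF files begin with the four ASCII bytes "GGUF". Note that is_gguf() is a hypothetical helper written for this illustration, not a llamaR function.

```r
# Hypothetical helper (not part of llamaR): TRUE if `path` is a regular file
# whose first four bytes are the GGUF magic "GGUF".
is_gguf <- function(path) {
  if (!file.exists(path) || dir.exists(path)) return(FALSE)
  magic <- readBin(path, what = "raw", n = 4L)  # shorter if the file is tiny
  identical(magic, charToRaw("GGUF"))
}
```

A typical use would be stopifnot(is_gguf(path)) just before llama_load_model(path), so that a truncated download or a wrong file fails early with a clear message.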

2. Loading a model and creating a context

A model holds the weights; a context holds the working state (KV cache) for one generation session. Both are external pointers with GC finalizers, so explicit freeing is optional.

model <- llama_load_model(path, n_gpu_layers = -1L)   # -1 = offload all layers
ctx   <- llama_new_context(model, n_ctx = 4096L)

llama_model_info(model)   # size, n_params, context length, heads, ...

n_gpu_layers = -1L offloads every layer to the GPU when Vulkan is available, and falls back to CPU otherwise.

3. Generating text

llama_generate(ctx, "The capital of France is", max_new_tokens = 32L)

Sampling is controlled by arguments such as temp, top_p, top_k, and repeat_penalty (set temp = 0 for greedy decoding):

llama_generate(
  ctx, "Write a haiku about autumn.",
  max_new_tokens = 64L,
  temp           = 0.7,
  top_p          = 0.9,
  top_k          = 40L,
  repeat_penalty = 1.1
)
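
To build intuition for what temp, top_k, and top_p do, here is a toy filter over a made-up logit vector, written in base R. It is a sketch of the textbook technique under simplifying assumptions: the order of the steps differs between implementations, repeat_penalty is not modeled, and filter_probs() is a hypothetical name, not llamaR's or llama.cpp's actual sampler.

```r
# Toy next-token distribution filter. Illustrative only: this is not the
# sampler chain used by llamaR or llama.cpp.
filter_probs <- function(logits, temp = 0.7, top_k = 40L, top_p = 0.9) {
  out <- setNames(numeric(length(logits)), names(logits))
  if (temp <= 0) {
    # Greedy decoding: all probability mass on the highest logit.
    out[which.max(logits)] <- 1
    return(out)
  }
  # Temperature-scaled softmax (subtract the max for numerical stability).
  scaled <- logits / temp
  probs  <- exp(scaled - max(scaled))
  probs  <- probs / sum(probs)

  # top-k: keep the k most probable tokens, then renormalize.
  keep <- order(probs, decreasing = TRUE)
  keep <- keep[seq_len(min(top_k, length(keep)))]
  p    <- probs[keep] / sum(probs[keep])

  # top-p: keep the smallest prefix whose cumulative mass reaches top_p.
  n_keep <- which(cumsum(p) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(keep)
  keep <- keep[seq_len(n_keep)]

  out[keep] <- probs[keep] / sum(probs[keep])
  out
}

logits <- c(Paris = 5, Lyon = 2, London = 1, banana = -3)  # made-up values
filter_probs(logits, temp = 0)               # greedy: Paris gets all the mass
filter_probs(logits, temp = 2, top_k = 3L)   # banana falls outside the top 3
```

Lower temp sharpens the distribution before filtering, while smaller top_k and top_p cut off more of the tail. The next token is then drawn from whatever mass remains.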

Pass with_timings = TRUE to get token throughput alongside the text.

4. Chat models and templates

Instruction-tuned models expect their prompt wrapped in a chat template ([INST]…[/INST], <|im_start|>…, etc.). llama_chat_apply_template() builds that prompt from a list of role/content messages:

messages <- list(
  list(role = "system",    content = "You are a helpful assistant."),
  list(role = "user",      content = "Name three primary colors.")
)

prompt <- llama_chat_apply_template(messages)   # uses the model's built-in template
llama_generate(ctx, prompt, max_new_tokens = 64L)
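
To make the idea of a chat template concrete, the sketch below formats the same kind of message list by hand in the ChatML style (the <|im_start|> family mentioned above). It is for illustration only: format_chatml() is a hypothetical helper, and real models should go through llama_chat_apply_template(), since each model family defines its own template.

```r
# Hand-rolled ChatML formatter, for illustration only (not a llamaR function).
format_chatml <- function(messages, add_assistant = TRUE) {
  turns <- vapply(
    messages,
    function(m) sprintf("<|im_start|>%s\n%s<|im_end|>\n", m$role, m$content),
    character(1)
  )
  prompt <- paste(turns, collapse = "")
  # Open an assistant turn so the model knows it should reply next.
  if (add_assistant) prompt <- paste0(prompt, "<|im_start|>assistant\n")
  prompt
}

messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user",   content = "Name three primary colors.")
)
cat(format_chatml(messages))
```

Printing the result shows why the template matters: the model sees role markers and turn boundaries, not just the raw message text.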

For multi-turn chat with history management, use chat_llamar() instead — see vignette("chat-and-agents").

5. Tokenization

tokens <- llama_tokenize(ctx, "Hello, world!")
tokens

llama_detokenize(ctx, tokens)

When tokenizing a prompt that already contains role markers from a chat template, set parse_special = TRUE so markers like [INST] become single control tokens rather than literal characters:

prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi")))
llama_tokenize(ctx, prompt, parse_special = TRUE)

6. Embeddings

Create the context in embedding mode, then extract vectors. For a single text (embedding-model.gguf is a placeholder for your own embedding model file):

emb_model <- llama_load_model("embedding-model.gguf")
emb_ctx   <- llama_new_context(emb_model, embedding = TRUE)

v <- llama_embeddings(emb_ctx, "The quick brown fox")
length(v)

A batch of texts in one call:

m <- llama_embed_batch(emb_ctx, c("first text", "second text", "third text"))
dim(m)   # one row per input
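
A common next step is comparing embeddings by cosine similarity. The sketch below uses a small made-up matrix in place of real model output, assuming only the one-row-per-input layout shown above. Both cosine_sim() and m_demo are hypothetical names used for illustration.

```r
# Stand-in for the matrix returned by llama_embed_batch(): one row per text.
# The values are made up so the example runs without a model.
m_demo <- rbind(
  first  = c(1, 0, 1),
  second = c(2, 0, 2),
  third  = c(0, 1, 0)
)

# Cosine similarity between all pairs of rows.
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))   # scale each row to unit length
  unit %*% t(unit)
}

sims <- cosine_sim(m_demo)
round(sims, 3)
```

With real embeddings, cosine_sim(m) would give a square matrix in which larger off-diagonal values indicate texts the model places closer together.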

ragnar-compatible provider

embed_llamar() is a higher-level helper that loads the model for you and returns a provider suitable for ragnar_store_create(embed = ...). Called with a model only, it returns a closure (partial application); called with text, it returns a matrix.

library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed    = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1L)
)
ragnar_store_insert(store, documents)   # documents: your prepared chunks (not shown)
ragnar_store_build_index(store)
ragnar_retrieve(store, "search query")

Combine this with a local chat_llamar() for a fully local RAG stack — see vignette("chat-and-agents").

7. Serving and chatting

To talk to a model over HTTP, or to use it through the ellmer/ragnar toolchain, see vignette("chat-and-agents"):

llama_serve_openai() — OpenAI-compatible HTTP server.
chat_llamar() — an ellmer::Chat backed by a local model.