---
title: "Getting Started with llamaR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with llamaR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE, purl=FALSE}
# Every chunk needs a GGUF model (and usually a GPU), so this vignette is
# static: the code is shown but not run at build time.
knitr::opts_chunk$set(eval = FALSE, purl = FALSE)
```

llamaR provides R bindings to [llama.cpp](https://github.com/ggml-org/llama.cpp)
for running Large Language Models locally, with optional Vulkan GPU acceleration
via [ggmlR](https://github.com/Zabis13/ggmlR).

This vignette walks through the core workflow: get a model, load it, generate
text, tokenize, and extract embeddings. For the chat/server side see
`vignette("chat-and-agents")`.

```{r, eval=FALSE, purl=FALSE}
library(llamaR)
```

---

## 1. Getting a model

llamaR works with GGUF files. Download one from the Hugging Face Hub (cached
under `~/.cache/llamaR/` by default):

```{r, eval=FALSE, purl=FALSE}
# List the GGUF files in a repo
llama_hf_list("TheBloke/Mistral-7B-Instruct-v0.2-GGUF")

# Download one (by filename or by quantization pattern)
path <- llama_hf_download(
  "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
  pattern = "Q4_K_M"
)
```

Or point at any GGUF file you already have on disk.

---

## 2. Loading a model and creating a context

A **model** holds the weights; a **context** holds the working state (KV cache)
for one generation session. Both are external pointers with GC finalizers, so
explicit freeing is optional.

```{r, eval=FALSE, purl=FALSE}
model <- llama_load_model(path, n_gpu_layers = -1L)  # -1 = offload all layers
ctx   <- llama_new_context(model, n_ctx = 4096L)

llama_model_info(model)  # size, n_params, context length, heads, ...
```

`n_gpu_layers = -1L` offloads every layer to the GPU when Vulkan is available,
and falls back to CPU otherwise.

---

## 3. Generating text

```{r, eval=FALSE, purl=FALSE}
llama_generate(ctx, "The capital of France is", max_new_tokens = 32L)
```

Sampling is controlled by arguments (set `temp = 0` for greedy decoding):

```{r, eval=FALSE, purl=FALSE}
llama_generate(
  ctx,
  "Write a haiku about autumn.",
  max_new_tokens = 64L,
  temp           = 0.7,
  top_p          = 0.9,
  top_k          = 40L,
  repeat_penalty = 1.1
)
```

Pass `with_timings = TRUE` to get token throughput alongside the text.

---

## 4. Chat models and templates

Instruction-tuned models expect their prompt wrapped in a chat template
(`[INST]…[/INST]`, `<|im_start|>…`, etc.). `llama_chat_apply_template()` builds
that prompt from a list of role/content messages:

```{r, eval=FALSE, purl=FALSE}
messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user",   content = "Name three primary colors.")
)
prompt <- llama_chat_apply_template(messages)  # uses the model's built-in template
llama_generate(ctx, prompt, max_new_tokens = 64L)
```

For multi-turn chat with history management, use `chat_llamar()` instead; see
`vignette("chat-and-agents")`.

---

## 5. Tokenization

```{r, eval=FALSE, purl=FALSE}
tokens <- llama_tokenize(ctx, "Hello, world!")
tokens
llama_detokenize(ctx, tokens)
```

When tokenizing a prompt that already contains role markers from a chat
template, set `parse_special = TRUE` so markers like `[INST]` become single
control tokens rather than literal characters:

```{r, eval=FALSE, purl=FALSE}
prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi")))
llama_tokenize(ctx, prompt, parse_special = TRUE)
```

---

## 6. Embeddings

Create the context in **embedding mode**, then extract vectors.

Single text:

```{r, eval=FALSE, purl=FALSE}
emb_model <- llama_load_model("embedding-model.gguf")
emb_ctx   <- llama_new_context(emb_model, embedding = TRUE)

v <- llama_embeddings(emb_ctx, "The quick brown fox")
length(v)
```

A batch of texts in one call:

```{r, eval=FALSE, purl=FALSE}
m <- llama_embed_batch(emb_ctx, c("first text", "second text", "third text"))
dim(m)  # one row per input
```

### ragnar-compatible provider

`embed_llamar()` is a higher-level helper that loads the model for you and
returns a provider suitable for `ragnar_store_create(embed = ...)`. Called with
a model only, it returns a closure (partial application); called with text, it
returns a matrix.

```{r, eval=FALSE, purl=FALSE}
library(ragnar)

store <- ragnar_store_create(
  location = "store.duckdb",
  embed    = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1L)
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)
ragnar_retrieve(store, "search query")
```

Combine this with a local `chat_llamar()` for a fully local RAG stack; see
`vignette("chat-and-agents")`.

---

## 7. Serving and chatting

To talk to a model over HTTP, or to use it through the ellmer/ragnar toolchain,
see `vignette("chat-and-agents")`:

* `llama_serve_openai()`: OpenAI-compatible HTTP server.
* `chat_llamar()`: an `ellmer::Chat` backed by a local model.

---

## See also

* `vignette("chat-and-agents")`: server, ellmer, ragnar, OpenCode.
* `?llama_generate`, `?llama_chat_apply_template`, `?embed_llamar`
* The package README for installation and GPU/Vulkan setup.
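
---

## Appendix: comparing embedding vectors

Section 6 stops once the vectors are extracted. A common next step is to compare them with cosine similarity, and the sketch below shows one way to do that in base R. It makes two assumptions: the matrix `m` is a hand-written stand-in with one row per text (the shape described for `llama_embed_batch()`), not real model output, and `cosine_sim()` is a hypothetical helper defined here, not a llamaR function.

```{r, eval=FALSE, purl=FALSE}
# Stand-in for a batch embedding matrix: one row per input text.
# The values are made up for illustration; real embeddings are much wider.
m <- rbind(
  c(1, 0, 1, 0),
  c(2, 0, 2, 0),
  c(0, 3, 0, 0)
)
rownames(m) <- c("first text", "second text", "third text")

# Hypothetical helper (not part of llamaR): pairwise cosine similarity of rows.
cosine_sim <- function(x) {
  unit <- x / sqrt(rowSums(x^2))  # scale each row to unit length
  tcrossprod(unit)                # dot products between all pairs of unit rows
}

sim <- cosine_sim(m)
round(sim, 3)
```

In this toy matrix the first two rows point in the same direction, so their similarity is 1, while the third row is orthogonal to both and scores 0. With real embeddings the same function gives a quick check that related texts score higher than unrelated ones.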