Tuesday, July 15, 2025

Repost: Tidy RAG in R with ragnar

 Reposted from the original at: https://blog.stephenturner.us/p/tidy-rag-in-r-with-ragnar

Retrieval augmented generation in R using the ragnar package. Demonstration: scraping text from relevant links on a website and using RAG to ask about a university's grant funding.

 

Note: After I wrote this post last week, the Tidyverse team released ragnar 0.2.0 on July 12. Everything here should still work, but take a look at the release notes to learn about some nice new features that aren’t covered here.


I’ve written a little about retrieval-augmented generation (RAG) here before: first about GUIs for local LLMs with RAG, and later about building a little RAG app to chat with a bunch of PDFs in your Zotero library using Open WebUI.

In an oversimplified nutshell: LLMs can't help you with things that are not in their training data or are past their training cutoff date. With RAG, you can provide relevant snippets from those documents as context to the LLM so that its answers are grounded in a collection of known content from a trusted document corpus.

Even more oversimplified: RAG lets you “chat with your documents.”
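
To make the retrieval idea concrete, here is a minimal base R sketch. It is not how ragnar works internally: it assumes a toy bag-of-words "embedding" in place of a real embedding model, and the three documents and the helper names (tokenize, embed_bow, cosine_sim) are made up for illustration.

```r
# Toy corpus standing in for a trusted document collection (hypothetical text)
docs <- c(
  "The school lists active research grants on its website.",
  "Admissions deadlines for the master's program are in January.",
  "Faculty office hours are posted each semester."
)

# Lowercase and split on anything that is not a letter
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")[[1]]
  tokens[nzchar(tokens)]
}

# Bag-of-words "embedding": word counts over a shared vocabulary
vocab <- unique(unlist(lapply(docs, tokenize)))
embed_bow <- function(x) {
  as.numeric(table(factor(tokenize(x), levels = vocab)))
}

# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

query <- "Which research grants are active?"
doc_vectors <- lapply(docs, embed_bow)
scores <- vapply(doc_vectors, cosine_sim, numeric(1), b = embed_bow(query))

# The best-scoring document is what would be handed to the LLM as context
best_doc <- docs[which.max(scores)]
best_doc
```

A real embedding model captures meaning rather than exact word overlap, but the shape of the step is the same: embed the query, score it against the stored chunks, and pass the top matches to the LLM.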

In this post I’ll demonstrate how to scrape text from a website and implement a RAG workflow in R using a new addition to the tidyverse: ragnar, along with functionality from ellmer to interact with LLM APIs through R.

Demonstration

Python has historically dominated the AI product development space, but with recent additions like ellmer, chores, gander, and mall, R is quickly catching up.

Here I’m going to use the new ragnar package in the tidyverse (source, documentation) to build a little RAG workflow in R that uses the OpenAI API.

I’m going to ingest information from the UVA School of Data Science (SDS) website at datascience.virginia.edu, then ask some questions that won’t have answers in the base model’s training data.

Setup

If you want to follow along you’ll need an OpenAI API key. You can set that up at platform.openai.com. Once you do that, run usethis::edit_r_environ() to add a new OPENAI_API_KEY environment variable, and restart your R session.
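
After restarting, a quick check along these lines can confirm that the key is visible to the session without printing it. This is only a sketch; the variable names api_key and key_is_set are my own.

```r
# Check that the key is visible to the current R session without printing it
api_key <- Sys.getenv("OPENAI_API_KEY")
key_is_set <- nzchar(api_key)

if (!key_is_set) {
  message("OPENAI_API_KEY is not set; add it to .Renviron and restart R.")
}
```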

In R I’m going to need the ellmer and ragnar packages. Because ragnar isn’t yet on CRAN, I'll have to install it with pak or devtools.

install.packages("ellmer")
pak::pak("tidyverse/ragnar")

Create a vector store

The first thing I want to do is to find all the links to other pages at datascience.virginia.edu, scrape all of that content, and stick it into a DuckDB database. Most of this is modified straight from the ragnar documentation, hence the context chunking still looks like I’m ingesting a book.

library(ragnar)

# Find all links on a page
base_url <- "https://datascience.virginia.edu/"
pages <- ragnar_find_links(base_url)

# Create and connect to a vector store
store_location <- "pairedends.ragnar.duckdb"
store <- ragnar_store_create(
  store_location,
  embed = \(x) ragnar::embed_openai(x),
  overwrite = TRUE
)

# Read each website and chunk it up
for (page in pages) {
  message("ingesting: ", page)
  chunks <- page |>
    ragnar_read(frame_by_tags = c("h1", "h2", "h3")) |>
    ragnar_chunk(boundaries = c("paragraph", "sentence")) |>
    # add context to chunks
    dplyr::mutate(
      text = glue::glue(
        r"---(
        # Excerpt from UVA School of Data Science (SDS) page
        link: {origin}
        chapter: {h1}
        section: {h2}
        subsection: {h3}
        content: {text}

        )---"
      )
    )
  ragnar_store_insert(store, chunks)
}
# Build the index
ragnar_store_build_index(store)
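
To see what chunking on paragraph boundaries means in practice, here is a rough base R sketch. It is not ragnar's algorithm, only an illustration of the idea: the function chunk_by_paragraph, the max_chars limit, and the sample text are all hypothetical.

```r
# Greedily pack paragraphs into chunks of at most `max_chars` characters.
# A single paragraph longer than `max_chars` becomes its own oversized chunk.
chunk_by_paragraph <- function(text, max_chars = 200) {
  paragraphs <- trimws(strsplit(text, "\n\\s*\n")[[1]])
  paragraphs <- paragraphs[nzchar(paragraphs)]

  chunks <- character(0)
  current <- ""
  for (p in paragraphs) {
    candidate <- if (nzchar(current)) paste(current, p, sep = "\n\n") else p
    if (nchar(candidate) <= max_chars || !nzchar(current)) {
      current <- candidate
    } else {
      chunks <- c(chunks, current)
      current <- p
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

# Made-up page text with three short paragraphs
page_text <- paste(
  "Active grants are listed by sponsor.",
  "Each entry shows the award amount.",
  "Performance periods are also reported.",
  sep = "\n\n"
)

chunks <- chunk_by_paragraph(page_text, max_chars = 80)
chunks
```

Here the first two paragraphs fit together under the limit and the third starts a new chunk, so related sentences stay together instead of being cut mid-thought.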

Retrieval

Now suppose we want to ask questions about research grant funding at the UVA School of Data Science (SDS). First, let’s see what ChatGPT tells us without providing any context at all when we ask it to tell us about SDS’s active grant funding.

I don’t have specific information about “SDS’s active grant funding” because “SDS” could refer to several different organizations or entities (e.g., Social and Decision Sciences departments, Sustainable Development Solutions, Students for a Democratic Society, or a company with the acronym SDS).

Now let’s use the OpenAI API, this time providing context from the vector store we just created, which should include information about UVA SDS’s funding from their active grants listing page.

library(ragnar)
store_location <- "pairedends.ragnar.duckdb"
store <- ragnar_store_connect(store_location, read_only = TRUE)

# Get relevant chunks
text <- "Tell me about some of SDS's active grant funding."
relevant_chunks <- ragnar_retrieve_vss(
  store,
  text,
  top_k = 10
)
relevant_chunks

# Set up the system prompt and the chat object.
# A retrieval tool is registered on the chat below so the LLM can fetch chunks itself.
system_prompt <- stringr::str_squish(
  r"--(
  You are an expert about UVA's School of Data Science.
  You are concise. You always respond by first directly
  quoting material from the SDS website, then adding
  your own additional context and interpretation.
  Always include links to the source materials used.
  )--"
)
chat <- ellmer::chat_openai(
  system_prompt,
  model = "gpt-4.1-mini",
  params = ellmer::params(temperature = .5)
)

# Register a retrieve tool with ellmer
ragnar_register_tool_retrieve(chat, store, top_k = 10)

# Run the query
chat$chat(text)

Now when asking about some of SDS’s active grant funding, we get a better response:

Here are some examples of active grant funding at UVA's School of Data Science (SDS):

  • Resource and Data Exchange (RDE) has a grant award of $5,000,000 with a performance period from Feb 29, 2024 to Feb 27, 2029, currently 25% completed.

  • Biomedical Data Science Innovation Labs: An Intensive Research Project Development Program has an award amount of $1,943,352.

  • Supporting Pre-Service Teachers Mathematical Discourse through Co-Design of Teaching Simulation Tools is sponsored by the US NSF with a total award of $250,000.

  • Activity Aware Bayesian Deep Learning is sponsored by the US NSF.

  • Enhancing Security and Privacy of Community Cyberinfrastructures for Collaborative Research has a total award of $882,518 from the US NSF.

  • Evaluating the Efficacy of Repurposed Medications has an award amount of $86,134.

  • Accelerating racial and ethnic equity in SCI rehabilitation outcomes has an award amount of $53,170.

  • The application of data science on financial systems with an emphasis on safety and security has an award amount of $5,000.

These grants cover a range of topics from education, biomedical innovation, security, to social equity in rehabilitation.

Source: UVA SDS Active Grants
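
Registering a tool lets the model decide when to retrieve. A simpler alternative is to paste retrieved chunks straight into the prompt yourself. The base R sketch below shows that assembly step only; the retrieved data frame, its column names, the example.org URLs, and the instruction wording are all hypothetical stand-ins.

```r
# Hypothetical retrieved chunks; in the workflow above these would come from
# the vector store rather than being typed in by hand
retrieved <- data.frame(
  origin = c("https://example.org/grants", "https://example.org/about"),
  text = c("Grant A has an award amount of $100.", "The school has several labs."),
  stringsAsFactors = FALSE
)

question <- "Tell me about some active grant funding."

# Label each chunk with its source so the model can cite it
context <- paste0(
  "[", seq_len(nrow(retrieved)), "] ", retrieved$origin, "\n", retrieved$text,
  collapse = "\n\n"
)

prompt <- paste0(
  "Answer using only the context below and cite sources by number.\n\n",
  "Context:\n", context, "\n\n",
  "Question: ", question
)
cat(prompt)
```

The resulting string could then be passed to chat$chat() in place of the bare question.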

API cost and local models

As I’m writing this, GPT-4.1 mini is ridiculously cheap at $0.40 per million input tokens (see more on their API pricing page). The demonstration here cost me $0.01 (the text embedding and vector storage cost a fraction of a penny in addition to the input/output completions).
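
As a back-of-the-envelope check, input cost is just tokens divided by a million, times the per-million price. The sketch below uses the $0.40 input price quoted above; the token count is a made-up figure for illustration, not a measurement from this demonstration.

```r
# Price quoted above for GPT-4.1 mini input tokens (USD per million tokens)
input_price_per_million <- 0.40

# Hypothetical token count for a question plus ten retrieved chunks
input_tokens <- 20000

input_cost <- input_tokens / 1e6 * input_price_per_million
input_cost
```

This covers input tokens only; output tokens and embeddings are billed separately.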

There are plenty of open/local models that support tool use, as well as open/local text embedding models, all of which can be run through Ollama. I tried the same exercise above using Nomic Embed through Ollama for text embedding, and tried several models with tool-calling abilities, including qwen3, mistral, llama3.1, llama3.2, llama3.3, and the new llama4, and the results were all terrible. I don't know if this was due to the models themselves or to the embedding model I chose, which happens to be the most popular embedding model available in Ollama. My advice: put $1 on your OpenAI API account, get to work, and stop worrying about it.

Learning more

This recent webinar from Posit CTO Joe Cheng doesn’t cover RAG at all. In fact, he mentions near the top that RAG should not be your first choice when simply changing a system prompt would be good enough. It’s a good talk and I learned a few nice things along the way.

Tuesday, June 3, 2025

Repost: The Modern R Stack for Production AI

 Reposted from the original at: https://blog.stephenturner.us/p/r-production-ai

...

Python isn't the only game in town anymore: R can interact with local and cloud LLM APIs, inspect and modify your local R environment and files, implement RAG, computer vision, NLP, evals, & much more.

There was a time in late 2023 to early 2024 when I and probably many others in the R community felt like R was falling woefully behind Python in tooling for development using AI and LLMs. This is no longer the case. The R community, and Posit in particular, have been on an absolute tear bringing new packages online to take advantage of all the capabilities that LLMs provide. Here are a few that I’ve used and others I’m keeping a close eye on as they mature.

ellmer: interact with almost any LLM in R

I can't remember when I first started using Ollama to interact with local LLMs like Llama and Gemma, but I first used the ollamar package (CRAN, GitHub, pkgdown) last summer, and wrote a post on using this package to ask Llama3.1 what’s interesting about a set of genes, or to summarize papers on bioRxiv.

Shortly after that, I wrote an R package to summarize bioRxiv preprints with a local LLM using ollamar.

In addition to the ollamar package, Johannes Gruber introduced the rollama package (CRAN, GitHub, pkgdown) around the same time, though I haven’t used it myself.

Earlier this year Posit announced ellmer, a new package that allows you to interact with most major LLM providers, not just local models running via Ollama. The ellmer package supports ChatGPT, Claude, AWS Bedrock, Azure OpenAI, DeepSeek, Gemini, Hugging Face, Mistral, Perplexity, and others. It also supports streaming and tool calling. I wrote another post more recently demonstrating how to summarize Bluesky posts on a particular topic using ellmer.

The ellmer package’s documentation and vignettes are top-notch. Check it out.

chores: automate repetitive tasks

The chores package connects ellmer to your source editor in RStudio and Positron, providing a library of ergonomic LLM assistants designed to help you complete repetitive, hard-to-automate tasks quickly. These “assistants” let you do things like highlight some code and convert tests to testthat 3, document functions with roxygen, or convert error handling to use cli instead of stop or rlang. There’s a nice demonstration screencast on the documentation website.

gander: allow an LLM to talk to your R environment

The gander package feels kind of like a Copilot but it also knows how to talk to objects in your R environment. It can inspect file contents from elsewhere in the project that you're working on, and it also has context about objects in your environment, like variables, data frames, and functions. There’s a nice demonstration screencast on the documentation website.

btw: describe your R environment to an LLM

The btw package is brand new and still in development. You can use it interactively, where it assembles context on your R environment, package documentation, and working directory, copying the results to your clipboard for easy pasting into chat interfaces. It also allows you to wrap methods that can be easily incorporated into ellmer tool calls for describing various kinds of objects in R. I’d recommend reading Posit’s “Teaching chat apps about R packages” blog post.

ragnar: retrieval-augmented generation (RAG) in R

The ragnar R package helps implement Retrieval-Augmented Generation (RAG) workflows in R using ellmer to connect to any LLM you wish on the backend. It provides some really handy utility functions for reading files or entire websites into a data frame, converting files to markdown, and finding all links on a webpage to ingest. It helps you chunk text into sections, embed into a vector store (using duckdb by default), and retrieve relevant chunks to provide an LLM with context given a prompt.

I’m working on another post right now with a deeper dive into using ragnar.

vitals: LLM evaluations in R

The vitals package provides LLM evaluation in R, aimed at ellmer users. It’s an R port of the widely adopted Python framework Inspect. As of this writing, the documentation notes that vitals is highly experimental and much of its documentation is aspirational.

kuzco: computer vision in R

The kuzco package is designed as a computer vision assistant, giving local models guidance on classifying images and returning structured data. The goal is to standardize outputs for image classification and use LLMs as an alternative option to keras or torch. It currently supports classification, recognition, sentiment, and text extraction.

mall: use an LLM for NLP on your data

The mall package provides several convenience functions for sentiment analysis, text summarization, text classification, extraction, translation, and verification.

I recently used the mall package to run a quick sentiment analysis of #Rstats posts on Bluesky.

Functions in the mall package integrate smoothly with piped workflows in dplyr. For example:

library(mall)
data("reviews")  # example dataset shipped with mall
reviews |>
  llm_sentiment(review)

Other resources

I think this is just the tip of the iceberg, and I can’t wait to see what else Posit and others in the R community are doing in this space.

Here's a recording from a recent webinar Joe Cheng (Posit CTO) gave on harnessing LLMs for data analysis.

You might also take a look at the recordings from posit::conf(2024) which include a few AI/LLM-focused talks.

Also check out the posit::conf(2025) schedule at https://posit.co/conference/. There’s a workshop on Programming with LLM APIs: A Beginner’s Guide in R and Python, four talks in a session titled LLMs with R and Python, several lightning talks that will likely cover LLMs in R, and four more talks in a session titled Facepalm-driven Development: Learning From AI and Human Errors.

The R community has clearly stepped up. Whether you're building prototypes, shipping production tools, or just exploring what LLMs can do, R is now a real and robust option. I’m excited to see where we go from here.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.