Reposted from Paired Ends at https://blog.stephenturner.us/p/llm-translate-documentation.
---
The lang package overrides the ? and help() functions in your R session. The translated help page will appear in the help pane in RStudio or Positron. It can also translate your Roxygen documentation.
Using LLMs in R
Most of the developer tooling for AI/LLM training and evaluation is Python-centric, but just over the past few months we’ve seen a surge of new tooling for AI/LLM applications for the R ecosystem.
ollamar and rollama provide wrappers around the Ollama API, allowing you to run LLMs locally on your machine. I recently wrote a few posts, one demonstrating how to use ollamar, and another demonstrating a package that uses Ollama internally.
elmer is a new package in the tidyverse that allows you to interact with many different LLMs via R (Claude, ChatGPT, Gemini, and Ollama too).
mall is an interesting one that provides an easy way to run multiple LLM predictions against a data frame (sentiment analysis, summarization, classification, extraction, translation, etc.).
pal provides LLM assistants that let you highlight code and ask for things like roxygen documentation, testthat tests, etc.
Shiny Assistant explains how things work in Shiny, and can help build Shiny apps for you (in either R or Python).
The lang package
The lang package (source, documentation) is an interesting new addition to the mlverse in R. From the documentation:
lang overrides the ? and help() functions in your R session. If you are using RStudio or Positron, the translated help page will appear in the usual help pane. If you are a package developer, lang helps you translate your documentation, and to include it as part of your package. lang will use the same ? override to display your translated help documents.
Let’s look at an example. I recently invited my colleague and co-author VP Nagraj to write about the rplanes package we published and released on CRAN for plausibility analysis in epidemiological forecasting.
One of the first functions you might use from this package is read_forecast(), which reads a probabilistic quantile forecast CSV file for downstream plausibility analysis. Let’s look at the help for this function.
library(rplanes)
?read_forecast
En Español
Now let’s get help in Spanish. First, load the lang package and tell it that we’re using llama3.2. Then we’ll set the system language to Spanish and ask for help again.
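library(lang)
llm_use("ollama", "llama3.2")  # assumes Ollama is running locally with llama3.2 already pulled
Sys.setenv(LANGUAGE = "spanish")
?read_forecast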
My fluency in Spanish is limited to general conversation and travel needs, so I can’t easily verify the accuracy of the translation of this technical language, but when I ran some of it back through Google Translate it seemed to be mostly faithful. Notice how things that shouldn’t be translated aren’t: function names, arguments, columns in the returned output, and code in the examples.
हिंदी में … … باللغة العربية
What about non-Western languages?
Let’s try Hindi!
Sys.setenv(LANGUAGE="hindi")
?read_forecast
I can’t verify the accuracy of this translation beyond running some of the text back through Google Translate, but judging by that round trip, at first glance the translation isn’t bad.
What about Arabic?
Sys.setenv(LANGUAGE="arabic")
?read_forecast
If you’re a native speaker of any of these, I’d love to know what you think. Chat with me on Bluesky (@stephenturner.us).
Translating your package’s Roxygen docs
The lang documentation has a great section on using lang as a package developer. You can translate all of your Roxygen documentation into the desired language, then edit those translations by hand as needed. A special helper function then re-roxygenizes your docs and places them in a special inst/man-lang folder. The lang docs explain how this all works. Once you’ve done this, a user who has the lang package loaded gets your pre-computed and optionally edited translations instead of having to wait around for the LLM to translate the help.
Demo
Here’s a demo using a very small package I wrote for something completely different. Don’t worry about all the Docker stuff described here. There’s one single function, missyelliot(), that simply reverse complements a DNA sequence (“take that flip it and reverse it”). That is, it’ll convert GATTACA to its reverse complement, TGTAATC.
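To make the operation concrete, here is a minimal base R sketch of reverse complementing. It is not the rpdd implementation, and the name revcomp_sketch() is made up for illustration. It assumes an uppercase sequence containing only A, C, G, and T.

```r
# Hypothetical helper for illustration; not the rpdd source code.
# Assumes an uppercase DNA string containing only A, C, G, and T.
revcomp_sketch <- function(dna) {
  # Complement each base: A <-> T and C <-> G.
  complemented <- chartr("ACGT", "TGCA", dna)
  # Split into single characters, reverse their order, and glue them back together.
  paste(rev(strsplit(complemented, "")[[1]]), collapse = "")
}

revcomp_sketch("GATTACA")
# [1] "TGTAATC"
```

Applying the function twice returns the original sequence, which is a quick sanity check for any reverse complement routine.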
Restart your R environment, and install the package using devtools/remotes. Load both rpdd and lang.
devtools::install_github("stephenturner/rpdd")
library(rpdd)
library(lang)
Now get some help for missyelliot(). If your language environment variable is English, you’ll get the English help.
Now, change your system language to Spanish, and try it again. Notice how the translated help is instantaneous: you’re relying on the pre-translated and possibly hand-edited translations that come with the package rather than asking an LLM to translate the help for you on the fly.
Sys.setenv(LANGUAGE = "spanish")
?missyelliot
If your language is set to something without a pre-populated translation, you’ll have to register a model through Ollama and translate in real time.
Sys.setenv(LANGUAGE = "russian")
?missyelliot
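The lookup order described above (use a translation that ships with the package when one exists, otherwise translate on the fly) can be sketched in base R. This only illustrates the idea and is not how lang is implemented: the function name, the man-lang/<language>/<topic>.Rd layout, and the return values are all assumptions made for the example.

```r
# Hypothetical sketch; not lang's actual code. The folder layout
# (<pkg_dir>/man-lang/<language>/<topic>.Rd) is assumed for illustration.
choose_help_source <- function(pkg_dir, topic,
                               language = Sys.getenv("LANGUAGE")) {
  language <- tolower(language)
  # No language set, or English requested: show the regular help page.
  if (language %in% c("", "en", "english")) {
    return("original")
  }
  translated <- file.path(pkg_dir, "man-lang", language, paste0(topic, ".Rd"))
  # A shipped translation can be shown immediately; anything else
  # needs a registered model to translate in real time.
  if (file.exists(translated)) "pretranslated" else "llm"
}

# Build a throwaway package directory with one Spanish help file.
pkg_dir <- file.path(tempdir(), "fakepkg")
dir.create(file.path(pkg_dir, "man-lang", "spanish"), recursive = TRUE,
           showWarnings = FALSE)
invisible(file.create(file.path(pkg_dir, "man-lang", "spanish", "missyelliot.Rd")))

choose_help_source(pkg_dir, "missyelliot", "spanish")  # "pretranslated"
choose_help_source(pkg_dir, "missyelliot", "russian")  # "llm"
```

Only the "llm" branch would require a registered model, which is why the Spanish help in the demo appears instantly while the Russian help does not.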
I think this might be one of the most impactful applications of LLMs inside a developer environment since the rise and rapid adoption of Copilots. The ability to instantly access documentation in multiple languages through lang is a significant step toward making data science more accessible and inclusive for the global R community. It breaks down language barriers that have historically made it challenging for non-English speakers to fully engage with R's rich ecosystem of tools and packages.