Wednesday, September 11, 2024

Create a free Llama 3.1 405B-powered chatbot on any GitHub repo in 1 minute (cross-posted from Paired Ends)

This blog has moved. This is reposted from Paired Ends:

https://blog.stephenturner.us/p/create-a-free-llama-405b-llm-chatbot-github-repo-huggingface


Llama 3.1 405B is the first open-source LLM on par with frontier models GPT-4o and Claude 3.5 Sonnet. I’ve been running the 70B model locally for a while now using Ollama + Open WebUI, but you’re not going to run the 405B model on your MacBook.

Here I demonstrate how to create and deploy a Llama 3.1 405B-powered chatbot on any GitHub repo in under a minute using HuggingFace Assistants, with an R package as an example.


Create and deploy a HuggingFace Assistant

I’m going to use the tfboot R package as an example here (paper, GitHub). I wrote the tfboot package to provide methods for bootstrapping transcription factor binding site disruption, statistically quantifying the impact across gene sets of interest against an empirical null distribution. The package is meant to integrate with Bioconductor data structures and workflows on the front end and Tidyverse-friendly tools on the back end. You can read more about the package in the paper.

The 42-second video below demonstrates how to create & deploy your chatbot.




  1. First, go to HuggingFace Assistants (https://huggingface.co/chat/assistants) and click Create new assistant.
  2. Fill in some details. Give your chatbot a name and description, and a system prompt (“You are a chatbot that answers questions about the tfboot codebase”).
  3. Select meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 as your model.
  4. Fill in some example prompts, like “what does this package do?” or “how do I do X, Y, or Z with this tool?”
  5. Now, the important part. Under internet access, select “Specific Links” and provide the URL to the GitHub Repo.
  6. Hit create, then activate. You’re done.

Demo with the tfboot R package

Once you create and activate your model, you’ll see an interface that will look familiar if you’ve ever used ChatGPT or similar LLMs.

Landing page for the new chatbot I made for the tfboot GitHub repo.


From here you can click one of the example prompts, or type your own prompt. Let’s give it a try. First, a softball pitch. What does this package do? This should be fairly obvious from the README.

Prompt: What’s this package do?


Next, let’s get a little basic info on usage. The chatbot looks through the package’s RMarkdown vignettes and pulls out a high-level protocol on how to run the analysis. There wasn’t much context on what motifbreakR is or what you have to do upstream of running tfboot, but further prompting can help with this.

Prompt: How do I assess the statistical significance of transcription factor binding sites in gene sets of interest?


Finally, let’s see what it can tell us about the statistical underpinnings of what the package is doing. Of note, this isn’t simply a regurgitation of what’s in the package documentation or vignettes. It’s drawing on the code and documentation itself and integrating that with general information about bootstrapping, null hypothesis significance testing, and transcription factor binding site disruption analysis.

Prompt: Can you explain the theory that underpins what this package does?


Keep in mind that any assistant you create will be public. You can play around with the tfboot chatbot here. Also, know that the 405B model is extremely resource intensive. A few times the bot would time out and I’d have to retry the prompt. This happens far less often with the 70B model, and the response times are faster. You might experiment and see for yourself where the speed/accuracy sweet spot is for your specific needs.

Saturday, August 24, 2024

biorecap: an R package for summarizing bioRxiv preprints with a local LLM

This is re-posted from my newsletter, where I'll be posting from now on:

https://blog.stephenturner.us/p/biorecap-r-package-for-summarizing-biorxiv-preprints-local-llm



---

TL;DR

  • I wrote an R package that summarizes recent bioRxiv preprints using a locally running LLM via Ollama + ollamar, and produces a summary HTML report from a parameterized RMarkdown template. The package is on GitHub and you can install it with devtools: https://github.com/stephenturner/biorecap.

  • I published a paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.

  • I wrote both the package and the paper with assistance from LLMs: GitHub Copilot for documentation, llama3.1:70b for tests, llama3.1:405b via HuggingFace Assistants for drafting, and GPT-4o for editing.

The biorecap package

I recently started to explore prompting a local LLM (e.g. Llama3.1) from R via Ollama. Last week I wrote about how to do this, with a few contrived examples: (1) trying to figure out what’s interesting about a set of genes, and (2) summarizing a set of preprints retrieved from bioRxiv’s RSS feed. The gene set analysis really was contrived — as I mentioned in the footnote to that post, you’d never actually want to do a gene set analysis this way when there are plenty of well-understood first-principles approaches to gene set analysis.

The second example wasn’t so contrived. I subscribe to bioRxiv’s RSS feeds, along with many other journals and blogs in genetics, bioinformatics, data science, synthetic biology, and other fields. The fusillade of new preprints and peer-reviewed papers relevant to my work is relentless. Late last year bioRxiv started adding AI summaries to newly published preprints, but reading them required multiple clicks: one out of my RSS feed to the paper’s landing page, then another into the AI summary. I wanted a quick TL;DR on all the recent papers published in particular subject areas (e.g., bioinformatics, genomics, or synthetic biology).

Shortly after putting together that one-off demo, on a sultry Sunday afternoon in Virginia too hot to do anything outside, I took the code from that post, generalized it a bit, and created the biorecap package: https://github.com/stephenturner/biorecap.
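The core idea can be sketched in a few lines of R. This is a hypothetical illustration, not biorecap’s actual source: build a two-sentence-summary prompt from a preprint’s title and abstract, then send it to a locally running model via ollamar’s generate() (the commented lines assume Ollama is running locally with a llama3.1 model pulled).

```r
# Hypothetical sketch of the approach, not biorecap's internals.
# Build a two-sentence-summary prompt for a preprint's title + abstract.
build_prompt <- function(title, abstract) {
  paste0(
    "Summarize this bioRxiv preprint in two sentences ",
    "for a busy researcher.\n",
    "Title: ", title, "\n",
    "Abstract: ", abstract
  )
}

# Sending the prompt to a local model (requires Ollama running locally):
# library(ollamar)
# generate("llama3.1", build_prompt(title, abstract), output = "text")

p <- build_prompt("Example title", "Example abstract.")
```

In biorecap this kind of per-paper prompting is wrapped up behind the report-generating function shown below.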

You can install it with devtools/remotes:


remotes::install_github("stephenturner/biorecap", 
                        dependencies=TRUE)


Create a report with 2-sentence summaries of recent papers published in the bioinformatics, genomics, and synthetic biology sections, using llama3.1:8b running locally on your laptop:


my_subjects <- c("bioinformatics", "genomics", "synthetic_biology")

biorecap_report(output_dir=".", 
                subject=my_subjects, 
                model="llama3.1")


This report will look different each time you run it, depending on what’s been published that day in the sections you’re interested in.

Figure 1 from Turner 2024 arXiv: Example biorecap report for bioinformatics, genomics, and synthetic biology from August 6, 2024.


The package documentation provides further instructions on usage.

I mentioned I used LLMs to help write this package. I started out using Positron for package development, but quickly fell back to RStudio because I wanted GitHub Copilot integration. Among other things, Copilot is great for quickly writing Roxygen documentation for relatively simple functions. I also used llama3.1:70b running locally via Open WebUI to help me write some of the unit tests with testthat. Starting from the code I worked out in the previous post, it took about 2 hours to get the package working, and another hour or so to write documentation and tests and set up GitHub Actions for pkgdown deployment and R CMD checks on PRs to main.
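For the CI piece, a minimal R CMD check workflow of the kind described might look like the following. This is an illustrative sketch using the standard r-lib/actions steps; the actual workflow files in the biorecap repo may differ.

```yaml
# .github/workflows/R-CMD-check.yaml (sketch; the repo's workflow may differ)
on:
  pull_request:
    branches: [main]

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
      - uses: r-lib/actions/check-r-package@v2
```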

The biorecap paper

I published a short paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.


I used two LLMs to assist me in writing the package, which itself uses an LLM to summarize research published on bioRxiv. It only makes sense to close the circle and use an LLM to help me write a preprint to publish on arXiv describing the software.

I’ll write a post soon about how to set up a llama3.1:405b-powered chatbot on HuggingFace connected to a GitHub repository, so you can ask questions about the codebase in that repo. It’s free and takes about 60 seconds or less. I started out doing this, asking for help crafting a narrative outline and introduction section for a paper based on the code in the repo. I ended up scrapping this and writing most of the first draft myself, then using GPT-4o to help with editing, tone, and length reduction. It’s hard to put my finger on exactly why, but what I ended up with still had that “sounds like it was written by ChatGPT” feeling. I did the final editing on my own, and used Mike Mahoney’s arXiv Quarto template extension to write and typeset the final document.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.