Saturday, August 24, 2024

biorecap: an R package for summarizing bioRxiv preprints with a local LLM

This is re-posted from my newsletter, where I'll be posting from now on:

https://blog.stephenturner.us/p/biorecap-r-package-for-summarizing-biorxiv-preprints-local-llm



---

TL;DR

  • I wrote an R package that summarizes recent bioRxiv preprints using a locally running LLM via Ollama+ollamar, and produces a summary HTML report from a parameterized RMarkdown template. The package is on GitHub and you can install it with devtools: https://github.com/stephenturner/biorecap.

  • I published a paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.

  • I wrote both the package and the paper with assistance from LLMs: GitHub Copilot for documentation, llama3.1:70b for tests, llama3.1:405b via HuggingFace assistants for drafting, and GPT-4o for editing.

The biorecap package

I recently started to explore prompting a local LLM (e.g. Llama3.1) from R via Ollama. Last week I wrote about how to do this, with a few contrived examples: (1) trying to figure out what’s interesting about a set of genes, and (2) summarizing a set of preprints retrieved from bioRxiv’s RSS feed. The gene set analysis really was contrived — as I mentioned in the footnote to that post, you’d never actually want to do a gene set analysis this way when there are plenty of well-understood first-principles approaches to gene set analysis.

The second example wasn’t so contrived. I subscribe to bioRxiv’s RSS feeds, along with feeds from many other journals and blogs in genetics, bioinformatics, data science, and synthetic biology. The fusillade of new preprints and peer-reviewed papers relevant to my work is relentless. Late last year bioRxiv started adding AI summaries to newly published preprints, but reading one required multiple clicks: out of my RSS feed onto the paper’s landing page, then another click into the AI summary. I wanted a quick TL;DR on all the recent papers published in particular subject areas (e.g., bioinformatics, genomics, or synthetic biology).

Shortly after putting together that one-off demo, on a sultry Sunday afternoon in Virginia too hot to do anything outside, I took the code from that post, generalized it a bit, and created the biorecap package: https://github.com/stephenturner/biorecap.

You can install it with devtools/remotes:


remotes::install_github("stephenturner/biorecap", 
                        dependencies=TRUE)


Create a report with 2-sentence summaries of recent papers published in the bioinformatics, genomics, and synthetic biology sections, using llama3.1:8b running locally on your laptop:


my_subjects <- c("bioinformatics", "genomics", "synthetic_biology")

biorecap_report(output_dir=".", 
                subject=my_subjects, 
                model="llama3.1")
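biorecap sends its prompts to a locally running Ollama server through ollamar, so the model needs to be available before you generate a report. A minimal setup sketch, assuming Ollama is already installed and running:

```r
library(ollamar)

# Check that the local Ollama server is reachable
test_connection()

# Download the default 8B llama3.1 model if it isn't present yet
pull("llama3.1")
```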


This report will look different each time you run it, depending on what’s been published that day in the sections you’re interested in.

Figure 1 from Turner 2024 arXiv: Example biorecap report for bioinformatics, genomics, and synthetic biology from August 6, 2024.


The package documentation provides further instructions on usage.

I mentioned I used LLMs to help write this package. I started out using Positron for package development, but quickly fell back to RStudio because I wanted GitHub Copilot integration. Among other things, Copilot is great for quickly writing Roxygen documentation for relatively simple functions. I also used llama3.1:70b running locally via Open WebUI to help me write some of the unit tests using testthat. Starting from the code I worked out in the previous post, it took about 2 hours to get the package working, and another hour or so to write documentation and tests and set up GitHub Actions for pkgdown deployment and R CMD checks on PRs to main.
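To give a flavor of what those LLM-drafted tests look like, here is a testthat sketch; build_summary_prompt() is a hypothetical stand-in for illustration, not the actual biorecap API:

```r
library(testthat)

# Hypothetical helper standing in for biorecap's prompt construction;
# the real package's internal function names may differ.
build_summary_prompt <- function(title, abstract) {
  paste("Summarize this paper in 2 sentences.",
        "Title:", title, "Abstract:", abstract)
}

test_that("prompt contains the title and abstract", {
  p <- build_summary_prompt("A title", "An abstract")
  expect_true(grepl("A title", p, fixed = TRUE))
  expect_true(grepl("An abstract", p, fixed = TRUE))
})
```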

The biorecap paper

I published a short paper about the package on arXiv: Turner, S. D. (2024). biorecap: an R package for summarizing bioRxiv preprints with a local LLM. arXiv, 2408.11707. https://doi.org/10.48550/arXiv.2408.11707.


I used two LLMs to assist me in writing the package, which itself uses an LLM to summarize research published on bioRxiv. It only makes sense to close the circle and use an LLM to help me write a preprint to publish on arXiv describing the software.

I’ll write a post soon about how to set up a llama3.1:405b-powered chatbot on HuggingFace connected to a GitHub repository, so you can ask questions about the codebase in that repo. It’s free and takes about a minute or less. I started out doing this, asking for help crafting a narrative outline and introduction section for a paper based on the code in the repo. I ended up scrapping that and writing most of the first draft myself, then using GPT-4o to help with editing, tone, and length reduction. It’s hard to put my finger on exactly why, but what I ended up with still had that “sounds like it was written by ChatGPT” feel. I did the final editing on my own, and used Mike Mahoney’s arXiv Quarto template to write and typeset the final document.

Wednesday, August 14, 2024

Use R to prompt a local LLM with ollamar

This is reposted from the original article: 

https://blog.stephenturner.us/p/use-r-to-prompt-a-local-llm-with

Use R to prompt a local LLM with ollamar: Using R to prompt llama3.1:70b running on my laptop with Ollama + ollamar to tell me what's interesting about a set of genes, and to summarize recent bioRxiv preprints



--------------------

I’ve been using the llama3.1:70b model just released by Meta, running on my MacBook Pro with Ollama. Ollama makes it easy to talk to a locally running LLM in the terminal (ollama run llama3.1:70b) or via a familiar GUI with the open-webui Docker container.

Here I’ll demonstrate using the ollamar package on CRAN to talk to an LLM running locally on my Mac: first asking llama3.1:70b what it thinks about a set of genes, then asking the smaller+faster 8B model to summarize recent preprints published on bioRxiv’s Scientific Communication and Education channel.

Tell me what’s interesting about these genes

Genes involved in the G2/M checkpoint

First I’ll use the msigdbr R package to pull from MSigDB the gene symbols for all the genes involved in the G2/M checkpoint of the cell cycle.


library(msigdbr)
hm <- msigdbr(species="human", category="H")
gs <- unique(hm[hm$gs_name=="HALLMARK_G2M_CHECKPOINT",]$gene_symbol)

Next, I’ll load the ollamar package, and test the connection to the Ollama server.

library(ollamar)
test_connection()

If all goes well you should see something like the following:

Ollama local server running
<httr2_response>
GET http://localhost:11434/
Status: 200 OK
Content-Type: text/plain
Body: In memory (17 bytes)


The ollamar README has good documentation on how to pull models, list which models you have available, etc.
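For example, listing the models already downloaded to the local server looks like this (functions from the ollamar README; the output depends on what you’ve pulled):

```r
library(ollamar)

# Data frame of models available on the local Ollama server
list_models()

# Pull a new model by name, e.g.:
# pull("llama3.1")
```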


Next, I’ll collapse the gene symbol vector I created above into a single string (“CDK1, SMC2, POLA2, …”), and create a prompt with this list of gene symbols tacked onto the end. Finally, I’ll pass that prompt to the generate() function to query the llama3.1:70b model, and print the results.


# Collapse the symbol vector to a string ("GENE1, GENE2, GENE3")
gs_string <- paste(gs, collapse=", ")

# Create a prompt
prompt <- paste("I'm going to give you some gene symbols.
                Anything interesting about this gene set? ",
                gs_string)

# Prompt the model
res <- generate("llama3.1:70b", prompt=prompt, output="text")

# Print the results
cat(res$response)


The model returns markdown text, which looks nice after rendering with RMarkdown/Quarto:

What a massive list of gene symbols!

After analyzing the list, I noticed several interesting aspects:

  1. Cell cycle and DNA replication: Many genes are involved in cell cycle regulation (e.g., CDK1, CDK4, CDC20, CDC25A, CDC45), DNA replication (e.g., ORC5, ORC6, MCM2, MCM3, MCM5), and chromosome segregation (e.g., BUB1, BUB3, CENPA, CENPE).

  2. Apoptosis and cell death: Several genes are associated with apoptosis (programmed cell death) or regulation of cell death pathways, including CASP8AP2, BIRC5, and PML.

  3. Cancer-related genes: A significant number of genes have been implicated in cancer development or progression, such as MYC, CCND1, CCNE1, CDC25A, and PLK1. Some genes are tumor suppressors (e.g., BRCA2, TP53), while others may promote tumorigenesis.

  4. Chromatin modification and epigenetics: Genes like EZH2, CUL4A, and KMT5A are involved in chromatin modification, histone regulation, or other epigenetic processes that influence gene expression.

  5. Transcriptional regulation: Many genes encode transcription factors (e.g., E2F1, MYC, HIF1A) or co-factors that regulate gene expression, including some with roles in cell cycle progression and apoptosis.

  6. Nuclear structure and function: Genes like LBR, LIG3, and NUP50 are involved in nuclear envelope organization, pore complex formation, or nucleocytoplasmic transport.

These observations suggest that this gene set is enriched for genes involved in fundamental cellular processes such as cell cycle regulation, DNA replication, apoptosis, chromatin modification, and transcriptional control. This might indicate a specific biological context or experimental condition under which these genes are co-expressed.

Would you like to know more about any of these aspects or explore further?

Random genes

For kicks I did this with a random selection of 50 genes from the human genome:


allgenesets <- msigdbr(species="Homo sapiens")
allgenes <- unique(allgenesets$gene_symbol)
set.seed(42)
gs_random <- sample(allgenes, 50)

Llama3.1 is doing a good job HARKing about what these genes could be doing:

After analyzing the list, I noticed a few interesting things:

  1. High proportion of pseudogenes: About 20-25% of the genes on your list are labeled as "pseudogenes" (e.g., RN7SL677P, RPL27AP3, HYAL6P). Pseudogenes are inactive copies of functional genes that have lost their ability to code for proteins. They can still be transcribed and play roles in regulating gene expression.

  2. Ribosomal protein pseudogenes: Many of the pseudogenes on your list (e.g., RPL27AP3, RN7SL677P) appear to be related to ribosomal proteins, which are essential for protein synthesis.

  3. MicroRNAs and small nuclear RNAs: You have several microRNA genes (MIRs) and small nuclear RNA (snRNA) genes on your list (e.g., MIR4444-2, RNU7-160P). These types of non-coding RNAs play crucial roles in regulating gene expression.

  4. Tyrosine phosphatase and kinases: The presence of PTPN7 (a tyrosine phosphatase) and TYK2 (a Janus kinase) on your list suggests a possible connection to signaling pathways involved in cell growth, differentiation, or immune responses.

  5. Vascular endothelial growth factor A (VEGFA): This gene is crucial for angiogenesis (the formation of new blood vessels). Its presence on your list might indicate a role in vascular development or disease.

These observations are just a starting point, and further analysis would be needed to determine the significance of these genes in a specific biological context. Do you have any additional information about how this gene set was obtained or what type of study it's related to?

Summarize the latest papers on bioRxiv

Above I created a single prompt with a single call to generate(). Here I’ll create many prompts based on data in columns of a data frame, and purrr::map() the generate() function over all of those prompts to create a new column of responses in my data frame.
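The shape of that pattern looks like this, with a stub in place of the actual generate() call so the toy example runs without an LLM:

```r
library(purrr)

# Toy data frame standing in for the parsed RSS feed
df <- data.frame(title    = c("Paper A", "Paper B"),
                 abstract = c("First abstract.", "Second abstract."))

# One prompt per row, built from the title and abstract columns
df$prompt <- paste("Summarize this paper in 2 sentences.",
                   "Title:", df$title, "Abstract:", df$abstract)

# map_chr() applies a function to each prompt and returns a character
# vector; in the real example below this function calls generate()
df$response <- map_chr(df$prompt, \(x) paste("Summary of:", substr(x, 1, 20)))
```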


I’ll demonstrate this by asking for a two-sentence summary of the most recent bioRxiv preprints. I’ll start by pulling the latest titles and abstracts from the Scientific Communication and Education channel on bioRxiv, using the tidyRSS package to pull this information directly from the bioRxiv RSS feed. This returns a data frame with the title, abstract, and other information about the most recent preprints in the feed.


library(tidyRSS)
library(tidyverse)

# Parse the feed
feed <- tidyfeed("https://connect.biorxiv.org/biorxiv_xml.php?subject=scientific_communication_and_education")

# Show a few titles
head(feed$item_title, 3)

# Show a few abstracts
head(feed$item_description, 3)


Here are the latest three titles (as of August 3, 2024). The abstracts that go with them (not shown) are in the feed$item_description column.


[1] "\nBiological changes, political ideology, and scientific communication shape human perceptions of pollen seasons \n" 
[2] "\nAn updated and expanded characterization of the biological sciences academic job market \n"                        
[3] "\n\"I'd like to think I'd be able to spot one if I saw one\": How science journalists navigate predatory journals \n"


Next, I’ll take the first 20 items in the feed, pull out the title and abstract with select(), remove leading and trailing whitespace with the first mutate(), construct a prompt with the second mutate(), and generate a response from llama3.1 in the last mutate() via purrr::map_chr(). Here I’m using the smaller/faster llama3.1:8b model (the default unless you specify :70b).


summarized <-
  feed |>
  head(20) |>
  select(title=item_title, abstract=item_description) |>
  mutate(across(everything(), trimws)) |>
  mutate(prompt=paste(
    "\n\nI'm going to give you a paper's title and abstract.",
    "Can you summarize this paper in 2 sentences?",
    "\n\nTitle: ", title, "\n\nAbstract: ", abstract)) |>
  mutate(response=purrr::map_chr(prompt, \(x) 
                                 generate("llama3.1",
                                          prompt=x,
                                          output="text")$response))


This is an example prompt, which will get run for every title and abstract in the feed:


I'm going to give you a paper's title and abstract. Can you summarize this paper in 2 sentences? 

Title:  An updated and expanded characterization of the biological sciences academic job market 

Abstract:  In the biological sciences, many areas of uncertainty exist regarding the factors that contribute to success within the faculty job market. Earlier work from our group reported that beyond certain thresholds, academic and career metrics like the number of publications, fellowships or career transition awards, and years of experience did not separate applicants who received job offers from those who did not. Questions still exist regarding how academic and professional achievements influence job offers and if candidate demographics differentially influence outcomes. To continue addressing these gaps, we initiated surveys collecting data from faculty applicants in the biological sciences field for three hiring cycles in North America (Fall 2019 to the end of May 2022), a total of 449 respondents were included in our analysis. These responses highlight the interplay between various scholarly metrics, extensive demographic information, and hiring outcomes, and for the first time, allowed us to look at persons historically excluded due to ethnicity or race (PEER) status in the context of the faculty job market. Between 2019 and 2022, we found that the number of applications submitted, position seniority, and identifying as a women or transgender were positively correlated with a faculty job offer. Applicant age, residence, first generation status, and number of postdocs, however, were negatively correlated with receiving a faculty job offer. Our data are consistent with other surveys that also highlight the influence of achievements and other factors in hiring processes. Providing baseline comparative data for job seekers can support their informed decision-making in the market and is a first step towards demystifying the faculty job market.


Now the response column in my new table has all the responses from the LLM. I can now put the title and summary in a table and render it with pandoc.


summarized |> 
  select(title, response) |> 
  mutate(response=gsub("\n", " ", response)) |> 
  knitr::kable()
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.