Monday, November 25, 2024

Tech I'm thankful for (repost)

 Reposted from https://blog.stephenturner.us/p/tech-im-thankful-for-2024

Data science and bioinformatics tech I'm thankful for in 2024: tidyverse, RStudio, Positron, Bluesky, blogs, Quarto, bioRxiv, LLMs for code, Ollama, Seqera Containers, StackOverflow, ...

It’s a short week here in the US. As I reflect on the tools that shape modern bioinformatics and data science, it’s striking to see how far we’ve come in the 20 years I’ve been in this field. Today’s ecosystem is rich with tools that make our work faster, better, more enjoyable, and increasingly accessible. In this post I share some of the technology I'm particularly grateful for, from established workhorses that have transformed how we code and analyze data, to emerging platforms that are reshaping scientific communication and development workflows.

  • The tidyverse: R packages for data science. Needs no further introduction.

  • devtools + usethis + testthat: I use each of these tools at least weekly for R package development (a sketch of a typical workflow appears after this list).

  • RStudio, Positron, and VS Code: Most of the time I’m using a combination of VS Code and RStudio. My first experience with Positron was a positive one, and as several of my dealbreaker features are brought into Positron, I imagine next year it’ll be my primary IDE for all aspects of data science.

  • Bluesky. This place feels like the “old” science Twitter of the late 00s / early 2010s. I wrote a Bluesky for Science post to get you started. It’s so great to have a place for civil and good-faith discussions of new developments in science, to be able to create my own algorithmic feeds, and to create thermonuclear block/mute lists.

  • Slack communities. Many special interest groups run Slack or Discord communities open to anyone. A few that I’m a part of:

  • Blogs. Good old 2000s-era long-form blogs. I blogged regularly at Getting Genetics Done for nearly a decade. Over time, Twitter made me a lazy blogger. My posts got shorter, fewer, and further between. I’m pretty sure the same thing happened to many of the blogs I followed back then. In an age where so much content on the internet is GenAI slop, I’ve come to really appreciate long-form treatment of complex topics and deep dives into technical content. A few blogs I read regularly:

  • Quarto: The next generation of RMarkdown. I’ve used it to write papers, create reports, build entire books (blog post coming soon on this one), make interactive dashboards, and much more.

  • Zotero: I’ve been using Zotero for over 15 years, ever since it was only a Firefox browser extension. It’s the only reference manager I’m aware of that integrates with Word, Google Docs, and RStudio for citation management and bibliography generation. The PDF reader on the iPad has everything I want and nothing I don’t: I can highlight and mark up a PDF and have those annotations sync across all my devices. Zotero is free and open-source, with lots of plugins that extend its functionality, like this one for connecting with Inciteful.

  • bioRxiv: bioRxiv launched about 10 years ago and gains more traction in the life sciences community every year. And attitudes around preprints today are so different from what they were in 2014 (“but what if I get scooped?”).

  • LLMs for code: I use a combination of GitHub Copilot, GPT-4o, Claude 3.5 Sonnet, and several local LLMs to aid my development these days.

  • Seqera Containers: I’m not a Seqera customer, and I don’t (yet) use Seqera Containers URIs in my production code, but this is an amazing resource that I use routinely for creating Docker images with multiple tools I want. I just search for and add tools, and I get back a Dockerfile and a conda.yml file I can use to build my own image.

  • Ollama: I use Ollama to interact with local open-source LLMs on my MacBook Pro, for cases where privacy and security are of utmost concern (see the sketch after this list).

  • StackOverflow: SO used to live in my browser’s bookmarks bar. I estimate my SO usage is down 90% from what it was in 2022. However, none of the LLMs for code would be what they are today without the millions of questions asked and answered on SO over the years. I’m not sure what this means for the future of SO, or for LLMs that rely on good training data.
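As a rough illustration of that devtools/usethis/testthat loop, here’s a minimal sketch of starting a package and iterating on it. The package name (mypkg) and function name (parse_vcf) are hypothetical placeholders, not anything from a real project.

library(devtools)
library(usethis)

# Scaffold a new package (the path is a placeholder)
create_package("~/projects/mypkg")

# Create R/parse_vcf.R and a matching test file
use_r("parse_vcf")
use_testthat()
use_test("parse_vcf")

# The development loop: load, test, check
load_all()
test()
check()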
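And here’s a minimal sketch of talking to a local model through Ollama’s REST API (it listens on http://localhost:11434) from R using httr2. This assumes Ollama is running and that you’ve already pulled the model named below; swap in whatever model you actually use.

library(httr2)

resp <- request("http://localhost:11434/api/generate") |>
  req_body_json(list(
    model  = "llama3.1",  # placeholder; use any model you've pulled
    prompt = "Write an R function to reverse-complement a DNA sequence.",
    stream = FALSE        # return a single JSON object, not a stream
  )) |>
  req_perform() |>
  resp_body_json()

# The generated text is in the response field
cat(resp$response)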

Wednesday, November 20, 2024

Expand your Bluesky network with R (repost)

 This is reposted from the original at https://blog.stephenturner.us/p/expand-your-bluesky-network-with-r.

---

I’m encouraging everyone I know online to join the scientific community on Bluesky (see my earlier Bluesky for Science post).

In that post I link to several starter packs — lists of accounts posting about a topic that you can follow individually or all at once to start filling out your network.

I started following accounts of people I knew from X and from a few starter packs I came across. One way to expand your network further is to take all the accounts you follow and see who they follow that you don’t. You can rank this list in descending order by the number of your follows who follow them, and use it to fill out your network.

Let’s do this with just a few lines of code in R. The atrrr package (CRAN, GitHub, docs) is one of several packages that wrap the AT Protocol behind Bluesky, allowing you to interact with Bluesky through a set of R functions. It’s super easy to use and the docs are great.

The code below does this. It first authenticates with an app password, then retrieves all the accounts you follow. Next, it gets who all those accounts follow and removes the accounts you already follow.

library(dplyr)
library(atrrr)

# Authenticate first (switch out with your username)
bsky_username <- "youraccount.bsky.social"

# If you already have an app password:
bsky_app_pw <- "change-me-change-me-123"
auth(user=bsky_username, password=bsky_app_pw)

# Or, run auth() with no arguments to be guided through the process:
# auth()

# Get the people you follow
f <- get_follows(actor=bsky_username, limit=Inf)

# Get just their handles
fh <- f$actor_handle

# Get who your follows are following
ff <-
  fh |>
  lapply(get_follows, limit=Inf) |>
  setNames(fh)

# Make it a data frame
ffdf <- bind_rows(ff, .id="follow")

# Get counts, removing ppl you already follow
ffcounts <-
  ffdf |>
  count(actor_handle, sort=TRUE) |>
  anti_join(f, by="actor_handle") |>
  filter(actor_handle!="handle.invalid")

# Join back to account info, add URL
ffcounts <-
  ffdf |>
  distinct(actor_handle, actor_name) |>
  inner_join(x=ffcounts, y=_, by="actor_handle") |>
  mutate(url=paste0("https://bsky.app/profile/",
                    actor_handle))

This returns a data frame of all the accounts followed by the people you follow, but who you don’t already follow, descending by the number of accounts you follow who follow them (mouthful right there).

Optionally, you can make this nicer by using the gt package to create a table with clickable links.

# Optional, clean up and create a nice table
library(gt)
library(glue)
top <- 20L
ffcounts |>
  head(top) |>
  rename(Handle=actor_handle, N=n, Name=actor_name) |>
  mutate(Handle=glue("[{Handle}]({url})")) |>
  mutate(Handle=lapply(Handle, gt::md)) |>
  select(-url) |>
  gt() |>
  tab_header(
    title=md(glue("**My top {top} follows' follows**")),
    subtitle="Collected November 19, 2024") |>
  tab_style(
    style=cell_text(weight="bold"),
    locations=cells_column_labels()
  ) |>
  cols_align(align="left") |>
  opt_row_striping(row_striping = TRUE)

I can’t embed an HTML file here, but here’s what that output looks like. You can click any one of the names and follow the account if you find it useful.

Maybe you do this iteratively: add your top follows’ follows, then rerun the process a few times to discover second-degree connections you didn’t know about (see the sketch below).
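Here’s a sketch of what that iteration might look like: it just wraps the steps above into a reusable function you can call again after following some of the suggestions. This is only a refactoring of the code above; actually following accounts still happens in the Bluesky app.

library(dplyr)
library(atrrr)

# Wrap the pipeline above into a function
suggest_follows <- function(username) {
  f  <- get_follows(actor=username, limit=Inf)
  fh <- f$actor_handle
  fh |>
    lapply(get_follows, limit=Inf) |>
    setNames(fh) |>
    bind_rows(.id="follow") |>
    count(actor_handle, sort=TRUE) |>
    anti_join(f, by="actor_handle") |>
    filter(actor_handle!="handle.invalid")
}

# Follow some suggestions in the app, then rerun:
suggestions <- suggest_follows(bsky_username)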

The code here essentially replicates what @theo.io’s Bluesky Network Analyzer does, but locally using R. That web app is faster and easier to use, and does some smart caching and throttling to avoid API rate limits.

Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.