Curate & link bulk RNA-seq datasets#

import lamindb as ln
from pathlib import Path
import lnschema_bionty as lb
import pandas as pd
import anndata as ad

Let’s start simulating the output of a bulk RNA-seq pipeline:

Curate the count matrix#

ln.track()

Let’s query the file:

file = ln.File.filter(description="Merged Bulk RNA counts").one()

df = file.load()

If we look at it, we realize it deviates far from the tidy data standard [Wickham14], conventions of statistics & machine learning [Hastie09, Murphy12] and the major Python & R data packages.

Variables aren’t in columns and observations aren’t in rows:

df

Let’s change that and move observations into rows:

df = df.T

df

Now, it’s clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])

var.head()

Now, let’s create an AnnData:

# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)

adata

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()

Validate & register the count matrix#

Let’s create a File object from this AnnData. Because this will validate the gene IDs and these are only defined given a species, we have to set a species context:

lb.settings.species = "saccharomyces cerevisiae"

Almost all gene IDs are validated:

curated_file = ln.File.from_anndata(
    adata, description="Curated bulk RNA counts", var_ref=lb.Gene.stable_id
)

Hence, let’s save this file:

curated_file.save()

Example queries#

We have two files in the file registry:

ln.File.filter().df()

curated_file.view_lineage()

Let’s by query by gene: …