Curate & link bulk RNA-seq datasets#
Show code cell content
!lamin init --storage test-bulkrna --schema bionty
import lamindb as ln
from pathlib import Path
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
Let’s start simulating the output of a bulk RNA-seq pipeline:
Show code cell content
# pretned we're running a bulk RNA-seq pipeline
ln.track(ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.dev.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts()
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.File(path, description="Merged Bulk RNA counts").save()
Curate the count matrix#
ln.track()
Let’s query the file:
file = ln.File.filter(description="Merged Bulk RNA counts").one()
df = file.load()
If we look at it, we realize it deviates far from the tidy data standard [Wickham14], conventions of statistics & machine learning [Hastie09, Murphy12] and the major Python & R data packages.
Variables aren’t in columns and observations aren’t in rows:
df
Let’s change that and move observations into rows:
df = df.T
df
Now, it’s clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.
Let’s create an AnnData object to model this. First, create a dataframe for the variables:
var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
Now, let’s create an AnnData:
# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
The AnnData object is in tidy form and complies with conventions of statistics and machine learning:
adata.to_df()
Validate & register the count matrix#
Let’s create a File object from this AnnData. Because this will validate the gene IDs and these are only defined given a species, we have to set a species context:
lb.settings.species = "saccharomyces cerevisiae"
Almost all gene IDs are validated:
curated_file = ln.File.from_anndata(
adata, description="Curated bulk RNA counts", var_ref=lb.Gene.stable_id
)
Hence, let’s save this file:
curated_file.save()
Example queries#
We have two files in the file registry:
ln.File.filter().df()
curated_file.view_lineage()
Let’s by query by gene: …
Show code cell content
!lamin delete test-bulkrna
!rm -r test-bulkrna