ArchR-style Iterative LSI#

epione.tl.iterative_lsi ports ArchR’s addIterativeLSI (TF–log(IDF) + randomized SVD with cluster-guided variable-feature refinement + depth-correlation filtering). It runs directly on a peak or tile AnnData — no ArchR project required — and writes the embedding to adata.obsm['X_iterative_lsi'].

Data Preparation#

We use snapatac2’s bundled pbmc5k fragment file so the tutorial is self-contained. If you’re starting from your own fragments, go through epi.pp.import_fragments (see t_integrate) and skip to the iterative_lsi step.

%matplotlib inline
%load_ext autoreload
%autoreload 2

import os, pathlib
os.environ['XDG_CACHE_HOME'] = '/scratch/users/steorra/cache'

import numpy as np
import pandas as pd
import anndata as ad
import anndataoom as oom
import scanpy as sc
import epione as epi
import matplotlib.pyplot as plt

epi.pl.plot_set()

WORK = pathlib.Path('/scratch/users/steorra/data/pbmc5k_iter_lsi')
WORK.mkdir(parents=True, exist_ok=True)
└─ 🔬 Starting plot initialization...
  ├─ Apply Scanpy/matplotlib settings
  ├─ Custom font setup
  ├─ Suppress warnings
  ├─ 
___________      .__                      
\_   _____/_____ |__| ____   ____   ____  
 |    __)_\____ \|  |/  _ \ /    \_/ __ \ 
 |        \  |_> >  (  <_> )   |  \  ___/ 
/_______  /   __/|__|\____/|___|  /\___  >
        \/|__|                  \/     \/ 

  ├─ 🔖 Version: 0.0.1rc1   📚 Tutorials: https://epione.readthedocs.io/
└─ ✅ plot_set complete.

Build a tile matrix from fragments#

snap.pp.import_fragments → QC → add_tile_matrix(bin_size=5000)select_features. Same pattern as the snapatac2 quick-start.

%%time
h5 = WORK / 'pbmc5k.h5ad'
if h5.exists() and h5.stat().st_size > 100_000:
    data = oom.read(str(h5), backed='r+')
else:
    if h5.exists():
        h5.unlink()
    frag = epi.utils.register_datasets().fetch('atac_pbmc_5k.tsv.gz')
    data = epi.pp.import_fragments(
        str(frag),
        chrom_sizes=epi.utils.genome.hg38,
        file=str(h5),
        sorted_by_barcode=False,
    )
    genes = epi.utils.get_gene_annotation(epi.utils.genome.hg38)
    epi.pp.tsse(data, genes)
    data = epi.pp.filter_cells(data, min_counts=1000, max_counts=100000, min_tsse=5)
    epi.pp.add_tile_matrix(data, bin_size=5000)
    epi.pp.select_features(data, n_features=100000)
data
CPU times: user 1.09 s, sys: 1 s, total: 2.09 s
Wall time: 2.29 s
AnnDataOOM5,166obs×606,219varsRust · out-of-memory · backed
csr_matrix uint32 · 1.9% · ~87.3 MB/chunk · 674 MB disk · pbmc5k.h5ad
obs4n_fragment · frac_dup · frac_mito · tsse
namedtypepreview
n_fragmentuint6422067, 10498, 19191, … (4657)
frac_dupfloat640.5219143358537166, 0.5345186893096262, 0.5101962685995763, … (5165)
frac_mitofloat640.0
tssefloat6431.70564634504355, 29.54060864508625, 17.706611570247937, … (5027)
var2count · selected
namedtypepreview
countfloat640.0, 15.0, 3.0, … (6370)
selectedboolFalse, True
obsm1fragment_paired
keyshapedtype
fragment_paired(5166, 3088286401)uint32
emptyvarm · obsp · varp · layers · raw

Convert to an in-memory AnnData#

iterative_lsi operates on a standard AnnData; snapatac2’s on-disk store needs to be materialized first.

# Materialise the anndataoom BackedArray to an in-memory AnnData
# (scanpy / plotting expect a plain AnnData).
from scipy import sparse
parts = []
for _, _, ch in data.X.chunked():
    parts.append(ch)
if sparse.issparse(parts[0]):
    X = sparse.vstack(parts).tocsr()
else:
    X = sparse.csr_matrix(np.vstack(parts))

obs = pd.DataFrame(data.obs).copy()
obs.index = list(data.obs_names)
var = pd.DataFrame(data.var).copy()
var.index = list(data.var_names)
adata = ad.AnnData(X=X, obs=obs, var=var)
adata.obs['n_fragment'] = np.asarray(adata.X.sum(axis=1)).ravel()

if 'selected' in adata.var.columns:
    adata = adata[:, adata.var['selected'].astype(bool).values].copy()
adata
AnnData object with n_obs × n_vars = 5166 × 100000
    obs: 'n_fragment', 'frac_dup', 'frac_mito', 'tsse'
    var: 'count', 'selected'

Iterative LSI#

Two-round iterative LSI, ArchR defaults: 25k variable features per iteration, 30 components, depth-correlated components dropped (|r|>0.75). Pre-sampling to 10k cells in the first round keeps peak memory bounded on larger datasets.

%%time
epi.tl.iterative_lsi(
    adata,
    n_components=30,
    iterations=2,
    var_features=25_000,
    resolution=0.5,
    n_neighbors=30,
    sample_cells_pre=10_000,
    depth_col='n_fragment',
    cor_cut_off=0.75,
    seed=1,
)
adata.obsm['X_iterative_lsi'].shape
└─ [iterative_lsi] Initial feature set: 99,500 / 100,000
└─ [iterative_lsi] Iter 1/2 | fit on 5,166 cells x 99,500 features
computing neighbors
finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:36)
running Leiden clustering
finished: found 13 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
└─ [iterative_lsi]   -> 13 clusters; selected 25,000 variable features for next round
└─ [iterative_lsi] Iter 2/2 | fit on 5,166 cells x 25,000 features
└─ [iterative_lsi] Done. Stored embedding (5,166 x 29) in adata.obsm['X_iterative_lsi']
CPU times: user 2min 2s, sys: 3.14 s, total: 2min 5s
Wall time: 1min 13s
(5166, 29)
adata.uns['X_iterative_lsi']['kept_dims']
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29], dtype=int32)

Clustering and UMAP#

%%time
sc.pp.neighbors(adata, use_rep='X_iterative_lsi', n_neighbors=30)
sc.tl.leiden(adata, resolution=0.5, flavor='igraph',
             directed=False, n_iterations=2, random_state=0,
             key_added='leiden')
sc.tl.umap(adata, random_state=0)
adata.obs['leiden'].nunique()
computing neighbors
finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:02)
running Leiden clustering
finished: found 10 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
computing UMAP
finished: added
    'X_umap', UMAP coordinates (adata.obsm)
    'umap', UMAP parameters (adata.uns) (0:00:15)
CPU times: user 24.8 s, sys: 132 ms, total: 24.9 s
Wall time: 17.4 s
10
fig = epi.pl.umap(
    adata,
    color='leiden',
    legend_loc='on data',
    frameon=False,
    title='pbmc5k | iterative LSI → Leiden',
    return_fig=True,
    show=False,
)
fig

Cross-check with ArchR#

Below is a one-shot side-by-side on the ArchR hematopoiesis tutorial dataset (hg19). The R block is idempotent — it reuses cached inputs / project / embedding if already present and otherwise runs addIterativeLSI with ArchR’s defaults. It is skipped automatically when ArchR is not installed.

import subprocess
RSCRIPT  = '/scratch/users/steorra/env/CMAP/bin/Rscript'
HEME     = pathlib.Path('/scratch/users/steorra/data/archr_heme')
ARCHR_OK = subprocess.run(
    [RSCRIPT, '-e', 'cat(requireNamespace("ArchR", quietly=TRUE))'],
    capture_output=True, text=True,
).stdout.strip().endswith('TRUE')
ARCHR_OK
True
if ARCHR_OK:
    HEME.mkdir(parents=True, exist_ok=True)
    (HEME / 'get_data.R').write_text(f'''
    suppressPackageStartupMessages({{ library(ArchR) }})
    set.seed(1); addArchRThreads(threads = 8); addArchRGenome("hg19")
    setwd("{HEME}")
    if (!file.exists("inputFiles.rds"))
      saveRDS(getTutorialData("Hematopoiesis"), "inputFiles.rds")
    ''')
    if not (HEME / 'inputFiles.rds').exists():
        subprocess.run([RSCRIPT, str(HEME / 'get_data.R')], check=True)

    (HEME / 'run_archr.R').write_text(f'''
    suppressPackageStartupMessages({{
      library(ArchR); library(Matrix); library(SummarizedExperiment); library(S4Vectors)
    }})
    set.seed(1); addArchRThreads(threads = 8); addArchRGenome("hg19")
    setwd("{HEME}")
    inputFiles <- readRDS("inputFiles.rds")
    inputFiles <- setNames(file.path("{HEME}", unname(inputFiles)),
                           names(inputFiles))
    if (!dir.exists("ArrowFiles")) dir.create("ArrowFiles")
    setwd("ArrowFiles")
    ArrowFiles <- createArrowFiles(
      inputFiles, names(inputFiles),
      minTSS = 4, minFrags = 1000,
      addTileMat = TRUE, addGeneScoreMat = FALSE, force = FALSE
    )
    setwd("..")
    if (!dir.exists("ArchR_proj")) {{
      proj <- ArchRProject(ArrowFiles, outputDirectory = "ArchR_proj", copyArrows = FALSE)
      proj <- addIterativeLSI(proj, useMatrix = "TileMatrix",
                              name = "IterativeLSI", iterations = 2,
                              varFeatures = 25000, dimsToUse = 1:30,
                              LSIMethod = 2, seed = 1, force = TRUE)
      saveArchRProject(proj)
    }} else proj <- loadArchRProject("ArchR_proj")
    emb <- getReducedDims(proj, "IterativeLSI")
    write.table(cbind(barcode = rownames(emb), as.data.frame(emb)),
                "archr_embedding.tsv", sep = "\t", quote = FALSE, row.names = FALSE)
    se  <- getMatrixFromProject(proj, useMatrix = "TileMatrix", binarize = TRUE)
    mat <- assay(se, "TileMatrix")
    writeMM(mat, "tile_matrix.mtx")
    writeLines(colnames(mat), "tile_barcodes.txt")
    rr  <- rowRanges(se)
    writeLines(paste0(seqnames(rr), ":", start(rr), "-", end(rr)), "tile_features.txt")
    ''')
    if not (HEME / 'archr_embedding.tsv').exists():
        subprocess.run([RSCRIPT, str(HEME / 'run_archr.R')], check=True)
    list(HEME.glob('*.tsv'))

Load ArchR’s tile matrix and run epione on the same counts#

if ARCHR_OK:
    import scipy.io as sio
    mat  = sio.mmread(HEME / 'tile_matrix.mtx').tocsr()     # tiles × cells
    bc   = pd.read_csv(HEME / 'tile_barcodes.txt', header=None).iloc[:, 0].tolist()
    feat = pd.read_csv(HEME / 'tile_features.txt', header=None).iloc[:, 0].tolist()
    ad_archr = ad.AnnData(
        X=mat.T.tocsr(),
        obs=pd.DataFrame(index=bc),
        var=pd.DataFrame(index=feat),
    )
    ad_archr.obs['n_fragment'] = np.asarray(ad_archr.X.sum(axis=1)).ravel()
    ad_archr
%%time
if ARCHR_OK:
    epi.tl.iterative_lsi(
        ad_archr, n_components=30, iterations=2,
        var_features=25_000, resolution=2.0, n_neighbors=20,
        sample_cells_pre=10_000, depth_col='n_fragment', seed=1,
    )
└─ [iterative_lsi] Initial feature set: 500,000 / 6,072,620
└─ [iterative_lsi] Iter 1/2 | fit on 10,000 cells x 500,000 features
computing neighbors
finished: added to `.uns['neighbors']`
    `.obsp['distances']`, distances for each pair of neighbors
    `.obsp['connectivities']`, weighted adjacency matrix (0:00:04)
running Leiden clustering
finished: found 19 clusters and added
    'leiden', the cluster labels (adata.obs, categorical) (0:00:00)
└─ [iterative_lsi]   -> 19 clusters; selected 25,000 variable features for next round
└─ [iterative_lsi] Iter 2/2 | fit on 10,660 cells x 25,000 features
└─ [iterative_lsi] Done. Stored embedding (10,660 x 29) in adata.obsm['X_iterative_lsi']
CPU times: user 1min 38s, sys: 4.41 s, total: 1min 43s
Wall time: 36.9 s

Quantitative comparison#

  • Procrustes disparity — how well the two embeddings align after an optimal rigid transform.

  • kNN overlap — fraction of each cell’s 20 nearest neighbours shared between the two embeddings.

  • Leiden ARI — cluster-label agreement.

if ARCHR_OK:
    from scipy.spatial import procrustes
    from sklearn.neighbors import NearestNeighbors
    from sklearn.metrics import adjusted_rand_score

    archr  = pd.read_csv(HEME / 'archr_embedding.tsv', sep='\t').set_index('barcode')
    common = ad_archr.obs_names.intersection(archr.index)
    A = ad_archr.obsm['X_iterative_lsi'][ad_archr.obs_names.get_indexer(common)]
    B = archr.loc[common].to_numpy()
    k = min(A.shape[1], B.shape[1]); A, B = A[:, :k], B[:, :k]

    _, _, disparity = procrustes(A, B)
    Na = NearestNeighbors(n_neighbors=20, metric='cosine').fit(A).kneighbors(A, return_distance=False)
    Nb = NearestNeighbors(n_neighbors=20, metric='cosine').fit(B).kneighbors(B, return_distance=False)
    overlap = np.mean([len(set(Na[i]) & set(Nb[i])) / Na.shape[1]
                       for i in range(A.shape[0])])

    tmp = ad_archr[list(common)].copy()
    tmp.obsm['X_epi']   = A
    tmp.obsm['X_archr'] = B
    def _clusters(key):
        sc.pp.neighbors(tmp, use_rep=key, key_added=key, n_neighbors=20)
        sc.tl.leiden(tmp, resolution=0.4, key_added='l_' + key,
                     neighbors_key=key, flavor='igraph', directed=False,
                     n_iterations=2, random_state=0)
        return tmp.obs['l_' + key]
    ari = adjusted_rand_score(_clusters('X_epi'), _clusters('X_archr'))

    print(f'shared cells           : {len(common):,}')
    print(f'Procrustes disparity   : {disparity:.3f}')
    print(f'kNN overlap (k=20)     : {overlap:.3f}')
    print(f'Leiden ARI             : {ari:.3f}')
computing neighbors
finished: added to `.uns['X_epi']`
    `.obsp['X_epi_distances']`, distances for each pair of neighbors
    `.obsp['X_epi_connectivities']`, weighted adjacency matrix (0:00:09)
running Leiden clustering
finished: found 5 clusters and added
    'l_X_epi', the cluster labels (adata.obs, categorical) (0:00:00)
computing neighbors
finished: added to `.uns['X_archr']`
    `.obsp['X_archr_distances']`, distances for each pair of neighbors
    `.obsp['X_archr_connectivities']`, weighted adjacency matrix (0:00:01)
running Leiden clustering
finished: found 9 clusters and added
    'l_X_archr', the cluster labels (adata.obs, categorical) (0:00:00)
shared cells           : 10,660
Procrustes disparity   : 0.713
kNN overlap (k=20)     : 0.101
Leiden ARI             : 0.610

Notes#

On the ArchR heme dataset we typically see Leiden ARI ≈ 0.6 and kNN overlap ≈ 0.1 — close cluster structure but distinct local neighbour sets. This is expected: both algorithms compute LSI on the same counts but differ in variable-feature selection (highly_variable_genes-style fixed-variance ranking vs ArchR’s per-cluster means) and SVD implementations (randomized vs irlba). Downstream biology (cluster composition, UMAP topology) is effectively equivalent.