A/B compartment calling#

This notebook demonstrates how to use ArcFISH to call A/B compartments from the chromatin-tracing data of Su et al. (2020). The subset used here can be downloaded from Zenodo repository 3928890 under the file name chromosome2_p-arm_replicate.tsv.

[1]:
import matplotlib.pyplot as plt
import arcfish as af

Since the original file is not in the 4DN FOF-CT core format, we prepend two header lines and convert it to a CSV. The name of this new file is chromosome2_p-arm_replicate.csv.
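The conversion described above can be sketched as follows. This is a minimal illustration, not part of ArcFISH: it assumes the TSV columns already match the FOF-CT core column order shown in the header, and the helper name `tsv_to_fofct_csv` is our own.

```python
import csv

# The two header lines required by the 4DN FOF-CT core format,
# matching the converted file shown below.
FOFCT_HEADER = [
    "##Table_namespace=4dn_FOF-CT_core",
    "##columns=(Spot_ID,Trace_ID,X,Y,Z,Chrom,Chrom_Start,Chrom_End)",
]

def tsv_to_fofct_csv(tsv_path, csv_path):
    """Prepend the FOF-CT header lines and rewrite tab-separated rows as CSV."""
    with open(tsv_path) as src, open(csv_path, "w", newline="") as dst:
        for line in FOFCT_HEADER:
            dst.write(line + "\n")
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)
```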

Let’s read the first five lines to inspect the structure of the converted FOF-CT core file.

[2]:
with open("chromosome2_p-arm_replicate.csv", "r") as f:
    lines = "".join([f.readline() for _ in range(5)])
print(lines)
##Table_namespace=4dn_FOF-CT_core
##columns=(Spot_ID,Trace_ID,X,Y,Z,Chrom,Chrom_Start,Chrom_End)
0,1,26382.0,15407.0,4659.0,chr2,1,50001
1,1,26238.0,15347.0,3963.0,chr2,250001,300001
2,1,25270.0,15113.0,4220.0,chr2,500001,550001

Load and preprocess the data#

Create a FOF_CT_Loader object for the converted file and check which chromosomes are available.

[3]:
loader = af.pp.FOF_CT_Loader("chromosome2_p-arm_replicate.csv",
                             nm_ratio={c: 1000 for c in "XYZ"})
loader.chr_ids
[3]:
array(['chr2'], dtype=object)

Load the data for chromosome 2 and subset to the first 100 cells for faster computation.

[4]:
adata = loader.create_adata("chr2")
adata = adata[:100,:].copy()
adata
[4]:
AnnData object with n_obs × n_vars = 100 × 357
    var: 'Chrom_Start', 'Chrom_End'
    uns: 'Chrom'
    layers: 'X', 'Y', 'Z'

Calculate the median pairwise distance matrix, which will be stored in the adata.varp slot.

[5]:
af.pp.median_pdist(adata, inplace=True)
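Conceptually, the median pairwise distance matrix is the per-locus-pair Euclidean distance, taken as the median over cells while ignoring missing spots. A minimal NumPy sketch of this idea (not ArcFISH's implementation, which runs on Dask):

```python
import numpy as np

def median_pairwise_distance(coords):
    """coords: (n_cells, n_loci, 3) array of XYZ positions, NaN for missing spots.

    Returns an (n_loci, n_loci) matrix of median distances across cells.
    """
    # Pairwise coordinate differences within each cell.
    diffs = coords[:, :, None, :] - coords[:, None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # (n_cells, n_loci, n_loci)
    # Median over cells, skipping NaNs from missing detections.
    return np.nanmedian(dist, axis=0)
```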

Apply filtering and normalization to the data before compartment calling.

[6]:
af.pp.filter_normalize(adata)
[7]:
adata
[7]:
AnnData object with n_obs × n_vars = 100 × 357
    var: 'Chrom_Start', 'Chrom_End'
    uns: 'Chrom'
    layers: 'X', 'Y', 'Z'
    varp: 'med_dist', 'raw_var_X', 'raw_var_Y', 'raw_var_Z', 'count_X', 'count_Y', 'count_Z', 'var_X', 'var_Y', 'var_Z'

A/B compartment calling#

Create an ABCaller object with minimum compartment size of 0 and run compartment calling on the processed data.

[8]:
caller = af.tl.ABCaller(min_cpmt_size=0)
cpmt_result = caller.call_cpmt(adata)
cpmt_result.head()
[8]:
c1 s1 e1 cpmt eig_x eig_y eig_z wteig_x wteig_y wteig_z
0 chr2 1 50001 0 -0.064820 -0.050931 0.018169 -0.016317 -0.015034 0.008232
1 chr2 250001 300001 1 -0.015068 0.000185 0.013013 -0.003793 0.000055 0.005896
2 chr2 500001 550001 1 0.027947 0.061568 0.021069 0.007035 0.018174 0.009546
3 chr2 750001 800001 1 0.042551 0.016878 0.024852 0.010711 0.004982 0.011260
4 chr2 1000001 1050001 1 0.010925 0.039898 0.016105 0.002750 0.011777 0.007297
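The eigenvector columns in the result hint at the general strategy: like Hi-C compartment calling, A/B assignment can be derived from the sign of the leading eigenvector of the correlation of the locus-by-locus matrix. The sketch below shows that generic idea only; it is not ArcFISH's exact algorithm, and the function name `call_ab` is our own.

```python
import numpy as np

def call_ab(dist):
    """Toy A/B caller: sign of the leading eigenvector of the
    row-correlation of a pairwise distance matrix."""
    corr = np.corrcoef(dist)
    vals, vecs = np.linalg.eigh(corr)
    ev1 = vecs[:, np.argmax(vals)]       # leading eigenvector
    # Sign split gives two compartments; which one is "A" is arbitrary
    # and in practice needs an orientation step (e.g. gene density).
    return (ev1 > 0).astype(int)
```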

Visualize the results by creating a heatmap of the distance matrix with compartment bars overlaid to show the A/B compartment structure.

[9]:
fig, ax = plt.subplots(figsize=(4, 3))
af.pl.pairwise_heatmap(adata.varp["med_dist"], ax=ax)
af.pl.cpmt_bars(cpmt_result["cpmt"], ax=ax)
../_images/tutorial_cpmt_19_0.png

Speeding up computation with Dask#

On a compute cluster, the heavy steps can be sped up by distributing them across multiple nodes. This is possible because ArcFISH uses Dask for its computations in the background.

The following demonstrates how to do this on a SLURM cluster. For other cluster types, please refer to Dask’s official documentation.

[ ]:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
cluster = SLURMCluster(
    queue="general",
    cores=10,
    memory="32 GB",
    log_directory="../log"
)
cluster.scale(jobs=20)
client = Client(cluster)
[ ]:
adata = loader.create_adata("chr2")
af.pp.filter_normalize(adata)
[ ]:
cluster.close()