3DGenBench

Tutorial

3DGenBench is a web server for scoring the performance of 3D genomic models. It provides two challenges. The first challenge quantifies how accurately a model predicts experimental data. The second challenge estimates how well a model predicts changes in chromosome folding caused by structural genomic mutations.

Overview

There are five steps required to obtain 3DGenBench scores:

  1. Explore reference Hi-C dataset
  2. Generate computational predictions of Hi-C contacts for one or multiple samples (see example data)
  3. Upload your predictions to the 3DGenBench server
  4. Provide sample metadata and compute metrics
  5. Explore metrics

Step 1. Explore Hi-C Dataset

There are two main dataset types for prediction. The rearrangement dataset (hereafter Paired) contains capture Hi-C and complementary epigenetic data, such as CTCF ChIP-seq, for wild-type and mutated samples (Homo sapiens, Mus musculus, Drosophila melanogaster cell lines). The genomic region dataset (hereafter Single) contains loci for prediction that are larger than 10 Mbp and have no known chromosome rearrangements. Both datasets can be found here.

Sample metadata include the following information:

  • The chr, start prediction, and end prediction columns describe the genomic region for which Hi-C interactions should be predicted. The start prediction bin corresponds to the interval from (start_prediction - resolution) to start_prediction; the same rule applies to the end prediction bin.
  • The Rearr #n Start and Rearr #n end columns describe the rearrangement coordinates. A sample has several such columns if multiple simultaneous mutations were found in the region.
  • The Rearrangement Type column gives the rearrangement type.
  • Also pay attention to the sample cell type, the genome assembly used, and the available Hi-C map resolutions (5, 10, 20, 25, or 50 kb).
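The bin convention described above can be sketched in Python. This is only an illustration: the function names are hypothetical, and the assumption that both the start and the end prediction bins are part of the region follows from the description above rather than from the server code.

```python
def prediction_bin(coordinate: int, resolution: int) -> tuple[int, int]:
    """Return the (start, end) interval of the bin labelled by `coordinate`.

    Per the convention above, a bin is labelled by its END coordinate,
    so it spans from (coordinate - resolution) to coordinate.
    """
    if resolution <= 0:
        raise ValueError("resolution must be positive")
    return coordinate - resolution, coordinate


def region_bin_labels(start_prediction: int, end_prediction: int,
                      resolution: int) -> list[int]:
    """List the bin labels of a region, assuming (hypothetically) that both
    the start and the end prediction bins are included."""
    return list(range(start_prediction, end_prediction + 1, resolution))
```

For example, with a hypothetical start_prediction of 1000000 at 5 kb resolution, prediction_bin(1000000, 5000) gives the interval 995000-1000000.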

Hi-C maps for wild-type and mutated conditions are available in the most commonly used formats: hic, cool (for 5, 10, 20, 25, or 50 kb resolution), and pairs. For most datasets we also provide supplementary tracks describing CTCF binding. All the data can be downloaded via hyperlinks in the table. If you need all available Hi-C data for one particular sample, follow the links in the WT Archived Data or MUT Archived Data columns. You can also explore the dataset folder on our local FTP storage using the hyperlinks in the WT FTP Folder or MUT FTP Folder columns.

The detailed description of files can be found here.

If you need to download the entire Hi-C dataset, use the command:

wget -r -np https://genedev.bionet.nsc.ru/hic_out/by_Project/INC_COST_3DBenchmark/hic_dataset_zipped/

Also, you can download CTCF data in narrowPeak data format using links in CTCF Data column. These files have 2 additional columns with information about CTCF binding site orientation calculated using GimmeMotifs.

If you want to download the entire CTCF dataset, use the command:

wget -r -np https://genedev.bionet.nsc.ru/hic_out/by_Project/INC_COST_3DBenchmark/CTCF_data/

Step 2. Predict Hi-C Contacts or Insulation Score Data

Use your computational model to predict Hi-C contacts or insulation score data for one of the reference samples.

Hi-C Contacts Input

The predicted list of contacts should be provided as a tab-separated values (TSV) file which contains the following columns:

chr	contact_start	contact_end	contact_count

Here contact_start labels the bin that covers the interval from (contact_start - resolution) to contact_start; the same applies to contact_end.

An example file can be downloaded here.
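As a rough illustration of the four-column layout, the following Python sketch formats predicted contacts as tab-separated rows. The function name is hypothetical. Whether a header row is expected is not stated here, so the sketch writes none; compare with the example file before uploading.

```python
def format_contacts_tsv(chrom, contacts):
    """Format contacts as TSV text with the columns
    chr, contact_start, contact_end, contact_count.

    `contacts` is an iterable of (contact_start, contact_end, contact_count),
    where each coordinate is a bin label (the END of the bin it refers to).
    No header row is written; that is an assumption of this sketch.
    """
    lines = []
    for contact_start, contact_end, contact_count in contacts:
        lines.append(
            f"{chrom}\t{int(contact_start)}\t{int(contact_end)}\t{contact_count}"
        )
    return "\n".join(lines) + "\n"


def write_contacts_tsv(path, chrom, contacts):
    """Write the formatted contacts to `path`."""
    with open(path, "w", newline="") as handle:
        handle.write(format_contacts_tsv(chrom, contacts))
```

The coordinates and counts passed in would come from your own model; the values used to try the function out are placeholders, not reference data.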

Insulation Scores Input

The predicted insulation score track should be provided as a BedGraph file without header (technically a BedGraph-like TSV file). Columns are the following:

chrom	chrom_start	chrom_end	insulation_score

For the Paired benchmark, two predicted tracks should be provided: one for the WT sample and one for the mutated sample.
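A header-less BedGraph-like track can be produced with a few lines of Python. This is a minimal sketch under stated assumptions: the function name is hypothetical, bins are taken to be contiguous and of equal size, and chrom_start is taken to be the left edge of each bin as in standard BedGraph files. Check these choices against the reference data before uploading.

```python
def format_insulation_bedgraph(chrom, region_start, resolution, scores):
    """Format per-bin insulation scores as header-less BedGraph-like text.

    Columns: chrom, chrom_start, chrom_end, insulation_score.
    Bin i is assumed to span region_start + i * resolution up to
    region_start + (i + 1) * resolution.
    """
    lines = []
    for index, score in enumerate(scores):
        chrom_start = region_start + index * resolution
        chrom_end = chrom_start + resolution
        lines.append(f"{chrom}\t{chrom_start}\t{chrom_end}\t{score}")
    return "\n".join(lines) + "\n"
```

For a Paired submission, the same function would be called twice, once with the WT scores and once with the mutated-sample scores, to give the two tracks.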

Step 3. Upload Predicted Data

The data can be uploaded here. The uploaded files will be available in the dropdown list here (see the next step).

If you have too many files to upload through the web form, you can upload your data via SFTP using any SFTP client, such as FileZilla or WinSCP.

Protocol:      SFTP
Host name:     gate1.cytogen.ru
Port number:   8046
Username:      sftp_user
Password:      3DGenBench

Step 4. Provide Sample Metadata & Compute Metrics

Once the data is uploaded, go here, choose the type of prediction (Single, Paired, or insulation score-only for both types), then fill in the form according to the labels. You can use the button to load an example predicted contacts file. Alternatively, example samples can be loaded as shown in the figure below.

[Figure: loading example samples on the 3DGenBench submission page]

The page allows you to submit predictions for several samples using the button.

Step 5. Explore Metrics

The status of the submission is available at the link in the success message, or here by ID. A cyan status indicates that your job is queued, orange means that it is running on the server, green shows that it completed successfully, and red indicates that there was an error. If your job has failed, you can read the job logs (at the bottom of the job page) or contact us.

If the job ends successfully, you will see the metrics page. These metrics describe the prediction accuracy of your model (see the section below). You can see an example of computed metrics by checking any ID with the green status or by submitting a test unit as described above.

What Do Output Metrics Mean?

Those metrics reflect how well the model predicts experimental Hi-C data:

  • Spearman’s correlation between experimental and predicted Hi-C matrices
  • SCC (stratum-adjusted correlation coefficient) from Yang et al. (2017), implemented by hicreppy with the max_dist parameter set to 1500000, between experimental and predicted Hi-C matrices
  • Spearman’s correlation of insulation score at each bin (computed using Cooltools calculate_insulation_score)
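To make the first metric concrete, here is a pure-Python sketch of Spearman's correlation between two symmetric contact matrices. It is not the server's implementation: comparing the upper triangles (including the diagonal) is an assumption, missing bins are not handled, and in practice a library routine such as scipy.stats.spearmanr would normally be used instead of the hand-written ranking.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


def upper_triangle(matrix):
    """Flatten the upper triangle (including the diagonal) of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def matrix_spearman(experimental, predicted):
    """Spearman correlation between two square contact matrices."""
    return spearman(upper_triangle(experimental), upper_triangle(predicted))
```

Because the measure is rank-based, any monotonic rescaling of the predicted counts leaves the score unchanged, which makes it insensitive to differences in overall normalization between predicted and experimental maps.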

Those metrics reflect how well the model captures differences in 3D genome architecture caused by the rearrangement:

  • Ectopic interactions, computed as in Bianco et al. (2018). Briefly, we subtract the WT Hi-C map from the MUT Hi-C map, distance-normalize the result, and identify values that lie more than 3 standard deviations from the mean of the distribution of the observed differences. These outliers are designated as ectopic interactions.

To quantify the overlap of ectopic interactions, we plot Precision-Recall (PR) curves, report Area Under the Curve (AUC) metrics, and show the overlap of the predicted and experimentally measured ectopic interactions compared to randomized controls.

  • Changes in insulation score. To calculate the ectopic insulation score, we compute the insulation score (using Cooltools calculate_insulation_score) at each bin for the WT and MUT conditions and divide one track by the other element-wise. This gives the fold change of the insulation score for each locus (bin).
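The two rearrangement metrics above can be sketched in pure Python. This is a simplified illustration, not the procedure of Bianco et al. (2018) or the server's code. Two details are assumptions made only for the example: the distance normalization (each diagonal of the difference map is divided by the mean of the WT/MUT average on that diagonal) and the direction of the fold change (MUT divided by WT). All function names are hypothetical.

```python
import math


def distance_normalized_difference(wt, mut):
    """Return MUT - WT, scaled per diagonal (i.e. per genomic distance).

    Assumption: each diagonal is divided by the mean of (WT + MUT) / 2 on it.
    """
    n = len(wt)
    diff = [[0.0] * n for _ in range(n)]
    for distance in range(n):
        cells = [(i, i + distance) for i in range(n - distance)]
        scale = sum((wt[i][j] + mut[i][j]) / 2 for i, j in cells) / len(cells)
        for i, j in cells:
            value = (mut[i][j] - wt[i][j]) / scale if scale > 0 else 0.0
            diff[i][j] = diff[j][i] = value
    return diff


def ectopic_interactions(wt, mut, n_sd=3.0):
    """Return upper-triangle cells (i, j) whose normalized difference lies
    more than n_sd standard deviations from the mean difference."""
    diff = distance_normalized_difference(wt, mut)
    n = len(diff)
    values = [diff[i][j] for i in range(n) for j in range(i, n)]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if sd == 0:
        return []
    return [(i, j) for i in range(n) for j in range(i, n)
            if abs(diff[i][j] - mean) > n_sd * sd]


def insulation_fold_change(wt_scores, mut_scores):
    """Element-wise MUT / WT ratio (direction is an assumption); NaN where WT is 0."""
    return [m / w if w != 0 else float("nan")
            for w, m in zip(wt_scores, mut_scores)]


# Toy data: a flat 5x5 WT map and a MUT map with one strengthened contact.
toy_wt = [[1.0] * 5 for _ in range(5)]
toy_mut = [row[:] for row in toy_wt]
toy_mut[0][3] = toy_mut[3][0] = 5.0
toy_ectopic = ectopic_interactions(toy_wt, toy_mut)  # the cell (0, 3) is flagged
```

Running the same detection on predicted and on experimental WT/MUT maps yields two sets of cells whose overlap can then be compared.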

Two additional metrics are used to compare predicted and experimental Hi-C contacts for genomic region datasets (Single):

  • Spearman’s correlation between experimental and predicted decay of contact frequency with genomic distance P(s).
  • Spearman’s correlation between experimental and predicted compartment strength computed as in Falk et al. (2019).
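The decay of contact frequency with genomic distance, P(s), can be illustrated with a short Python sketch. It simply averages each diagonal of a square contact matrix. How the server bins distances (for example, linearly or logarithmically) is not stated here, so the linear, per-bin version below is an assumption, and the function name is hypothetical.

```python
def contact_decay(matrix):
    """Return P(s): the mean contact frequency at each bin distance s.

    Element s of the result is the average of the s-th diagonal of a
    square contact matrix; s = 0 is the main diagonal.
    """
    n = len(matrix)
    return [
        sum(matrix[i][i + s] for i in range(n - s)) / (n - s)
        for s in range(n)
    ]
```

The P(s) curves computed from the experimental and the predicted matrices would then be compared with Spearman's correlation, as described in the first bullet above.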