3DGenBench is a web server for scoring the performance of 3D genomic models. 3DGenBench provides two challenges. The first challenge quantifies how accurately a model predicts experimental data. The second challenge estimates how well a model can predict changes in chromosome folding caused by structural genomic mutations.
There are five steps required to obtain 3DGenBench scores:
- Explore the reference Hi-C dataset
- Generate computational predictions of Hi-C contacts for one or multiple samples (see example data)
- Upload your predictions to the 3DGenBench server
- Provide sample metadata and compute metrics
- Explore metrics
There are two main dataset types for prediction. The rearrangement dataset (hereafter Paired) contains capture Hi-C and complementary epigenetic data, such as CTCF ChIP-seq, for wild-type and mutated samples (Homo sapiens, Mus musculus, Drosophila melanogaster cell lines). The genomic region dataset (hereafter Single) contains loci for prediction that are larger than 10 Mbp and have no known chromosome rearrangements. These datasets can be found here.
Sample metadata include the following information:
- The start prediction and end prediction columns describe the genomic region for which Hi-C interactions are expected to be predicted. The start prediction bin corresponds to the interval from (start_prediction - resolution) to start_prediction; the same rule applies to the end prediction bin.
- The Rearr #n Start and Rearr #n end columns describe the rearrangement coordinates. A sample has several such columns if multiple simultaneous mutations have been found in the region.
- The Rearrangement Type column gives the rearrangement type.
Also pay attention to the sample cell type, genomic assembly used, and available Hi-C map resolutions (5, 10, 20, 25, or 50 kb).
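The bin rule for the start prediction and end prediction columns can be illustrated with a minimal sketch. The function names are hypothetical. The sketch also assumes that both coordinates lie on the same resolution grid and that the end prediction bin is part of the region; neither assumption is confirmed by the server documentation.

```python
def bin_interval(bin_label, resolution):
    """Return the genomic interval (start, end) covered by a bin label.

    Per the rule above, a bin labelled X covers (X - resolution) to X.
    """
    return (bin_label - resolution, bin_label)


def prediction_bins(start_prediction, end_prediction, resolution):
    """List the bin labels from start_prediction to end_prediction.

    Assumption: the end prediction bin is included and both coordinates
    are on the same resolution grid.
    """
    if (end_prediction - start_prediction) % resolution != 0:
        raise ValueError("coordinates are not on the same resolution grid")
    return list(range(start_prediction, end_prediction + 1, resolution))


# Hypothetical region at 5 kb resolution.
labels = prediction_bins(1_050_000, 1_065_000, 5_000)
intervals = [bin_interval(label, 5_000) for label in labels]
```

With these hypothetical coordinates, the first bin label 1050000 stands for the interval 1045000 to 1050000.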
Hi-C maps for wild-type and mutated conditions are available in the most commonly used formats: hic, cool (for 5, 10, 20, 25, or 50 kb resolution), and pairs.
Also, for most datasets we provide supplementary tracks describing CTCF binding.
All the data can be downloaded via hyperlinks in the table.
If you need to download all available Hi-C data for one particular sample, please follow the links in the WT Archived Data or MUT Archived Data columns.
You can also explore the dataset folder on our local FTP storage using the hyperlinks in the WT FTP Folder or MUT FTP Folder columns.
The detailed description of files can be found here.
If you need to download the entire Hi-C data set, use the command:
wget -r -np https://genedev.bionet.nsc.ru/hic_out/by_Project/INC_COST_3DBenchmark/hic_dataset_zipped/
You can also download CTCF data in narrowPeak format using the links in the CTCF Data column.
These files have 2 additional columns with information about CTCF binding site orientation, calculated using GimmeMotifs.
If you want to download the entire CTCF data set, use the command:
wget -r -np https://genedev.bionet.nsc.ru/hic_out/by_Project/INC_COST_3DBenchmark/CTCF_data/
Use your computational model to predict Hi-C contacts or insulation score data for one of the reference samples.
Hi-C Contacts Input
The predicted list of contacts should be provided as a tab-separated values (TSV) file which contains the following columns:
chr contact_start contact_end contact_count
Here contact_start corresponds to the interval from (contact_start - resolution) to contact_start; the same applies to contact_end.
An example file can be downloaded here.
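As a rough illustration of the contacts format, the following sketch writes a small predicted matrix as TSV rows with the four columns listed above. The function names, the sample values, the choice to write only the upper triangle, and the absence of a header line are all assumptions; compare the output with the example file before uploading.

```python
import csv
import io


def contact_rows(chrom, bin_labels, matrix):
    """Yield (chr, contact_start, contact_end, contact_count) rows.

    bin_labels follow the convention above: a label X stands for the
    interval (X - resolution) to X. Only the upper triangle is emitted
    (assumption: the map is symmetric).
    """
    for i, start in enumerate(bin_labels):
        for j in range(i, len(bin_labels)):
            yield (chrom, start, bin_labels[j], matrix[i][j])


def write_contacts_tsv(handle, chrom, bin_labels, matrix):
    """Write the rows as tab-separated values without a header (assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(contact_rows(chrom, bin_labels, matrix))


# Hypothetical 3-bin prediction at 5 kb resolution.
bin_labels = [1_005_000, 1_010_000, 1_015_000]
matrix = [
    [10.0, 4.0, 1.5],
    [4.0, 12.0, 3.0],
    [1.5, 3.0, 9.0],
]

buffer = io.StringIO()
write_contacts_tsv(buffer, "chr1", bin_labels, matrix)
tsv_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` in place of the in-memory buffer.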
Insulation Scores Input
The predicted insulation score track should be provided as a BedGraph file without a header (technically, a BedGraph-like TSV file). The columns are as follows:
chrom chrom_start chrom_end insulation_score
For the Paired benchmark, two predicted tracks should be provided: one for the WT sample and one for the mutated sample.
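A headerless BedGraph-like track with the four columns above could be produced along these lines. This is a sketch only: the function name, the sample scores, and the use of standard BedGraph intervals (bin i spans region_start + i * resolution to the next bin boundary) are assumptions that the server documentation does not confirm.

```python
def insulation_bedgraph_lines(chrom, region_start, resolution, scores):
    """Return tab-separated chrom, chrom_start, chrom_end, insulation_score lines.

    Assumption: bin i spans region_start + i * resolution up to the next
    bin boundary, as in a standard BedGraph track.
    """
    lines = []
    for i, score in enumerate(scores):
        start = region_start + i * resolution
        end = start + resolution
        lines.append(f"{chrom}\t{start}\t{end}\t{score:.6g}")
    return lines


# Hypothetical predicted scores for three 5 kb bins.
wt_lines = insulation_bedgraph_lines("chr1", 1_000_000, 5_000, [0.12, -0.3, 0.05])
bedgraph_text = "\n".join(wt_lines) + "\n"
```

For the Paired benchmark, the same function would be called twice, once with the WT scores and once with the mutated-sample scores, and each result saved to its own file.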
- Protocol: SFTP
- Host name: gate1.cytogen.ru
- Port number: 8046
- Username: sftp_user
- Password: 3DGenBench
Once the data is uploaded, go here, choose the type of prediction (Single, Paired, or insulation score-only for both types), then fill in the form according to its labels. You can use the button to load an example predicted contacts file. Alternatively, example samples can be loaded as shown in the figure below.
The page allows you to submit predictions for several samples using the button.
The status of the submission is available at the link in the success message, or here by ID. A cyan status indicates that your job is queued, orange means that your job is running on the server, green shows that the job completed successfully, and red indicates that there was an error. If your job has failed, you can read the job logs (at the bottom of the job page) or contact us.
If the job ends successfully, you will see the metrics page. These metrics describe the prediction accuracy of your model (see the section below). You can find an example of computed metrics by checking any ID with the green status or by submitting a test unit as described above.
The following metrics reflect how well the model predicts experimental Hi-C data:
- Spearman’s correlation between experimental and predicted Hi-C matrices
- SCC (stratum-adjusted correlation coefficient) from Yang et al. (2017), implemented by hicreppy with the max_dist parameter set to 1500000, between experimental and predicted Hi-C matrices
- Spearman’s correlation of insulation score at each bin (computed using cooltools calculate_insulation_score)
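To make the first metric concrete, here is a minimal pure-Python sketch of Spearman's correlation between two small matrices, computed as the Pearson correlation of tie-averaged ranks. It is not the server's implementation: which pixels are compared (here, the upper triangle) and how missing values are handled are assumptions, and the matrices are made-up values.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos..end
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def upper_triangle(matrix):
    """Flatten the upper triangle (including the diagonal) of a square matrix."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i, size)]


# Made-up experimental and predicted maps.
experimental = [[9, 4, 1], [4, 9, 4], [1, 4, 9]]
predicted = [[8, 5, 2], [5, 7, 3], [2, 3, 6]]
rho = spearman(upper_triangle(experimental), upper_triangle(predicted))
```

In practice, `scipy.stats.spearmanr` computes the same statistic; the hand-written version is shown only to keep the example dependency-free.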
The following metrics reflect how well the model captures differences in 3D genome architecture caused by the rearrangement:
- Ectopic interactions computed as in Simona Bianco et al. (2018). Briefly, we subtract the WT Hi-C map from the MUT Hi-C map, distance-normalize the result, and identify values that lie more than 3 standard deviations from the mean of the distribution of the observed differences. Those outliers are designated as ectopic interactions.
To quantify the overlap of ectopic interactions, we plot Precision-Recall (PR) curves, report Area Under the Curve (AUC) metrics, and show the overlap of the predicted and experimentally measured ectopic interactions compared with randomized controls.
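The ectopic-interaction procedure can be sketched roughly as follows. The text does not spell out the distance normalization, so this sketch assumes one plausible choice: each difference is divided by the mean WT/MUT contact value at the same genomic distance. The function name and the toy maps are hypothetical, and the result should not be read as the server's exact method.

```python
from statistics import mean, pstdev


def ectopic_interactions(wt, mut, n_sd=3.0):
    """Return upper-triangle pixels whose normalized MUT - WT difference
    lies more than n_sd standard deviations from the mean difference.

    Assumption: distance normalization divides each difference by the
    mean WT/MUT contact value on the same diagonal.
    """
    size = len(wt)
    normalized = {}
    for distance in range(size):
        pixels = [(i, i + distance) for i in range(size - distance)]
        expected = mean((wt[i][j] + mut[i][j]) / 2 for i, j in pixels)
        if expected == 0:
            continue  # no signal at this distance, nothing to normalize
        for i, j in pixels:
            normalized[(i, j)] = (mut[i][j] - wt[i][j]) / expected
    values = list(normalized.values())
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        return []  # the maps are identical after normalization
    return sorted(
        pixel for pixel, value in normalized.items()
        if abs(value - center) > n_sd * spread
    )


# Toy maps: contacts decay with distance; MUT gains one long-range contact.
size = 6
wt = [[1.0 / (abs(i - j) + 1) for j in range(size)] for i in range(size)]
mut = [row[:] for row in wt]
mut[0][5] += 1.0
mut[5][0] += 1.0

ectopic = ectopic_interactions(wt, mut)
```

In this toy case only the pixel (0, 5) differs between the two maps, so it is the only one flagged.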
- Changes in insulation score.
To calculate the ectopic insulation score, we compute the insulation score (using cooltools calculate_insulation_score) at each bin for the WT and MUT conditions and divide one track by the other element-wise.
This gives the fold change of the insulation score for each locus (bin).
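The element-wise division can be sketched as below. The text does not say which track is the numerator, so dividing MUT by WT is an assumption, as are the function name, the sample values, and the choice to return NaN for bins where the ratio is undefined.

```python
import math


def insulation_fold_change(wt_track, mut_track):
    """Divide the MUT insulation track by the WT track, bin by bin.

    Assumptions: MUT is the numerator; bins with a zero or missing WT
    score, or a missing MUT score, yield NaN.
    """
    if len(wt_track) != len(mut_track):
        raise ValueError("tracks must cover the same bins")
    fold_changes = []
    for wt_score, mut_score in zip(wt_track, mut_track):
        if wt_score == 0 or math.isnan(wt_score) or math.isnan(mut_score):
            fold_changes.append(math.nan)
        else:
            fold_changes.append(mut_score / wt_score)
    return fold_changes


# Hypothetical per-bin insulation scores.
wt_scores = [0.5, 1.0, 0.0]
mut_scores = [1.0, 0.5, 0.2]
fold = insulation_fold_change(wt_scores, mut_scores)
```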
Two additional metrics are used to compare predicted and experimental Hi-C contacts for genomic region datasets (Single):
- Spearman’s correlation between experimental and predicted decay of contact frequency with genomic distance P(s).
- Spearman’s correlation between experimental and predicted compartment strength computed as in Martin Falk et al. (2019).
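For the decay of contact frequency with genomic distance, P(s), a minimal sketch is to average the contacts on each diagonal of the map. The function name and the toy matrix are hypothetical, and the server may bin or normalize distances differently; the resulting experimental and predicted curves would then be compared with Spearman's correlation.

```python
from statistics import mean


def contact_decay(matrix, resolution):
    """Return (genomic_distance, mean_contact) pairs, one per diagonal.

    The main diagonal (distance 0) is skipped; distance is the bin
    offset multiplied by the map resolution.
    """
    size = len(matrix)
    return [
        (offset * resolution, mean(matrix[i][i + offset] for i in range(size - offset)))
        for offset in range(1, size)
    ]


# Toy 3-bin map at 5 kb resolution.
toy_map = [[9, 4, 1], [4, 9, 4], [1, 4, 9]]
decay = contact_decay(toy_map, 5_000)
```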