openfold.data.data_modules
Classes
OpenFoldBatchCollator |
OpenFoldDataLoader |
OpenFoldDataModule |
OpenFoldDataset | Implements the stochastic filters applied during AlphaFold's training.
OpenFoldMultimerDataModule | Create a datamodule specifically for multimer training
OpenFoldMultimerDataset | Create a torch Dataset object for multimer training and add filtering steps described in AlphaFold Multimer's paper: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full.pdf Supplementary section 7.1
OpenFoldSingleDataset |
OpenFoldSingleMultimerDataset |
Functions
aa_count_filter | Check whether any single amino acid accounts for more than a max_single_aa_prop fraction of the sequence(s)
all_seq_len_filter | Check whether the total combined sequence length is >= minimum_number_of_residues
| Check that the resolution is <= the permitted max_resolution
- class OpenFoldBatchCollator
- class OpenFoldDataLoader(*args, config, stage='train', generator=None, **kwargs)
Bases:
DataLoader
- class OpenFoldDataModule(config, template_mmcif_dir, max_template_date, train_data_dir=None, train_alignment_dir=None, train_chain_data_cache_path=None, distillation_data_dir=None, distillation_alignment_dir=None, distillation_chain_data_cache_path=None, val_data_dir=None, val_alignment_dir=None, predict_data_dir=None, predict_alignment_dir=None, kalign_binary_path='/usr/bin/kalign', train_filter_path=None, distillation_filter_path=None, obsolete_pdbs_file_path=None, template_release_dates_cache_path=None, batch_seed=None, train_epoch_len=50000, _distillation_structure_index_path=None, alignment_index_path=None, distillation_alignment_index_path=None, **kwargs)
Bases:
LightningDataModule
- Parameters:
config (ConfigDict)
template_mmcif_dir (str)
max_template_date (str)
train_data_dir (str | None)
train_alignment_dir (str | None)
train_chain_data_cache_path (str | None)
distillation_data_dir (str | None)
distillation_alignment_dir (str | None)
distillation_chain_data_cache_path (str | None)
val_data_dir (str | None)
val_alignment_dir (str | None)
predict_data_dir (str | None)
predict_alignment_dir (str | None)
kalign_binary_path (str)
train_filter_path (str | None)
distillation_filter_path (str | None)
obsolete_pdbs_file_path (str | None)
template_release_dates_cache_path (str | None)
batch_seed (int | None)
train_epoch_len (int)
_distillation_structure_index_path (str | None)
alignment_index_path (str | None)
distillation_alignment_index_path (str | None)
- predict_dataloader()
- setup(stage=None)
- train_dataloader()
- val_dataloader()
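As a rough illustration of how the constructor arguments group together, the sketch below collects a subset of them in a plain dictionary. All paths and the date (including its YYYY-MM-DD format) are hypothetical placeholders, and the constructor call is left as a comment because it needs an OpenFold installation and a ConfigDict that this page does not describe.

```python
# Hypothetical placeholder values; substitute your own locations.
data_module_kwargs = {
    # Template arguments (no default in the signature above).
    "template_mmcif_dir": "/data/pdb_mmcif/mmcif_files",
    "max_template_date": "2021-10-10",  # date format is an assumption
    # Training data: structures plus precomputed alignments.
    "train_data_dir": "/data/train/mmcif",
    "train_alignment_dir": "/data/train/alignments",
    # Optional validation data.
    "val_data_dir": "/data/val/mmcif",
    "val_alignment_dir": "/data/val/alignments",
    # Defaults copied from the signature above.
    "kalign_binary_path": "/usr/bin/kalign",
    "train_epoch_len": 50000,
    "batch_seed": None,
}

# With OpenFold installed and a ConfigDict named `config` (an assumption):
# from openfold.data.data_modules import OpenFoldDataModule
# data_module = OpenFoldDataModule(config=config, **data_module_kwargs)
# data_module.setup()
# train_loader = data_module.train_dataloader()
```

Keeping the arguments in one dictionary makes it easy to reuse the same paths for other data modules that accept the same keyword names.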
- class OpenFoldDataset(datasets, probabilities, epoch_len, generator=None, _roll_at_init=True)
Bases:
Dataset
Implements the stochastic filters applied during AlphaFold’s training. Because samples are drawn at random from the constituent datasets, the length of an OpenFoldDataset is arbitrary and is set by epoch_len. Samples are selected and filtered once at initialization.
- Parameters:
datasets (Sequence[OpenFoldSingleDataset] | Sequence[OpenFoldSingleMultimerDataset])
epoch_len (int)
generator (Generator)
_roll_at_init (bool)
- static deterministic_train_filter(cache_entry, max_resolution=9.0, max_single_aa_prop=0.8, *args, **kwargs)
- static get_stochastic_train_filter_prob(cache_entry, *args, **kwargs)
- looped_samples(dataset_idx)
- looped_shuffled_dataset_idx(dataset_len)
- reroll()
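The sampling behaviour described for OpenFoldDataset can be pictured with a minimal, self-contained sketch. It is not OpenFold's implementation: `looped_shuffled_indices`, `sample_epoch`, `keep_prob`, and the toy datasets are hypothetical names. The only ideas taken from the description are that each sample's source dataset is chosen at random according to `probabilities`, that candidates pass a stochastic filter, and that the selection for an epoch of length `epoch_len` is made once, up front.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def looped_shuffled_indices(n: int, rng: random.Random) -> Iterator[int]:
    """Yield indices 0..n-1 in shuffled order, reshuffling forever."""
    while True:
        order = list(range(n))
        rng.shuffle(order)
        yield from order


def sample_epoch(
    datasets: Sequence[Sequence[str]],
    probabilities: Sequence[float],
    epoch_len: int,
    keep_prob: Callable[[str], float],
    seed: int = 0,
) -> List[Tuple[int, int]]:
    """Pick (dataset_idx, item_idx) pairs once, up front, for one epoch."""
    rng = random.Random(seed)
    loops = [looped_shuffled_indices(len(d), rng) for d in datasets]
    picks: List[Tuple[int, int]] = []
    for _ in range(epoch_len):
        # Choose which constituent dataset this sample comes from.
        d_idx = rng.choices(range(len(datasets)), weights=probabilities)[0]
        # Draw candidates until one survives the stochastic filter.
        # (Loops forever if keep_prob is 0 for every item in a dataset.)
        while True:
            i = next(loops[d_idx])
            if rng.random() < keep_prob(datasets[d_idx][i]):
                picks.append((d_idx, i))
                break
    return picks


pdb = ["1abc_A", "2xyz_B", "3def_A"]
distillation = ["d1", "d2"]
epoch = sample_epoch([pdb, distillation], [0.25, 0.75], epoch_len=8,
                     keep_prob=lambda item: 1.0)
print(len(epoch))  # 8, regardless of the sizes of the constituent datasets
```

Calling `sample_epoch` again with a different seed plays the role of a reroll in this toy version: the epoch length stays fixed while the selected items change.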
- class OpenFoldMultimerDataModule(config, template_mmcif_dir, max_template_date, train_data_dir=None, train_mmcif_data_cache_path=None, val_mmcif_data_cache_path=None, **kwargs)
Bases:
OpenFoldDataModule
Create a datamodule specifically for multimer training.
Compared to OpenFoldDataModule, OpenFoldMultimerDataModule requires mmcif_data_cache_path, which is produced by scripts/generate_mmcif_cache.py. mmcif_data_cache_path should be a file that records which chain(s) each mmCIF file contains.
- Parameters:
- setup(setup=None)
- class OpenFoldMultimerDataset(datasets, probabilities, epoch_len, generator=None, _roll_at_init=True)
Bases:
OpenFoldDataset
Create a torch Dataset object for multimer training and add the filtering steps described in AlphaFold Multimer’s paper (https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full.pdf), Supplementary section 7.1.
- Parameters:
- static deterministic_train_filter(cache_entry, is_distillation, max_resolution=9.0, max_single_aa_prop=0.8, minimum_number_of_residues=200, *args, **kwargs)
Implement the multimer training filtering criteria described in https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full.pdf, Supplementary section 7.1.
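The three checks listed in this module's function summary (resolution, single-amino-acid proportion, total length) combine naturally into one pass/fail decision. The sketch below assumes they are simply AND-ed together using the default thresholds from the signature above. The function name, its arguments, and the handling of a missing resolution are illustrative assumptions, and the role of `is_distillation` is not covered.

```python
from collections import Counter
from typing import Optional, Sequence


def passes_multimer_filters(
    resolution: Optional[float],
    seqs: Sequence[str],
    max_resolution: float = 9.0,
    max_single_aa_prop: float = 0.8,
    minimum_number_of_residues: int = 200,
) -> bool:
    """Return True if a hypothetical entry passes all three checks."""
    total_len = sum(len(s) for s in seqs)
    if total_len == 0:
        return False
    # Resolution must be known and no worse than the threshold (assumption).
    if resolution is None or resolution > max_resolution:
        return False
    # The complex must contain enough residues across all chains.
    if total_len < minimum_number_of_residues:
        return False
    # No single residue type may dominate the combined sequences.
    most_common_count = Counter("".join(seqs)).most_common(1)[0][1]
    return most_common_count / total_len <= max_single_aa_prop


good = ["ACDEFGHIKLMNPQRSTVWY" * 6, "ACDEFGHIKL" * 10]  # 220 residues
print(passes_multimer_filters(2.5, good))         # True
print(passes_multimer_filters(12.0, good))        # False: resolution worse than 9.0
print(passes_multimer_filters(2.5, ["A" * 300]))  # False: 100% alanine
```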
- static get_stochastic_train_filter_prob(cache_entry, *args, **kwargs)
- looped_samples(dataset_idx)
- class OpenFoldSingleDataset(data_dir, alignment_dir, template_mmcif_dir, max_template_date, config, chain_data_cache_path=None, kalign_binary_path='/usr/bin/kalign', max_template_hits=4, obsolete_pdbs_file_path=None, template_release_dates_cache_path=None, shuffle_top_k_prefiltered=None, treat_pdb_as_distillation=True, filter_path=None, mode='train', alignment_index=None, _output_raw=False, _structure_index=None)
Bases:
Dataset
- Parameters:
data_dir (str)
alignment_dir (str)
template_mmcif_dir (str)
max_template_date (str)
config (ConfigDict)
chain_data_cache_path (str | None)
kalign_binary_path (str)
max_template_hits (int)
obsolete_pdbs_file_path (str | None)
template_release_dates_cache_path (str | None)
shuffle_top_k_prefiltered (int | None)
treat_pdb_as_distillation (bool)
filter_path (str | None)
mode (str)
alignment_index (Any | None)
_output_raw (bool)
_structure_index (Any | None)
- chain_id_to_idx(chain_id)
- idx_to_chain_id(idx)
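The method names suggest a pair of inverse lookups between chain identifiers and integer dataset indices. A minimal sketch of that pattern follows, with a hypothetical class name and made-up chain IDs; how OpenFold orders or stores its chains is not specified here.

```python
from typing import Dict, List


class ChainIndex:
    """Toy two-way lookup between chain IDs and integer indices."""

    def __init__(self, chain_ids: List[str]) -> None:
        self._chain_ids = list(chain_ids)
        # Precompute the reverse mapping so both directions are O(1).
        self._chain_id_to_idx: Dict[str, int] = {
            cid: i for i, cid in enumerate(self._chain_ids)
        }

    def chain_id_to_idx(self, chain_id: str) -> int:
        return self._chain_id_to_idx[chain_id]

    def idx_to_chain_id(self, idx: int) -> str:
        return self._chain_ids[idx]


index = ChainIndex(["1abc_A", "1abc_B", "2xyz_A"])
print(index.chain_id_to_idx("1abc_B"))  # 1
print(index.idx_to_chain_id(2))         # 2xyz_A
```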
- class OpenFoldSingleMultimerDataset(data_dir, alignment_dir, template_mmcif_dir, max_template_date, config, mmcif_data_cache_path=None, kalign_binary_path='/usr/bin/kalign', max_template_hits=4, obsolete_pdbs_file_path=None, template_release_dates_cache_path=None, shuffle_top_k_prefiltered=None, treat_pdb_as_distillation=True, filter_path=None, mode='train', alignment_index=None, _output_raw=False, _structure_index=None)
Bases:
Dataset
- Parameters:
data_dir (str)
alignment_dir (str)
template_mmcif_dir (str)
max_template_date (str)
config (ConfigDict)
mmcif_data_cache_path (str | None)
kalign_binary_path (str)
max_template_hits (int)
obsolete_pdbs_file_path (str | None)
template_release_dates_cache_path (str | None)
shuffle_top_k_prefiltered (int | None)
treat_pdb_as_distillation (bool)
filter_path (str | None)
mode (str)
alignment_index (Any | None)
_output_raw (bool)
_structure_index (Any | None)
- idx_to_mmcif_id(idx)
- mmcif_id_to_idx(mmcif_id)
- aa_count_filter(seqs, max_single_aa_prop)
Check whether any single amino acid accounts for more than a max_single_aa_prop fraction of the sequence(s)
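To make the threshold concrete: with values such as 0.8, max_single_aa_prop reads as a fraction of all residues rather than a percentage. The helper below is a hypothetical illustration of the quantity being compared, not the library function, and it assumes at least one non-empty sequence.

```python
from collections import Counter
from typing import Sequence


def largest_single_aa_prop(seqs: Sequence[str]) -> float:
    """Fraction of all residues taken up by the most frequent amino acid."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(seq)
    total_len = sum(counts.values())
    return max(counts.values()) / total_len


seqs = ["AAAAAAAAAG", "AAAAAAAAAA"]  # 19 of 20 residues are alanine
prop = largest_single_aa_prop(seqs)
print(prop)         # 0.95
print(prop <= 0.8)  # False: exceeds a max_single_aa_prop of 0.8
```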
- all_seq_len_filter(seqs, minimum_number_of_residues)
Check whether the total combined sequence length is >= minimum_number_of_residues