openfold.data.data_modules

Classes

DummyDataLoader(batch_path)

DummyDataset(batch_path)

OpenFoldBatchCollator()

OpenFoldDataLoader(*args, config[, stage, ...])

OpenFoldDataModule(config, ...[, ...])

OpenFoldDataset(datasets, probabilities, ...)

Implements the stochastic filters applied during AlphaFold's training.

OpenFoldMultimerDataModule(config, ...[, ...])

Create a datamodule specifically for multimer training.

OpenFoldMultimerDataset(datasets, ...[, ...])

Create a torch Dataset object for multimer training and add the filtering steps described in Supplementary section 7.1 of the AlphaFold Multimer paper: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full.pdf

OpenFoldSingleDataset(data_dir, ...[, ...])

OpenFoldSingleMultimerDataset(data_dir, ...)

Functions

aa_count_filter(seqs, max_single_aa_prop)

Check whether any single amino acid accounts for more than a max_single_aa_prop proportion (for example, 0.8) of the sequence(s).

all_seq_len_filter(seqs, ...)

Check whether the total combined sequence length is >= minimum_number_of_residues.

resolution_filter(resolution, max_resolution)

Check that the resolution is <= the maximum permitted resolution (max_resolution).

class DummyDataLoader(batch_path)

Bases: LightningDataModule

train_dataloader()
class DummyDataset(batch_path)

Bases: Dataset

class OpenFoldBatchCollator
class OpenFoldDataLoader(*args, config, stage='train', generator=None, **kwargs)

Bases: DataLoader

class OpenFoldDataModule(config, template_mmcif_dir, max_template_date, train_data_dir=None, train_alignment_dir=None, train_chain_data_cache_path=None, distillation_data_dir=None, distillation_alignment_dir=None, distillation_chain_data_cache_path=None, val_data_dir=None, val_alignment_dir=None, predict_data_dir=None, predict_alignment_dir=None, kalign_binary_path='/usr/bin/kalign', train_filter_path=None, distillation_filter_path=None, obsolete_pdbs_file_path=None, template_release_dates_cache_path=None, batch_seed=None, train_epoch_len=50000, _distillation_structure_index_path=None, alignment_index_path=None, distillation_alignment_index_path=None, **kwargs)

Bases: LightningDataModule

Parameters:
  • config (ConfigDict)

  • template_mmcif_dir (str)

  • max_template_date (str)

  • train_data_dir (str | None)

  • train_alignment_dir (str | None)

  • train_chain_data_cache_path (str | None)

  • distillation_data_dir (str | None)

  • distillation_alignment_dir (str | None)

  • distillation_chain_data_cache_path (str | None)

  • val_data_dir (str | None)

  • val_alignment_dir (str | None)

  • predict_data_dir (str | None)

  • predict_alignment_dir (str | None)

  • kalign_binary_path (str)

  • train_filter_path (str | None)

  • distillation_filter_path (str | None)

  • obsolete_pdbs_file_path (str | None)

  • template_release_dates_cache_path (str | None)

  • batch_seed (int | None)

  • train_epoch_len (int)

  • _distillation_structure_index_path (str | None)

  • alignment_index_path (str | None)

  • distillation_alignment_index_path (str | None)

predict_dataloader()
setup(stage=None)
train_dataloader()
val_dataloader()
class OpenFoldDataset(datasets, probabilities, epoch_len, generator=None, _roll_at_init=True)

Bases: Dataset

Implements the stochastic filters applied during AlphaFold’s training. Because samples are selected from the constituent datasets at random, the length of an OpenFoldDataset is arbitrary. Samples are selected and filtered once at initialization.
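The selection scheme described above can be sketched roughly as follows. This is a simplified illustration, not OpenFold's implementation: sample_epoch, accept_prob, and the toy list-based datasets are hypothetical names introduced here, and only datasets, probabilities, and epoch_len mirror the constructor arguments.

```python
import random


def sample_epoch(datasets, probabilities, epoch_len, accept_prob, seed=0):
    """Draw `epoch_len` samples from several datasets (hypothetical helper).

    Each draw picks a dataset with the given probabilities, picks a random
    item from it, and keeps the item with probability `accept_prob(item)`.
    """
    rng = random.Random(seed)
    selected = []
    while len(selected) < epoch_len:
        # Weighted choice of the constituent dataset.
        dataset_idx = rng.choices(
            range(len(datasets)), weights=probabilities, k=1
        )[0]
        item = rng.choice(datasets[dataset_idx])
        # Stochastic filter: accept the candidate with its own probability.
        if rng.random() < accept_prob(item):
            selected.append((dataset_idx, item))
    return selected


# Toy data: two "datasets" of chain identifiers (made up for illustration).
toy_datasets = [["1abc_A", "2xyz_B"], ["distill_0001"]]
epoch = sample_epoch(
    toy_datasets,
    probabilities=[0.75, 0.25],
    epoch_len=8,
    # Assumed filter: always reject one entry, always accept the others.
    accept_prob=lambda item: 0.0 if item == "2xyz_B" else 1.0,
)
print(len(epoch))
```

In this sketch the number of samples is fixed by the caller through epoch_len rather than by the size of the underlying datasets, which is one way to read the statement that the dataset length is arbitrary.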

static deterministic_train_filter(cache_entry, max_resolution=9.0, max_single_aa_prop=0.8, *args, **kwargs)
Parameters:
  • cache_entry (Any)

  • max_resolution (float)

  • max_single_aa_prop (float)

Return type:

bool

static get_stochastic_train_filter_prob(cache_entry, *args, **kwargs)
Parameters:

cache_entry (Any)

Return type:

float
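This page does not state how the probability is computed. The sketch below assumes a rule in the style of the AlphaFold training procedure, where the acceptance probability grows with chain length (clamped to the range 256 to 512 residues) and shrinks with sequence-cluster size. The function name and the "seq" and "cluster_size" keys of cache_entry are assumptions made for illustration, not a documented schema.

```python
def stochastic_filter_prob_sketch(cache_entry):
    """Return an acceptance probability for one chain (illustrative only)."""
    prob = 1.0

    # Assumed key: size of the sequence cluster the chain belongs to.
    cluster_size = cache_entry.get("cluster_size")
    if cluster_size is not None and cluster_size > 0:
        # Chains from large clusters are sampled less often.
        prob *= 1.0 / cluster_size

    # Assumed key: the chain's amino-acid sequence as a string.
    chain_length = len(cache_entry["seq"])
    # Length term: clamp to [256, 512] residues, then scale into (0, 1].
    prob *= max(min(chain_length, 512), 256) / 512

    return prob


long_chain = {"seq": "A" * 600, "cluster_size": 1}
short_chain = {"seq": "A" * 100, "cluster_size": 2}
print(stochastic_filter_prob_sketch(long_chain))
print(stochastic_filter_prob_sketch(short_chain))
```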

looped_samples(dataset_idx)
looped_shuffled_dataset_idx(dataset_len)
reroll()
class OpenFoldMultimerDataModule(config, template_mmcif_dir, max_template_date, train_data_dir=None, train_mmcif_data_cache_path=None, val_mmcif_data_cache_path=None, **kwargs)

Bases: OpenFoldDataModule

Create a datamodule specifically for multimer training.

Compared to OpenFoldDataModule, OpenFoldMultimerDataModule requires mmcif_data_cache_path, which is produced by scripts/generate_mmcif_cache.py. mmcif_data_cache_path should be a file that records which chain(s) each mmCIF file contains.

Parameters:
  • config (ConfigDict)

  • template_mmcif_dir (str)

  • max_template_date (str)

  • train_data_dir (str | None)

  • train_mmcif_data_cache_path (str | None)

  • val_mmcif_data_cache_path (str | None)

setup(setup=None)
class OpenFoldMultimerDataset(datasets, probabilities, epoch_len, generator=None, _roll_at_init=True)

Bases: OpenFoldDataset

Create a torch Dataset object for multimer training and add the filtering steps described in Supplementary section 7.1 of the AlphaFold Multimer paper: https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full.pdf

static deterministic_train_filter(cache_entry, is_distillation, max_resolution=9.0, max_single_aa_prop=0.8, minimum_number_of_residues=200, *args, **kwargs)

Implement the multimer training filtering criteria described in Supplementary section 7.1 of https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2.full.pdf

Parameters:
  • cache_entry (Any)

  • is_distillation (bool)

  • max_resolution (float)

  • max_single_aa_prop (float)

  • minimum_number_of_residues (int)

Return type:

bool

static get_stochastic_train_filter_prob(cache_entry, *args, **kwargs)
Parameters:

cache_entry (Any)

Return type:

list

looped_samples(dataset_idx)
class OpenFoldSingleDataset(data_dir, alignment_dir, template_mmcif_dir, max_template_date, config, chain_data_cache_path=None, kalign_binary_path='/usr/bin/kalign', max_template_hits=4, obsolete_pdbs_file_path=None, template_release_dates_cache_path=None, shuffle_top_k_prefiltered=None, treat_pdb_as_distillation=True, filter_path=None, mode='train', alignment_index=None, _output_raw=False, _structure_index=None)

Bases: Dataset

Parameters:
  • data_dir (str)

  • alignment_dir (str)

  • template_mmcif_dir (str)

  • max_template_date (str)

  • config (ConfigDict)

  • chain_data_cache_path (str | None)

  • kalign_binary_path (str)

  • max_template_hits (int)

  • obsolete_pdbs_file_path (str | None)

  • template_release_dates_cache_path (str | None)

  • shuffle_top_k_prefiltered (int | None)

  • treat_pdb_as_distillation (bool)

  • filter_path (str | None)

  • mode (str)

  • alignment_index (Any | None)

  • _output_raw (bool)

  • _structure_index (Any | None)

chain_id_to_idx(chain_id)
idx_to_chain_id(idx)
class OpenFoldSingleMultimerDataset(data_dir, alignment_dir, template_mmcif_dir, max_template_date, config, mmcif_data_cache_path=None, kalign_binary_path='/usr/bin/kalign', max_template_hits=4, obsolete_pdbs_file_path=None, template_release_dates_cache_path=None, shuffle_top_k_prefiltered=None, treat_pdb_as_distillation=True, filter_path=None, mode='train', alignment_index=None, _output_raw=False, _structure_index=None)

Bases: Dataset

Parameters:
  • data_dir (str)

  • alignment_dir (str)

  • template_mmcif_dir (str)

  • max_template_date (str)

  • config (ConfigDict)

  • mmcif_data_cache_path (str | None)

  • kalign_binary_path (str)

  • max_template_hits (int)

  • obsolete_pdbs_file_path (str | None)

  • template_release_dates_cache_path (str | None)

  • shuffle_top_k_prefiltered (int | None)

  • treat_pdb_as_distillation (bool)

  • filter_path (str | None)

  • mode (str)

  • alignment_index (Any | None)

  • _output_raw (bool)

  • _structure_index (Any | None)

idx_to_mmcif_id(idx)
mmcif_id_to_idx(mmcif_id)
aa_count_filter(seqs, max_single_aa_prop)

Check whether any single amino acid accounts for more than a max_single_aa_prop proportion (for example, 0.8) of the sequence(s).

Parameters:
  • seqs (list)

  • max_single_aa_prop (float)

Return type:

bool
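A minimal sketch of this kind of composition check, assuming that a return value of True means the sequences pass (no residue type dominates). The helper name aa_proportion_ok and the handling of empty input are assumptions for illustration, not the library's code.

```python
from collections import Counter


def aa_proportion_ok(seqs, max_single_aa_prop):
    """Return True if no single residue type exceeds the allowed proportion."""
    counts = Counter()
    for seq in seqs:
        # Count residues across all chains together, not per chain.
        counts.update(seq)

    total = sum(counts.values())
    if total == 0:
        # Assumption: with no residues to inspect, treat the entry as failing.
        return False

    largest_prop = max(counts.values()) / total
    return largest_prop <= max_single_aa_prop


print(aa_proportion_ok(["AAAAAAAAAG"], 0.8))    # 9 of 10 residues are "A"
print(aa_proportion_ok(["ACDE", "FGHI"], 0.8))  # every residue type is rare
```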

all_seq_len_filter(seqs, minimum_number_of_residues)

Check whether the total combined sequence length is >= minimum_number_of_residues.

Parameters:
  • seqs (list)

  • minimum_number_of_residues (int)

Return type:

bool
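The check described above amounts to summing the chain lengths and comparing against the threshold. A small sketch, using the hypothetical name total_length_ok so it is not mistaken for the library function:

```python
def total_length_ok(seqs, minimum_number_of_residues):
    """Return True if the chains together reach the minimum residue count."""
    total_len = sum(len(seq) for seq in seqs)
    return total_len >= minimum_number_of_residues


# Two chains of 120 and 80 residues reach a threshold of 200 exactly.
print(total_length_ok(["A" * 120, "C" * 80], 200))
# A single 199-residue chain falls just short.
print(total_length_ok(["A" * 199], 200))
```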

resolution_filter(resolution, max_resolution)

Check that the resolution is <= the maximum permitted resolution (max_resolution).

Parameters:
  • resolution (int)

  • max_resolution (float)

Return type:

bool
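A sketch of the comparison described above. The name resolution_ok is hypothetical, and treating a missing resolution (None) as a failure is an assumption made here rather than documented behavior.

```python
def resolution_ok(resolution, max_resolution):
    """Return True if the structure's resolution is within the allowed limit."""
    # Assumption: entries without a recorded resolution do not pass.
    if resolution is None:
        return False
    # Lower values mean finer resolution, so the limit is an upper bound.
    return resolution <= max_resolution


print(resolution_ok(2.5, 9.0))
print(resolution_ok(12.0, 9.0))
```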