openfold.data.data_pipeline

Classes

AlignmentRunner([jackhmmer_binary_path, ...])

Runs alignment tools and saves the results.

DataPipeline(template_featurizer)

Assembles input features.

DataPipelineMultimer(monomer_data_pipeline)

Runs the alignment tools and assembles the input features.

Functions

add_assembly_features(all_chain_features)

Add features to distinguish between chains.

convert_monomer_features(monomer_features, ...)

Reshapes and modifies monomer features for multimer models.

int_id_to_str_id(num)

Encodes a number as a string, using reverse spreadsheet style naming.

make_dummy_msa_feats(input_sequence)

make_dummy_msa_obj(input_sequence)

make_mmcif_features(mmcif_object, chain_id)

make_msa_features(msas)

Constructs a feature dict of MSA features.

make_pdb_features(protein_object, description)

make_protein_features(protein_object, ...[, ...])

make_sequence_features(sequence, ...)

Construct a feature dict of sequence features.

make_sequence_features_with_custom_template(...)

Processes a single FASTA file using features derived from a single template rather than an alignment.

make_template_features(input_sequence, hits, ...)

pad_msa(np_example, min_num_seq)

run_msa_tool(msa_runner, fasta_path, ...[, ...])

Runs an MSA tool, checking if output already exists first.

temp_fasta_file(fasta_str)

unify_template_features(template_feature_list)

class AlignmentRunner(jackhmmer_binary_path=None, hhblits_binary_path=None, uniref90_database_path=None, mgnify_database_path=None, bfd_database_path=None, uniref30_database_path=None, uniclust30_database_path=None, uniprot_database_path=None, template_searcher=None, use_small_bfd=None, no_cpus=None, uniref_max_hits=10000, mgnify_max_hits=5000, uniprot_max_hits=50000)

Runs alignment tools and saves the results.

Parameters:
  • jackhmmer_binary_path (str | None)

  • hhblits_binary_path (str | None)

  • uniref90_database_path (str | None)

  • mgnify_database_path (str | None)

  • bfd_database_path (str | None)

  • uniref30_database_path (str | None)

  • uniclust30_database_path (str | None)

  • uniprot_database_path (str | None)

  • template_searcher (HHSearch | Hmmsearch | None)

  • use_small_bfd (bool | None)

  • no_cpus (int | None)

  • uniref_max_hits (int)

  • mgnify_max_hits (int)

  • uniprot_max_hits (int)

run(fasta_path, output_dir)

Runs alignment tools on a sequence.

Parameters:
  • fasta_path (str)

  • output_dir (str)

class DataPipeline(template_featurizer)

Assembles input features.

Parameters:

template_featurizer (TemplateHitFeaturizer | None)

process_core(core_path, alignment_dir, alignment_index=None, seqemb_mode=False)

Assembles features for a protein in a ProteinNet .core file.

Parameters:
  • core_path (str)

  • alignment_dir (str)

  • alignment_index (Any | None)

  • seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

process_fasta(fasta_path, alignment_dir, alignment_index=None, seqemb_mode=False)

Assembles features for a single sequence in a FASTA file.

Parameters:
  • fasta_path (str)

  • alignment_dir (str)

  • alignment_index (Any | None)

  • seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

process_mmcif(mmcif, alignment_dir, chain_id=None, alignment_index=None, seqemb_mode=False)

Assembles features for a specific chain in an mmCIF object.

If chain_id is None, the object is assumed to contain exactly one chain; if it contains more than one, a ValueError is raised.

Return type:

MutableMapping[str, ndarray]

process_multiseq_fasta(fasta_path, super_alignment_dir, ri_gap=200)

Assembles features for a multi-sequence FASTA. Uses Minkyung Baek’s hack from Twitter (a.k.a. AlphaFold-Gap).

Parameters:
  • fasta_path (str)

  • super_alignment_dir (str)

  • ri_gap (int)

Return type:

MutableMapping[str, ndarray]
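
As a rough illustration of the gap trick behind ri_gap, the sketch below builds a residue index for several chains concatenated into one sequence, shifting each later chain by ri_gap so that consecutive chains appear far apart. The helper name gapped_residue_index is hypothetical, and the exact offset arithmetic used inside process_multiseq_fasta is an assumption here, not taken from the library.

```python
from typing import List, Sequence


def gapped_residue_index(chain_lengths: Sequence[int], ri_gap: int = 200) -> List[int]:
    """Hypothetical helper: residue indices with a jump of ri_gap between chains."""
    residue_index: List[int] = []
    position = 0  # running index across the concatenated sequence
    for chain_number, length in enumerate(chain_lengths):
        offset = chain_number * ri_gap  # each later chain is pushed ri_gap further
        for _ in range(length):
            residue_index.append(position + offset)
            position += 1
    return residue_index


# Two chains of length 3 and 2, separated by a gap of 200 in the index.
example_index = gapped_residue_index([3, 2], ri_gap=200)
print(example_index)
```

With the default of 200, the index jumps by more than 200 positions at every chain boundary, which is the signal the gap trick relies on.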

process_pdb(pdb_path, alignment_dir, is_distillation=True, chain_id=None, _structure_index=None, alignment_index=None, seqemb_mode=False)

Assembles features for a protein in a PDB file.

Parameters:
  • pdb_path (str)

  • alignment_dir (str)

  • is_distillation (bool)

  • chain_id (str | None)

  • _structure_index (str | None)

  • alignment_index (Any | None)

  • seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

class DataPipelineMultimer(monomer_data_pipeline)

Runs the alignment tools and assembles the input features.

Parameters:

monomer_data_pipeline (DataPipeline)

get_mmcif_features(mmcif_object, chain_id)
Return type:

MutableMapping[str, ndarray]

process_fasta(fasta_path, alignment_dir, alignment_index=None)

Creates features.

Parameters:
  • fasta_path (str)

  • alignment_dir (str)

  • alignment_index (Any | None)

Return type:

MutableMapping[str, ndarray]

process_mmcif(mmcif, alignment_dir, alignment_index=None)
Return type:

MutableMapping[str, ndarray]

add_assembly_features(all_chain_features)

Add features to distinguish between chains.

Parameters:

all_chain_features (MutableMapping[str, MutableMapping[str, ndarray]]) – A dictionary which maps chain_id to a dictionary of features for each chain.

Returns:

A dictionary which maps strings of the form <seq_id>_<sym_id> to the corresponding chain features. E.g. two chains from a homodimer would have keys A_1 and A_2. Two chains from a heterodimer would have keys A_1 and B_1.

Return type:

MutableMapping[str, MutableMapping[str, ndarray]]
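
The key scheme described above can be sketched as follows. This is a simplified illustration that only derives the <seq_id>_<sym_id> keys from chain sequences; the helper name assembly_keys and the input chain IDs are hypothetical, the sketch handles at most 26 distinct sequences, and it does not reproduce the per-chain feature arrays that add_assembly_features works with.

```python
import string
from typing import Dict


def assembly_keys(chain_sequences: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: map each input chain ID to a '<seq_id>_<sym_id>' key.

    Chains with identical sequences share a seq_id letter and receive
    increasing sym_id values. Limited to 26 distinct sequences for brevity.
    """
    seq_to_letter: Dict[str, str] = {}
    copies_seen: Dict[str, int] = {}
    keys: Dict[str, str] = {}
    for chain_id, sequence in chain_sequences.items():
        if sequence not in seq_to_letter:
            # First time this sequence appears: assign the next letter.
            seq_to_letter[sequence] = string.ascii_uppercase[len(seq_to_letter)]
        # Count how many copies of this sequence have been seen so far.
        copies_seen[sequence] = copies_seen.get(sequence, 0) + 1
        keys[chain_id] = f"{seq_to_letter[sequence]}_{copies_seen[sequence]}"
    return keys


homodimer = assembly_keys({"chain_1": "MKV", "chain_2": "MKV"})
heterodimer = assembly_keys({"chain_1": "MKV", "chain_2": "GSA"})
print(homodimer, heterodimer)
```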

convert_monomer_features(monomer_features, chain_id)

Reshapes and modifies monomer features for multimer models.

Return type:

MutableMapping[str, ndarray]

int_id_to_str_id(num)

Encodes a number as a string, using reverse spreadsheet style naming.

Parameters:

num (int) – A positive integer.

Returns:

A string that encodes the positive integer using reverse spreadsheet style naming, e.g. 1 = A, 2 = B, …, 27 = AA, 28 = BA, 29 = CA, … This is the usual way to encode chain IDs in mmCIF files.

Return type:

str
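
To make the reverse spreadsheet naming concrete, here is a minimal sketch of one way to produce the encoding described above. The function name int_id_to_str_id_sketch is hypothetical and the body is an independent re-implementation for illustration, not necessarily the library's code; the least significant letter is emitted first, which is why 28 becomes BA rather than AB.

```python
def int_id_to_str_id_sketch(num: int) -> str:
    """Illustrative re-implementation of reverse spreadsheet style naming."""
    if num <= 0:
        raise ValueError(f"Only positive integers allowed, got {num}.")
    num -= 1  # switch to 0-based so that 1 maps to 'A'
    letters = []
    while num >= 0:
        # Emit the least significant letter first (hence "reverse").
        letters.append(chr(num % 26 + ord("A")))
        num = num // 26 - 1
    return "".join(letters)


for n in (1, 2, 26, 27, 28, 29):
    print(n, int_id_to_str_id_sketch(n))
```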

make_dummy_msa_feats(input_sequence)
Return type:

MutableMapping[str, ndarray]

make_dummy_msa_obj(input_sequence)
Return type:

Msa

make_mmcif_features(mmcif_object, chain_id)
Return type:

MutableMapping[str, ndarray]

make_msa_features(msas)

Constructs a feature dict of MSA features.

Parameters:

msas (Sequence[Msa])

Return type:

MutableMapping[str, ndarray]

make_pdb_features(protein_object, description, is_distillation=True, confidence_threshold=50.0)
Parameters:
  • protein_object (Protein)

  • description (str)

  • is_distillation (bool)

  • confidence_threshold (float)

Return type:

MutableMapping[str, ndarray]

make_protein_features(protein_object, description, _is_distillation=False)
Return type:

MutableMapping[str, ndarray]

make_sequence_features(sequence, description, num_res)

Construct a feature dict of sequence features.

Parameters:
  • sequence (str)

  • description (str)

  • num_res (int)

Return type:

MutableMapping[str, ndarray]

make_sequence_features_with_custom_template(sequence, mmcif_path, pdb_id, chain_id, kalign_binary_path)

Processes a single FASTA file using features derived from a single template rather than an alignment.

Parameters:
  • sequence (str)

  • mmcif_path (str)

  • pdb_id (str)

  • chain_id (str)

  • kalign_binary_path (str)

Return type:

MutableMapping[str, ndarray]

make_template_features(input_sequence, hits, template_featurizer)
Return type:

MutableMapping[str, ndarray]

pad_msa(np_example, min_num_seq)

run_msa_tool(msa_runner, fasta_path, msa_out_path, msa_format, max_sto_sequences=None)

Runs an MSA tool, checking if output already exists first.

Parameters:
  • fasta_path (str)

  • msa_out_path (str)

  • msa_format (str)

  • max_sto_sequences (int | None)

Return type:

Mapping[str, Any]
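
The "check if output already exists first" behaviour can be sketched as a simple file-based cache. Everything named below (FakeRunner, run_tool_cached, the file names) is hypothetical and stands in for a real MSA tool wrapper; the sketch assumes the runner returns a mapping keyed by format name and ignores max_sto_sequences.

```python
import os
import tempfile
from typing import Any, Mapping


class FakeRunner:
    """Hypothetical stand-in for an MSA tool wrapper; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def query(self, fasta_path: str) -> Mapping[str, Any]:
        self.calls += 1
        return {"sto": "# STOCKHOLM 1.0\n//\n"}


def run_tool_cached(
    runner: FakeRunner, fasta_path: str, msa_out_path: str, msa_format: str
) -> Mapping[str, Any]:
    """Run the tool only if msa_out_path is missing; otherwise reuse the file."""
    if os.path.exists(msa_out_path):
        with open(msa_out_path) as f:
            return {msa_format: f.read()}
    result = runner.query(fasta_path)
    # Persist the alignment so that later calls can skip the expensive search.
    with open(msa_out_path, "w") as f:
        f.write(result[msa_format])
    return result


runner = FakeRunner()
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "hits.sto")
    first = run_tool_cached(runner, "query.fasta", out_path, "sto")
    second = run_tool_cached(runner, "query.fasta", out_path, "sto")
print(runner.calls)
```

The second call finds the output file written by the first and returns its contents without invoking the runner again.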

temp_fasta_file(fasta_str)
Parameters:

fasta_str (str)

unify_template_features(template_feature_list)
Parameters:

template_feature_list (Sequence[MutableMapping[str, ndarray]])

Return type:

MutableMapping[str, ndarray]