openfold.data.data_pipeline

Classes

`AlignmentRunner`([jackhmmer_binary_path, ...])	Runs alignment tools and saves the results.
`DataPipeline`(template_featurizer)	Assembles input features.
`DataPipelineMultimer`(monomer_data_pipeline)	Runs the alignment tools and assembles the input features.

Functions

`add_assembly_features`(all_chain_features)	Add features to distinguish between chains.
`convert_monomer_features`(monomer_features, ...)	Reshapes and modifies monomer features for multimer models.
`int_id_to_str_id`(num)	Encodes a number as a string, using reverse spreadsheet style naming.
`make_dummy_msa_feats`(input_sequence)
`make_dummy_msa_obj`(input_sequence)
`make_mmcif_features`(mmcif_object, chain_id)
`make_msa_features`(msas)	Constructs a feature dict of MSA features.
`make_pdb_features`(protein_object, description)
`make_protein_features`(protein_object, ...[, ...])
`make_sequence_features`(sequence, ...)	Constructs a feature dict of sequence features.
`make_sequence_features_with_custom_template`(...)	Processes a single FASTA file using features derived from a single template rather than an alignment.
`make_template_features`(input_sequence, hits, ...)
`pad_msa`(np_example, min_num_seq)
`run_msa_tool`(msa_runner, fasta_path, ...[, ...])	Runs an MSA tool, checking if output already exists first.
`temp_fasta_file`(fasta_str)
`unify_template_features`(template_feature_list)

class AlignmentRunner(jackhmmer_binary_path=None, hhblits_binary_path=None, uniref90_database_path=None, mgnify_database_path=None, bfd_database_path=None, uniref30_database_path=None, uniclust30_database_path=None, uniprot_database_path=None, template_searcher=None, use_small_bfd=None, no_cpus=None, uniref_max_hits=10000, mgnify_max_hits=5000, uniprot_max_hits=50000)

Runs alignment tools and saves the results.

Parameters:

jackhmmer_binary_path (str | None)
hhblits_binary_path (str | None)
uniref90_database_path (str | None)
mgnify_database_path (str | None)
bfd_database_path (str | None)
uniref30_database_path (str | None)
uniclust30_database_path (str | None)
uniprot_database_path (str | None)
template_searcher (HHSearch | Hmmsearch | None)
use_small_bfd (bool | None)
no_cpus (int | None)
uniref_max_hits (int)
mgnify_max_hits (int)
uniprot_max_hits (int)

run(fasta_path, output_dir)

Runs alignment tools on a sequence.

Parameters:

fasta_path (str)
output_dir (str)

class DataPipeline(template_featurizer)

Assembles input features.

Parameters:

template_featurizer (TemplateHitFeaturizer | None)

process_core(core_path, alignment_dir, alignment_index=None, seqemb_mode=False)

Assembles features for a protein in a ProteinNet .core file.

Parameters:

core_path (str)
alignment_dir (str)
alignment_index (Any | None)
seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

process_fasta(fasta_path, alignment_dir, alignment_index=None, seqemb_mode=False)

Assembles features for a single sequence in a FASTA file.

Parameters:

fasta_path (str)
alignment_dir (str)
alignment_index (Any | None)
seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

process_mmcif(mmcif, alignment_dir, chain_id=None, alignment_index=None, seqemb_mode=False)

Assembles features for a specific chain in an mmCIF object.

If chain_id is None, the object is assumed to contain exactly one chain. If that assumption does not hold, a ValueError is raised.

Parameters:

mmcif (MmcifObject)
alignment_dir (str)
chain_id (str | None)
alignment_index (Any | None)
seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

process_multiseq_fasta(fasta_path, super_alignment_dir, ri_gap=200)

Assembles features for a multi-sequence FASTA. Uses Minkyung Baek’s hack from Twitter (a.k.a. AlphaFold-Gap).

Parameters:

fasta_path (str)
super_alignment_dir (str)
ri_gap (int)

Return type:

MutableMapping[str, ndarray]
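
As a rough illustration of what ri_gap controls, the sketch below builds a residue index for several chains joined into one sequence and skips ri_gap positions at each chain boundary. The helper name gapped_residue_index and the exact offset convention are assumptions made for illustration; this is not OpenFold's implementation.

```python
from typing import List


def gapped_residue_index(chain_lengths: List[int], ri_gap: int = 200) -> List[int]:
    """Hypothetical helper: residue indices for chains joined with a gap."""
    index: List[int] = []
    next_start = 0
    for length in chain_lengths:
        # Number the residues of this chain consecutively.
        index.extend(range(next_start, next_start + length))
        # Assumed convention: skip ri_gap positions before the next chain.
        next_start += length + ri_gap
    return index


# Two chains of 3 and 2 residues, separated by a gap of 200 positions.
print(gapped_residue_index([3, 2], ri_gap=200))  # [0, 1, 2, 203, 204]
```

The large jump in the index is what signals a chain break to a model that otherwise sees one continuous sequence.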

process_pdb(pdb_path, alignment_dir, is_distillation=True, chain_id=None, _structure_index=None, alignment_index=None, seqemb_mode=False)

Assembles features for a protein in a PDB file.

Parameters:

pdb_path (str)
alignment_dir (str)
is_distillation (bool)
chain_id (str | None)
_structure_index (str | None)
alignment_index (Any | None)
seqemb_mode (bool)

Return type:

MutableMapping[str, ndarray]

class DataPipelineMultimer(monomer_data_pipeline)

Runs the alignment tools and assembles the input features.

Parameters:

monomer_data_pipeline (DataPipeline)

get_mmcif_features(mmcif_object, chain_id)

Parameters:

mmcif_object (MmcifObject)
chain_id (str)

Return type:

MutableMapping[str, ndarray]

process_fasta(fasta_path, alignment_dir, alignment_index=None)

Creates features.

Parameters:

fasta_path (str)
alignment_dir (str)
alignment_index (Any | None)

Return type:

MutableMapping[str, ndarray]

process_mmcif(mmcif, alignment_dir, alignment_index=None)

Parameters:

mmcif (MmcifObject)
alignment_dir (str)
alignment_index (Any | None)

Return type:

MutableMapping[str, ndarray]

add_assembly_features(all_chain_features)

Add features to distinguish between chains.

Parameters:

all_chain_features (MutableMapping[str, MutableMapping[str, ndarray]]) – A dictionary which maps chain_id to a dictionary of features for each chain.

Returns:

A dictionary that maps strings of the form <seq_id>_<sym_id> to the corresponding chain features. For example, two chains from a homodimer would have keys A_1 and A_2, and two chains from a heterodimer would have keys A_1 and B_1.

Return type:

all_chain_features
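
The key scheme described above can be sketched as follows. This is a minimal illustration, assuming that chains with identical sequences share a seq_id letter and are numbered by sym_id in order of appearance. The helper name assembly_keys is hypothetical, and unlike the documented function it returns only the key assigned to each chain, not the feature dictionaries.

```python
from typing import Dict


def assembly_keys(chain_sequences: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: map each chain ID to a '<seq_id>_<sym_id>' key."""
    seq_to_letter: Dict[str, str] = {}
    copies_seen: Dict[str, int] = {}
    keys: Dict[str, str] = {}
    for chain_id, sequence in chain_sequences.items():
        if sequence not in seq_to_letter:
            # New unique sequence: assign the next letter (A, B, C, ...).
            # This sketch only covers up to 26 unique sequences.
            seq_to_letter[sequence] = chr(ord("A") + len(seq_to_letter))
        # Count how many copies of this sequence have been seen so far.
        copies_seen[sequence] = copies_seen.get(sequence, 0) + 1
        keys[chain_id] = f"{seq_to_letter[sequence]}_{copies_seen[sequence]}"
    return keys


homodimer = assembly_keys({"chain1": "MKVL", "chain2": "MKVL"})
heterodimer = assembly_keys({"chain1": "MKVL", "chain2": "GSHM"})
print(homodimer)    # {'chain1': 'A_1', 'chain2': 'A_2'}
print(heterodimer)  # {'chain1': 'A_1', 'chain2': 'B_1'}
```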

convert_monomer_features(monomer_features, chain_id)

Reshapes and modifies monomer features for multimer models.

Parameters:

monomer_features (MutableMapping[str, ndarray])
chain_id (str)

Return type:

MutableMapping[str, ndarray]

int_id_to_str_id(num)

Encodes a number as a string, using reverse spreadsheet style naming.

Parameters:

num (int) – A positive integer.

Returns:

A string that encodes the positive integer using reverse spreadsheet style naming, e.g. 1 = A, 2 = B, …, 27 = AA, 28 = BA, 29 = CA, … This is the usual way to encode chain IDs in mmCIF files.

Return type:

str
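
To make the naming scheme concrete, here is a minimal sketch of reverse spreadsheet encoding that reproduces the examples given above (1 = A, 27 = AA, 28 = BA). The function name reverse_spreadsheet_id is hypothetical; the sketch is written from the description and is not presented as OpenFold's own code.

```python
def reverse_spreadsheet_id(num: int) -> str:
    """Hypothetical helper: encode a positive integer as A, B, ..., Z, AA, BA, ..."""
    if num <= 0:
        raise ValueError(f"Only positive integers are allowed, got {num}.")
    num -= 1  # Shift to a zero-based value so that 1 maps to 'A'.
    letters = []
    while num >= 0:
        # The least significant letter is emitted first, hence "reverse".
        letters.append(chr(num % 26 + ord("A")))
        num = num // 26 - 1
    return "".join(letters)


for n in (1, 2, 26, 27, 28, 29):
    print(n, reverse_spreadsheet_id(n))
```

In an ordinary spreadsheet, column 28 is AB. Here the fastest-changing letter comes first, so 28 becomes BA.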

make_dummy_msa_feats(input_sequence)

Return type:

MutableMapping[str, ndarray]

make_dummy_msa_obj(input_sequence)

Return type:

Msa

make_mmcif_features(mmcif_object, chain_id)

Parameters:

mmcif_object (MmcifObject)
chain_id (str)

Return type:

MutableMapping[str, ndarray]

make_msa_features(msas)

Constructs a feature dict of MSA features.

Parameters:

msas (Sequence[Msa])

Return type:

MutableMapping[str, ndarray]

make_pdb_features(protein_object, description, is_distillation=True, confidence_threshold=50.0)

Parameters:

protein_object (Protein)
description (str)
is_distillation (bool)
confidence_threshold (float)

Return type:

MutableMapping[str, ndarray]

make_protein_features(protein_object, description, _is_distillation=False)

Parameters:

protein_object (Protein)
description (str)
_is_distillation (bool)

Return type:

MutableMapping[str, ndarray]

make_sequence_features(sequence, description, num_res)

Constructs a feature dict of sequence features.

Parameters:

sequence (str)
description (str)
num_res (int)

Return type:

MutableMapping[str, ndarray]

make_sequence_features_with_custom_template(sequence, mmcif_path, pdb_id, chain_id, kalign_binary_path)

Processes a single FASTA file using features derived from a single template rather than an alignment.

Parameters:

sequence (str)
mmcif_path (str)
pdb_id (str)
chain_id (str)
kalign_binary_path (str)

Return type:

MutableMapping[str, ndarray]

make_template_features(input_sequence, hits, template_featurizer)

Parameters:

input_sequence (str)
hits (Sequence[Any])
template_featurizer (Any)

Return type:

MutableMapping[str, ndarray]

pad_msa(np_example, min_num_seq)

run_msa_tool(msa_runner, fasta_path, msa_out_path, msa_format, max_sto_sequences=None)

Runs an MSA tool, checking if output already exists first.

Parameters:

fasta_path (str)
msa_out_path (str)
msa_format (str)
max_sto_sequences (int | None)

Return type:

Mapping[str, Any]
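
The "check for existing output first" behaviour can be pictured with the generic caching sketch below. It assumes the tool is any callable that returns text. The names run_with_cache and fake_tool are hypothetical, and the return value is simplified to a plain string rather than the mapping the documented function returns.

```python
import os
import tempfile
from typing import Callable, List


def run_with_cache(tool: Callable[[], str], out_path: str) -> str:
    """Hypothetical helper: reuse out_path if present, else run tool and save."""
    if os.path.exists(out_path):
        # Output from an earlier run exists, so skip the expensive search.
        with open(out_path) as f:
            return f.read()
    result = tool()
    with open(out_path, "w") as f:
        f.write(result)
    return result


calls: List[int] = []


def fake_tool() -> str:
    """Stand-in for an alignment tool; records each invocation."""
    calls.append(1)
    return ">query\nMKVL\n"


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hits.sto")
    first = run_with_cache(fake_tool, path)   # runs the tool and writes the file
    second = run_with_cache(fake_tool, path)  # reads the cached file instead

print(len(calls))  # 1
```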

temp_fasta_file(fasta_str)

Parameters:

fasta_str (str)

unify_template_features(template_feature_list)

Parameters:

template_feature_list (Sequence[MutableMapping[str, ndarray]])

Return type:

MutableMapping[str, ndarray]