openfold.data.parsers

Functions for parsing various file formats.

Classes

HitMetadata(pdb_id, chain, start, end, ...)

Msa(sequences, deletion_matrix, descriptions)

Class representing a parsed MSA file.

TemplateHit(index, name, aligned_cols, ...)

Class representing a template hit.

Functions

convert_stockholm_to_a3m(stockholm_format[, ...])

Converts MSA in Stockholm format to the A3M format.

deduplicate_stockholm_msa(stockholm_msa)

Removes duplicate sequences (ignoring insertions with respect to the query).

parse_a3m(a3m_string)

Parses sequences and deletion matrix from a3m format alignment.

parse_e_values_from_tblout(tblout)

Parses the target-to-e-value mapping from a Jackhmmer tblout string.

parse_fasta(fasta_string)

Parses FASTA string and returns list of strings with amino-acid sequences.

parse_hhr(hhr_string)

Parses the content of an entire HHR file.

parse_hmmsearch_a3m(query_sequence, a3m_string)

Parses an a3m string produced by hmmsearch.

parse_hmmsearch_sto(output_string, ...)

Gets parsed template hits from the raw string output by the tool.

parse_stockholm(stockholm_string)

Parses sequences and deletion matrix from stockholm format alignment.

remove_empty_columns_from_stockholm_msa(...)

Removes empty columns (dashes-only) from a Stockholm MSA.

truncate_stockholm_msa(stockholm_msa_path, ...)

Reads and truncates a Stockholm file while preventing excessive RAM usage.

class HitMetadata(pdb_id: str, chain: str, start: int, end: int, length: int, text: str)
Parameters:
chain: str
end: int
length: int
pdb_id: str
start: int
text: str
class Msa(sequences, deletion_matrix, descriptions)

Class representing a parsed MSA file.

Parameters:
deletion_matrix: Sequence[Sequence[int]]
descriptions: Sequence[str] | None
sequences: Sequence[str]
truncate(max_seqs)
Parameters:

max_seqs (int)
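
To illustrate how the three Msa fields relate to truncate, here is a minimal sketch. MsaSketch is a hypothetical stand-in and not the openfold class. It assumes that truncation keeps the first max_seqs rows of every field, which this reference does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MsaSketch:
    """Hypothetical stand-in for Msa; not the openfold implementation."""
    sequences: Sequence[str]
    deletion_matrix: Sequence[Sequence[int]]
    descriptions: Optional[Sequence[str]] = None

    def truncate(self, max_seqs: int) -> "MsaSketch":
        # Assumption: truncation keeps the first max_seqs rows of each field.
        descriptions = None
        if self.descriptions is not None:
            descriptions = self.descriptions[:max_seqs]
        return MsaSketch(
            sequences=self.sequences[:max_seqs],
            deletion_matrix=self.deletion_matrix[:max_seqs],
            descriptions=descriptions,
        )


msa = MsaSketch(
    sequences=["MKV", "MRV", "M-V"],
    deletion_matrix=[[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    descriptions=["q", "a", "b"],
)
short = msa.truncate(2)
```

Slicing all three fields together keeps row i of the deletion matrix and description i attached to sequence i.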
class TemplateHit(index, name, aligned_cols, sum_probs, query, hit_sequence, indices_query, indices_hit)

Class representing a template hit.

Parameters:
aligned_cols: int
hit_sequence: str
index: int
indices_hit: List[int]
indices_query: List[int]
name: str
query: str
sum_probs: float | None
convert_stockholm_to_a3m(stockholm_format, max_sequences=None, remove_first_row_gaps=True)

Converts MSA in Stockholm format to the A3M format.

Parameters:
  • stockholm_format (str)

  • max_sequences (int | None)

  • remove_first_row_gaps (bool)

Return type:

str
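
The core of a Stockholm-to-A3M conversion can be sketched for a single aligned row. The helper stockholm_row_to_a3m below is hypothetical and is not part of openfold. It assumes the usual A3M convention: columns where the query has a residue are kept as-is, residues in query-gap columns become lowercase insertions, and columns where both rows have a gap are dropped. This corresponds to the behavior one would expect with remove_first_row_gaps=True.

```python
def stockholm_row_to_a3m(query_row: str, target_row: str) -> str:
    """Hypothetical helper: converts one aligned row to A3M relative to the query."""
    out = []
    for query_res, target_res in zip(query_row, target_row):
        if query_res != "-":
            # Match column: keep the residue or gap unchanged.
            out.append(target_res)
        elif target_res != "-":
            # Insertion relative to the query: written in lowercase.
            out.append(target_res.lower())
        # Both rows have a gap here: the column is dropped.
    return "".join(out)


converted = [stockholm_row_to_a3m("MK-V", row) for row in ["MK-V", "MRAV", "M--V"]]
```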

deduplicate_stockholm_msa(stockholm_msa)

Removes duplicate sequences (ignoring insertions with respect to the query).

Parameters:

stockholm_msa (str)

Return type:

str
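
The idea of ignoring insertions with respect to the query can be sketched as follows. The helper unique_names_ignoring_insertions is hypothetical; it works on an already-parsed mapping of sequence name to aligned row rather than on a raw Stockholm string, and it assumes the first entry is the query.

```python
from itertools import compress
from typing import Dict, List


def unique_names_ignoring_insertions(rows: Dict[str, str]) -> List[str]:
    """Hypothetical helper: rows maps name -> aligned row; first entry is the query."""
    query_row = next(iter(rows.values()))
    # True at every column where the query has a residue.
    mask = [res != "-" for res in query_row]
    seen = set()
    kept = []
    for name, row in rows.items():
        # Compare rows only on query-aligned columns, so insertions are ignored.
        masked = "".join(compress(row, mask))
        if masked not in seen:
            seen.add(masked)
            kept.append(name)
    return kept


rows = {"query": "MK-V", "hit1": "MKAV", "hit2": "MR-V", "hit3": "MRGV"}
kept = unique_names_ignoring_insertions(rows)
```

Here hit1 differs from the query only by an insertion and hit3 differs from hit2 only by an insertion, so both are treated as duplicates.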

parse_a3m(a3m_string)

Parses sequences and deletion matrix from a3m format alignment.

Parameters:

a3m_string (str) – The string contents of an a3m file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:

  • A list of sequences that have been aligned to the query. These might contain duplicates.

  • The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
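
How a deletion matrix can be derived from a3m rows is sketched below. The helper a3m_rows_to_msa is hypothetical and takes the sequence rows directly rather than a full a3m string. It assumes the standard a3m convention that lowercase letters mark insertions relative to the query; each count is recorded at the next aligned position.

```python
from typing import List, Tuple


def a3m_rows_to_msa(rows: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical helper: returns aligned rows and per-position deletion counts."""
    aligned, deletion_matrix = [], []
    for row in rows:
        deletion_vec = []
        deletion_count = 0
        for ch in row:
            if ch.islower():
                # Lowercase residue: insertion relative to the query.
                deletion_count += 1
            else:
                # Aligned position: record how many insertions preceded it.
                deletion_vec.append(deletion_count)
                deletion_count = 0
        deletion_matrix.append(deletion_vec)
        # The aligned row is the a3m row with the lowercase letters removed.
        aligned.append("".join(ch for ch in row if not ch.islower()))
    return aligned, deletion_matrix


aligned, deletions = a3m_rows_to_msa(["MKV", "MabK-"])
```

In the second row, the two lowercase residues are counted at the position of K, giving deletions[1][1] == 2.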

parse_e_values_from_tblout(tblout)

Parses the target-to-e-value mapping from a Jackhmmer tblout string.

Parameters:

tblout (str)

Return type:

Dict[str, float]
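
A minimal sketch of extracting such a mapping is shown below. The function e_values_from_tblout_sketch is hypothetical. It assumes the HMMER tblout layout in which comment lines start with # and the full-sequence E-value is the fifth whitespace-separated field; whether the openfold function adds extra entries (for example one for the query) is not stated here.

```python
from typing import Dict


def e_values_from_tblout_sketch(tblout: str) -> Dict[str, float]:
    """Hypothetical sketch: maps target name -> full-sequence E-value."""
    e_values = {}
    for line in tblout.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Assumed layout: field 0 is the target name, field 4 the E-value.
        e_values[fields[0]] = float(fields[4])
    return e_values


tblout = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "sp|P1|A        -          query       -          1.5e-10  40.2   0.1\n"
    "sp|P2|B        -          query       -          0.003    12.0   0.0\n"
)
e_values = e_values_from_tblout_sketch(tblout)
```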

parse_fasta(fasta_string)

Parses FASTA string and returns list of strings with amino-acid sequences.

Parameters:

fasta_string (str) – The string contents of a FASTA file.

Returns:

A tuple of two lists:

  • A list of sequences.

  • A list of sequence descriptions taken from the comment lines, in the same order as the sequences.
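
The two-list return shape can be illustrated with a small sketch. parse_fasta_sketch is a hypothetical re-implementation for illustration only; it assumes that description lines start with > and that a sequence may span several lines.

```python
from typing import List, Tuple


def parse_fasta_sketch(fasta_string: str) -> Tuple[List[str], List[str]]:
    """Hypothetical sketch: returns (sequences, descriptions) in file order."""
    sequences, descriptions = [], []
    for line in fasta_string.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # New record: store the description without the leading '>'.
            descriptions.append(line[1:])
            sequences.append("")
        elif line and sequences:
            # Sequence data: append to the current record.
            sequences[-1] += line
    return sequences, descriptions


sequences, descriptions = parse_fasta_sketch(">seq1 desc\nMKV\nLLA\n>seq2\nMRV\n")
```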

parse_hhr(hhr_string)

Parses the content of an entire HHR file.

Parameters:

hhr_string (str)

Return type:

Sequence[TemplateHit]

parse_hmmsearch_a3m(query_sequence, a3m_string, skip_first=True)

Parses an a3m string produced by hmmsearch.

Parameters:
  • query_sequence (str) – The query sequence.

  • a3m_string (str) – The a3m string produced by hmmsearch.

  • skip_first (bool) – Whether to skip the first sequence in the a3m string.

Returns:

A sequence of TemplateHit results.

Return type:

Sequence[TemplateHit]

parse_hmmsearch_sto(output_string, input_sequence)

Gets parsed template hits from the raw string output by the tool.

Parameters:
  • output_string (str)

  • input_sequence (str)

Return type:

Sequence[TemplateHit]

parse_stockholm(stockholm_string)

Parses sequences and deletion matrix from stockholm format alignment.

Parameters:

stockholm_string (str) – The string contents of a Stockholm file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:

  • A list of sequences that have been aligned to the query. These might contain duplicates.

  • The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.

  • The names of the targets matched, including the jackhmmer subsequence suffix.
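
The three returned values can be illustrated with a sketch. stockholm_to_msa_sketch is hypothetical and is not the openfold implementation. It assumes that lines starting with # or // carry no alignment data, that each remaining line has the form "name row", and that columns where the query has a gap count as insertions.

```python
from typing import List, Tuple


def stockholm_to_msa_sketch(
    stockholm: str,
) -> Tuple[List[str], List[List[int]], List[str]]:
    """Hypothetical sketch: returns (aligned rows, deletion matrix, names)."""
    rows = {}
    for line in stockholm.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue
        name, chunk = line.split()
        # Rows of the same name may be split across several blocks.
        rows[name] = rows.get(name, "") + chunk
    names = list(rows)
    query = rows[names[0]]  # assumption: the first sequence is the query
    msa, deletion_matrix = [], []
    for row in rows.values():
        aligned, deletion_vec, count = [], [], 0
        for res, query_res in zip(row, query):
            if query_res == "-":
                if res != "-":
                    count += 1  # residue in a query-gap column: insertion
            else:
                aligned.append(res)
                deletion_vec.append(count)
                count = 0
        msa.append("".join(aligned))
        deletion_matrix.append(deletion_vec)
    return msa, deletion_matrix, names


sto = "# STOCKHOLM 1.0\n\nquery/1-3 MK-V\nhit/5-8 MRAV\n//\n"
msa, deletions, names = stockholm_to_msa_sketch(sto)
```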

remove_empty_columns_from_stockholm_msa(stockholm_msa)

Removes empty columns (dashes-only) from a Stockholm MSA.

Parameters:

stockholm_msa (str)

Return type:

str
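
The column filter itself can be sketched on already-extracted alignment rows. drop_gap_only_columns is a hypothetical helper; the openfold function operates on a full Stockholm string, and how it handles annotation lines is not covered here.

```python
from typing import List


def drop_gap_only_columns(rows: List[str]) -> List[str]:
    """Hypothetical helper: removes columns in which every row has a dash."""
    # A column is kept if at least one row has a non-dash character in it.
    keep = [any(ch != "-" for ch in column) for column in zip(*rows)]
    return ["".join(ch for ch, k in zip(row, keep) if k) for row in rows]


trimmed = drop_gap_only_columns(["M--V", "MK-V", "M--A"])
```

Only the third column is dashes-only, so it is the only one removed.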

truncate_stockholm_msa(stockholm_msa_path, max_sequences)

Reads and truncates a Stockholm file while preventing excessive RAM usage.

Parameters:
  • stockholm_msa_path (str)

  • max_sequences (int)

Return type:

str
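
One way to bound memory use is to stream the file line by line and keep only rows belonging to the first max_sequences distinct names. The sketch below takes any iterable of lines (such as an open file) instead of a path. truncate_stockholm_lines is a hypothetical helper and makes no claim about how openfold implements this; in particular, it keeps every # annotation line, including those for dropped sequences.

```python
import io
from typing import Iterable


def truncate_stockholm_lines(lines: Iterable[str], max_sequences: int) -> str:
    """Hypothetical sketch: keeps rows for the first max_sequences names."""
    seen = set()
    kept = []
    for line in lines:
        if line.strip() and not line.startswith(("#", "//")):
            name = line.split()[0]
            if name not in seen and len(seen) >= max_sequences:
                continue  # a new sequence beyond the limit: skip its row
            seen.add(name)
        # Header, annotation, terminator, and accepted rows are kept.
        kept.append(line)
    return "".join(kept)


sto = "# STOCKHOLM 1.0\nq MKV\na MRV\nb MAV\n//\n"
truncated = truncate_stockholm_lines(io.StringIO(sto), 2)
```

Only the set of accepted names and the kept lines are held in memory, so rows of sequences beyond the limit are never accumulated.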