openfold.data.parsers

Functions for parsing various file formats.

Classes

`HitMetadata`(pdb_id, chain, start, end, ...)
`Msa`(sequences, deletion_matrix, descriptions)	Class representing a parsed MSA file.
`TemplateHit`(index, name, aligned_cols, ...)	Class representing a template hit.

Functions

`convert_stockholm_to_a3m`(stockholm_format[, ...])	Converts MSA in Stockholm format to the A3M format.
`deduplicate_stockholm_msa`(stockholm_msa)	Removes duplicate sequences (ignoring insertions with respect to the query).
`parse_a3m`(a3m_string)	Parses sequences and deletion matrix from a3m format alignment.
`parse_e_values_from_tblout`(tblout)	Parses the target-to-e-value mapping from a Jackhmmer tblout string.
`parse_fasta`(fasta_string)	Parses FASTA string and returns list of strings with amino-acid sequences.
`parse_hhr`(hhr_string)	Parses the content of an entire HHR file.
`parse_hmmsearch_a3m`(query_sequence, a3m_string)	Parses an a3m string produced by hmmsearch.
`parse_hmmsearch_sto`(output_string, ...)	Gets parsed template hits from the raw string output by the tool.
`parse_stockholm`(stockholm_string)	Parses sequences and deletion matrix from a Stockholm format alignment.
`remove_empty_columns_from_stockholm_msa`(...)	Removes empty columns (dashes-only) from a Stockholm MSA.
`truncate_stockholm_msa`(stockholm_msa_path, ...)	Reads and truncates a Stockholm file while preventing excessive RAM usage.

class HitMetadata(pdb_id: str, chain: str, start: int, end: int, length: int, text: str)

Parameters:

pdb_id (str)
chain (str)
start (int)
end (int)
length (int)
text (str)

chain: str

end: int

length: int

pdb_id: str

start: int

text: str

class Msa(sequences, deletion_matrix, descriptions)

Class representing a parsed MSA file.

Parameters:

sequences (Sequence[str])
deletion_matrix (Sequence[Sequence[int]])
descriptions (Sequence[str] | None)

truncate(max_seqs)

Parameters: max_seqs (int)

deletion_matrix: Sequence[Sequence[int]]

descriptions: Sequence[str] | None

sequences: Sequence[str]
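
The three fields and the truncate(max_seqs) method can be pictured with a small stand-in. This is a sketch only: the class name MsaSketch is hypothetical, and the assumption that truncation keeps the first max_seqs rows of every field is ours, not something this page states.

```python
import dataclasses
from typing import Optional, Sequence


@dataclasses.dataclass(frozen=True)
class MsaSketch:
    """Hypothetical stand-in for Msa; not the library's implementation."""

    sequences: Sequence[str]
    deletion_matrix: Sequence[Sequence[int]]
    descriptions: Optional[Sequence[str]] = None

    def truncate(self, max_seqs: int) -> "MsaSketch":
        # Assumed behaviour: keep the first max_seqs rows of every field.
        descriptions = (
            None if self.descriptions is None else self.descriptions[:max_seqs]
        )
        return MsaSketch(
            sequences=self.sequences[:max_seqs],
            deletion_matrix=self.deletion_matrix[:max_seqs],
            descriptions=descriptions,
        )


msa = MsaSketch(
    sequences=["MKV", "MRV", "M-V"],
    deletion_matrix=[[0, 0, 0], [0, 1, 0], [0, 0, 0]],
    descriptions=["query", "hit_1", "hit_2"],
)
truncated = msa.truncate(2)
print(truncated.sequences)
```

Keeping the three fields row-aligned means a single slice per field is enough to truncate consistently.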

class TemplateHit(index, name, aligned_cols, sum_probs, query, hit_sequence, indices_query, indices_hit)

Class representing a template hit.

Parameters:

index (int)
name (str)
aligned_cols (int)
sum_probs (float | None)
query (str)
hit_sequence (str)
indices_query (List[int])
indices_hit (List[int])

aligned_cols: int

hit_sequence: str

index: int

indices_hit: List[int]

indices_query: List[int]

name: str

query: str

sum_probs: float | None

convert_stockholm_to_a3m(stockholm_format, max_sequences=None, remove_first_row_gaps=True)

Converts MSA in Stockholm format to the A3M format.

Parameters:

stockholm_format (str)
max_sequences (int | None)
remove_first_row_gaps (bool)

Return type:

str

deduplicate_stockholm_msa(stockholm_msa)

Removes duplicate sequences (ignoring insertions with respect to the query).

Parameters: stockholm_msa (str)
Return type: str

parse_a3m(a3m_string)

Parses sequences and deletion matrix from a3m format alignment.

Parameters:

a3m_string (str) – The string contents of an a3m file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:

A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
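
To make the deletion_matrix[i][j] convention concrete, here is a minimal sketch. It assumes the usual a3m convention that lowercase letters mark insertions relative to the query, and it attaches each run of insertions to the next aligned position. The helper name a3m_rows_to_alignment is hypothetical, and it takes already-split rows rather than a raw a3m string.

```python
import string
from typing import List, Tuple

# Translation table that deletes every lowercase ASCII letter.
_DROP_LOWERCASE = str.maketrans("", "", string.ascii_lowercase)


def a3m_rows_to_alignment(rows: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical helper: split a3m rows into aligned rows and deletion counts."""
    deletion_matrix = []
    for row in rows:
        counts = []
        pending = 0  # insertions seen since the last aligned position
        for char in row:
            if char.islower():
                pending += 1
            else:
                counts.append(pending)
                pending = 0
        deletion_matrix.append(counts)
    aligned = [row.translate(_DROP_LOWERCASE) for row in rows]
    return aligned, deletion_matrix


aligned, deletions = a3m_rows_to_alignment(["MKV", "MaRV", "M-Vgg"])
print(aligned)
print(deletions)
```

In this sketch, insertions after the last aligned position (the trailing "gg") are not recorded, because there is no following position to attach them to.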

parse_e_values_from_tblout(tblout)

Parses the target-to-e-value mapping from a Jackhmmer tblout string.

Parameters: tblout (str)
Return type: Dict[str, float]
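
A rough sketch of what such a mapping could look like, assuming the HMMER tblout layout in which the first whitespace-separated field is the target name and the fifth is the full-sequence E-value. The function name e_values_from_tblout_sketch and the sample rows are illustrative only, and the library's own function may differ in details.

```python
from typing import Dict


def e_values_from_tblout_sketch(tblout: str) -> Dict[str, float]:
    """Hypothetical helper: map target name to E-value from tblout text."""
    e_values: Dict[str, float] = {}
    for line in tblout.splitlines():
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Assumed columns: 0 = target name, 4 = full-sequence E-value.
        e_values[fields[0]] = float(fields[4])
    return e_values


example_tblout = (
    "# target name  accession  query name  accession  E-value  score  bias\n"
    "sp|P12345|A_1  -  query  -  1.2e-30  110.5  0.1\n"
    "tr|Q67890|B_2  -  query  -  3.4e-05   25.0  0.0\n"
)
e_values = e_values_from_tblout_sketch(example_tblout)
print(e_values)
```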

parse_fasta(fasta_string)

Parses FASTA string and returns list of strings with amino-acid sequences.

Parameters:

fasta_string (str) – The string contents of a FASTA file.

Returns:

A tuple of two lists:

A list of sequences.
A list of sequence descriptions taken from the comment lines, in the same order as the sequences.
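
The two parallel lists can be illustrated with a minimal FASTA reader. This is a sketch under simple assumptions (description lines start with ">", sequence lines may wrap, blank lines are ignored); parse_fasta_sketch is a hypothetical name, not the library function.

```python
from typing import List, Tuple


def parse_fasta_sketch(fasta_string: str) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (sequences, descriptions) in file order."""
    sequences: List[str] = []
    descriptions: List[str] = []
    for line in fasta_string.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new record: store its description and start an empty sequence.
            descriptions.append(line[1:])
            sequences.append("")
        elif line and sequences:
            # Wrapped sequence lines are concatenated onto the current record.
            sequences[-1] += line
    return sequences, descriptions


fasta = ">seq1 first\nMKV\nLLA\n\n>seq2\nGGS\n"
sequences, descriptions = parse_fasta_sketch(fasta)
print(sequences, descriptions)
```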

parse_hhr(hhr_string)

Parses the content of an entire HHR file.

Parameters: hhr_string (str)
Return type: Sequence[TemplateHit]

parse_hmmsearch_a3m(query_sequence, a3m_string, skip_first=True)

Parses an a3m string produced by hmmsearch.

Parameters:

query_sequence (str) – The query sequence.
a3m_string (str) – The a3m string produced by hmmsearch.
skip_first (bool) – Whether to skip the first sequence in the a3m string.

Returns:

A sequence of TemplateHit results.

Return type:

Sequence[TemplateHit]

parse_hmmsearch_sto(output_string, input_sequence)

Gets parsed template hits from the raw string output by the tool.

Parameters:

output_string (str)
input_sequence (str)

Return type:

Sequence[TemplateHit]

parse_stockholm(stockholm_string)

Parses sequences and deletion matrix from a Stockholm format alignment.

Parameters:

stockholm_string (str) – The string contents of a Stockholm file. The first sequence in the file should be the query sequence.

Returns:

A tuple of:

A list of sequences that have been aligned to the query. These might contain duplicates.
The deletion matrix for the alignment as a list of lists. The element at deletion_matrix[i][j] is the number of residues deleted from the aligned sequence i at residue position j.
The names of the targets matched, including the jackhmmer subsequence suffix.
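
The relationship between the query row, the aligned sequences, and the deletion matrix can be sketched as follows. The sketch assumes a non-empty Stockholm string whose sequence lines have exactly two whitespace-separated fields (name and residues), treats columns where the query has a gap as insertions, and attaches each run of insertions to the next query position. parse_stockholm_sketch is a hypothetical name, not the library function.

```python
from typing import Dict, List, Tuple


def parse_stockholm_sketch(
    stockholm_string: str,
) -> Tuple[List[str], List[List[int]], List[str]]:
    """Hypothetical helper: (aligned sequences, deletion matrix, names)."""
    name_to_row: Dict[str, str] = {}  # dicts keep insertion order
    for line in stockholm_string.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue  # skip blank lines, markup lines, and the terminator
        name, residues = line.split()
        # Interleaved blocks repeat names, so rows are concatenated.
        name_to_row[name] = name_to_row.get(name, "") + residues

    names = list(name_to_row)
    query = name_to_row[names[0]]  # first sequence is assumed to be the query
    sequences: List[str] = []
    deletion_matrix: List[List[int]] = []
    for name in names:
        aligned: List[str] = []
        counts: List[int] = []
        pending = 0  # residues inserted relative to the query so far
        for query_res, res in zip(query, name_to_row[name]):
            if query_res == "-":
                if res != "-":
                    pending += 1
            else:
                aligned.append(res)
                counts.append(pending)
                pending = 0
        sequences.append("".join(aligned))
        deletion_matrix.append(counts)
    return sequences, deletion_matrix, names


sto = "# STOCKHOLM 1.0\nquery  MK-V\nhit_1  MRAV\nhit_2  M--V\n//\n"
sto_sequences, sto_deletions, sto_names = parse_stockholm_sketch(sto)
print(sto_sequences, sto_deletions, sto_names)
```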

remove_empty_columns_from_stockholm_msa(stockholm_msa)

Removes empty columns (dashes-only) from a Stockholm MSA.

Parameters: stockholm_msa (str)
Return type: str
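
The idea of a dashes-only column can be shown on plain aligned rows. Note the simplification: this sketch works on a list of equal-length row strings rather than on a full Stockholm string, and drop_gap_only_columns is a hypothetical helper name.

```python
from typing import List


def drop_gap_only_columns(rows: List[str]) -> List[str]:
    """Hypothetical helper: remove columns in which every row has '-'."""
    if not rows:
        return []
    # zip(*rows) iterates over columns; rows are assumed to be equal length.
    keep = [any(res != "-" for res in column) for column in zip(*rows)]
    return [
        "".join(res for res, kept in zip(row, keep) if kept) for row in rows
    ]


cleaned = drop_gap_only_columns(["MK-V-", "MR-V-", "M--VA"])
print(cleaned)
```

Only the third column is removed here; the last column survives because one row carries a residue in it.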

truncate_stockholm_msa(stockholm_msa_path, max_sequences)

Reads and truncates a Stockholm file while preventing excessive RAM usage.

Parameters:

stockholm_msa_path (str)
max_sequences (int)

Return type:

str