openfold.data.data_transforms_multimer

Functions

`build_extra_msa_feat`(batch)	Expand extra_msa into 1hot and concat with other extra msa features.
`create_msa_feat`(batch)	Create and concatenate MSA features.
`create_target_feat`(batch)	Create the target features
`get_contiguous_crop_idx`(protein, crop_size, ...)
`get_interface_residues`(positions, atom_mask, ...)
`get_spatial_crop_idx`(protein, crop_size, ...)
`gumbel_argsort_sample_idx`(logits[, generator])	Samples without replacement from a distribution given by 'logits'.
`gumbel_max_sample`(logits[, generator])	Samples from a probability distribution given by 'logits'.
`gumbel_noise`(shape, device[, eps, generator])	Generate Gumbel Noise of given Shape.
`make_masked_msa`(batch, config, ...[, eps])	Create data for BERT on raw MSA.
`make_msa_profile`(batch)	Compute the MSA profile.
`nearest_neighbor_clusters`(batch[, ...])	Assign each extra MSA sequence to its nearest neighbor in sampled MSA.
`randint`(lower, upper, generator, device)
`random_crop_to_size`(protein, crop_size, ...)	Crop randomly to crop_size, or keep as is if shorter than that.
`sample_msa`(batch, max_seq, ...[, inf])	Sample MSA randomly, remaining sequences are stored as extra_*.

build_extra_msa_feat(batch)

Expand extra_msa into 1hot and concat with other extra msa features.

We do this as late as possible as the one_hot extra msa can be very large.

Parameters:

batch – A dictionary with the following keys:
- ‘extra_msa’: [num_seq, num_res] MSA that wasn’t selected as a cluster centre. Note: this isn’t one-hotted.
- ‘extra_deletion_matrix’: [num_seq, num_res] Number of deletions at given position.
num_extra_msa – Number of extra msa to use.

Returns:

Concatenated tensor of extra MSA features.
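
The expand-then-concatenate step described above can be sketched without any tensor library. This is a minimal illustration, not the OpenFold implementation: the helper names, the nested-list layout, the choice of 4 residue classes, and the use of the raw deletion count as the only extra channel are all assumptions made for the example.

```python
def one_hot(index, num_classes):
    """Return a list of length num_classes with 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def build_extra_msa_feat_sketch(extra_msa, extra_deletion_matrix, num_classes):
    """Concatenate one-hot residue classes with a per-position deletion count.

    Both inputs are [num_seq][num_res] nested lists.
    The result is [num_seq][num_res][num_classes + 1].
    """
    feat = []
    for seq_row, del_row in zip(extra_msa, extra_deletion_matrix):
        feat.append([
            one_hot(aa, num_classes) + [float(n_del)]
            for aa, n_del in zip(seq_row, del_row)
        ])
    return feat


# Toy input: 2 sequences, 3 residues, 4 hypothetical residue classes.
extra_msa = [[0, 2, 3], [1, 1, 0]]
extra_deletion_matrix = [[0, 1, 0], [2, 0, 0]]
feat = build_extra_msa_feat_sketch(extra_msa, extra_deletion_matrix, num_classes=4)
```

The last dimension grows from 1 integer to num_classes + 1 floats per position, which is why deferring the expansion keeps memory use down.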

create_msa_feat(batch)

Create and concatenate MSA features.

create_target_feat(batch)

Create the target features.

get_contiguous_crop_idx(protein, crop_size, generator)

get_interface_residues(positions, atom_mask, asym_id, interface_threshold)

get_spatial_crop_idx(protein, crop_size, interface_threshold, generator)

gumbel_argsort_sample_idx(logits, generator=None)

Samples without replacement from a distribution given by ‘logits’.

This uses the Gumbel trick to implement the sampling in an efficient manner. For a distribution over k items this samples k times without replacement, so it effectively samples a random permutation, with the probability of each permutation derived from the logprobs.

Parameters: logits (Tensor) – Logarithm of probabilities to sample from, probabilities can be unnormalized.
Returns: Sampled permutation, as indices into logits.
Return type: Tensor
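
The permutation sampling described above can be illustrated with a dependency-free sketch: add independent Gumbel noise to each logit, then sort the indices by the perturbed values in descending order. The function name, the plain-list input, the `random.Random` generator, and the `eps` guard are assumptions for this example, not the library's code.

```python
import math
import random


def gumbel_argsort_sketch(logits, rng, eps=1e-6):
    """Return a random permutation of range(len(logits)).

    Items with larger logits tend to appear earlier in the permutation.
    """
    perturbed = []
    for logit in logits:
        u = rng.random()
        # eps keeps both logarithms finite.
        g = -math.log(-math.log(u + eps) + eps)
        perturbed.append(logit + g)
    # Sort indices by perturbed logit, largest first.
    return sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)


perm = gumbel_argsort_sketch([0.0, 1.0, 2.0], random.Random(0))
dominant = gumbel_argsort_sketch([100.0, 0.0, 0.0], random.Random(1))
```

Every index appears exactly once in the result, and an item whose logit is far larger than the rest (as in `dominant`) always comes first, because the noise produced with this `eps` is bounded well below that gap.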

gumbel_max_sample(logits, generator=None)

Samples from a probability distribution given by ‘logits’.

This uses the Gumbel-max trick to implement the sampling in an efficient manner.

Parameters: logits (Tensor) – Logarithm of probabilities to sample from, probabilities can be unnormalized.
Returns: Sample from logprobs in one-hot form.
Return type: Tensor
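
As a rough illustration of the Gumbel-max trick: perturb each logit with independent Gumbel noise and take the argmax, which selects index i with probability (approximately, given the `eps` guard) softmax(logits)[i]. The sketch below uses only the standard library; the function name, list-based one-hot output, and `eps` value are assumptions for the example.

```python
import math
import random


def gumbel_max_sketch(logits, rng, eps=1e-6):
    """Draw one index from softmax(logits) and return it as a one-hot list."""
    perturbed = [
        logit - math.log(-math.log(rng.random() + eps) + eps)
        for logit in logits
    ]
    winner = max(range(len(logits)), key=lambda i: perturbed[i])
    return [1 if i == winner else 0 for i in range(len(logits))]


# Unnormalized log-weights 7 : 2 : 1, i.e. probabilities 0.7, 0.2, 0.1.
logits = [math.log(7.0), math.log(2.0), math.log(1.0)]
rng = random.Random(0)
num_draws = 5000
counts = [0, 0, 0]
for _ in range(num_draws):
    sample = gumbel_max_sketch(logits, rng)
    counts[sample.index(1)] += 1
freq = [c / num_draws for c in counts]
```

The empirical frequencies in `freq` should approach 0.7, 0.2 and 0.1 even though the logits were never normalized, since adding a constant to every logit does not change the argmax.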

gumbel_noise(shape, device, eps=1e-06, generator=None)

Generate Gumbel Noise of given Shape.

This generates samples from Gumbel(0, 1).

Parameters:

shape (Sequence[int]) – Shape of noise to return.
device (device)

Returns:

Gumbel noise of given shape.

Return type:

Tensor
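
One common way to draw Gumbel(0, 1) samples is inverse-transform sampling: if U is uniform on (0, 1), then -log(-log(U)) follows Gumbel(0, 1). The sketch below shows that construction with the standard library only. How `eps` enters the formula, the flat-list output, and the function name are assumptions for illustration; the page above does not spell out the library's exact expression.

```python
import math
import random


def gumbel_noise_sketch(n, eps=1e-6, rng=None):
    """Return n approximate Gumbel(0, 1) draws via inverse-transform sampling."""
    if rng is None:
        rng = random.Random()
    # If U ~ Uniform(0, 1), then -log(-log(U)) ~ Gumbel(0, 1).
    # eps keeps both logarithms finite when U is very close to 0 or 1.
    return [-math.log(-math.log(rng.random() + eps) + eps) for _ in range(n)]


noise = gumbel_noise_sketch(20000, rng=random.Random(0))
sample_mean = sum(noise) / len(noise)
```

The mean of Gumbel(0, 1) is the Euler-Mascheroni constant (about 0.577), so `sample_mean` should land near that value.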

make_masked_msa(batch, config, replace_fraction, seed, eps=1e-06)

Create data for BERT on raw MSA.

make_msa_profile(batch)

Compute the MSA profile.
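
For orientation, an MSA profile is usually understood as the per-column frequency of each residue type across the aligned sequences. The sketch below computes that quantity on nested lists. It assumes this usual meaning, ignores any masking the real function may apply, and uses a hypothetical function name and 3 residue classes.

```python
def msa_profile_sketch(msa, num_classes):
    """Return [num_res][num_classes] residue-type frequencies per column."""
    num_seq = len(msa)
    num_res = len(msa[0])
    profile = [[0.0] * num_classes for _ in range(num_res)]
    for row in msa:
        for pos, aa in enumerate(row):
            # Each sequence contributes an equal share to its residue class.
            profile[pos][aa] += 1.0 / num_seq
    return profile


# Toy MSA: 4 sequences, 2 columns, residue classes 0..2.
msa = [[0, 1], [0, 2], [1, 2], [0, 2]]
profile = msa_profile_sketch(msa, num_classes=3)
```

Each row of `profile` sums to 1, so it can be read as a distribution over residue types at that position.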

nearest_neighbor_clusters(batch, gap_agreement_weight=0.0)

Assign each extra MSA sequence to its nearest neighbor in sampled MSA.
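
A minimal sketch of the assignment idea, assuming "nearest" means the sampled sequence with the highest position-wise agreement, and that `gap_agreement_weight` scales how much a gap-to-gap match counts. The gap code 21, the function name, and the tie-breaking rule are hypothetical choices for this example.

```python
GAP = 21  # hypothetical integer code for the gap token


def nearest_neighbor_clusters_sketch(msa, extra_msa, gap_agreement_weight=0.0):
    """Return, for each extra sequence, the index of its best-matching MSA row."""
    assignment = []
    for extra_row in extra_msa:
        scores = []
        for centre_row in msa:
            score = 0.0
            for a, b in zip(extra_row, centre_row):
                if a == b:
                    # Gap-gap matches count with a separate weight.
                    score += gap_agreement_weight if a == GAP else 1.0
            scores.append(score)
        # Ties go to the lowest index.
        assignment.append(max(range(len(msa)), key=lambda i: scores[i]))
    return assignment


msa = [[0, 1, 2, 3], [4, 5, GAP, GAP]]
extra_msa = [[0, 1, 2, 9], [4, 5, 2, 9], [7, 8, GAP, GAP]]
default_assignment = nearest_neighbor_clusters_sketch(msa, extra_msa)
gap_aware_assignment = nearest_neighbor_clusters_sketch(
    msa, extra_msa, gap_agreement_weight=1.0
)
```

With the default weight of 0.0, the third extra sequence matches neither centre and falls back to index 0; with a weight of 1.0 its two shared gaps pull it to the second centre.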

randint(lower, upper, generator, device)

random_crop_to_size(protein, crop_size, max_templates, shape_schema, spatial_crop_prob, interface_threshold, subsample_templates=False, seed=None)

Crop randomly to crop_size, or keep as is if shorter than that.

sample_msa(batch, max_seq, max_extra_msa_seq, seed, inf=1000000.0)

Sample MSA randomly, remaining sequences are stored as extra_*.

Parameters:

batch – batch to sample msa from.
max_seq – number of sequences to sample.

Returns:

Protein with sampled msa.
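
The split into sampled and extra sequences can be sketched as follows. This is an illustration under stated assumptions, not the library's logic: it assumes row 0 is the query and is always kept, that `max_extra_msa_seq` caps how many leftover rows are retained, and it uses a plain shuffle and a hypothetical function name and output keys.

```python
import random


def sample_msa_sketch(msa, max_seq, max_extra_msa_seq, seed):
    """Split MSA rows into a sampled set and a leftover 'extra' set."""
    rng = random.Random(seed)
    # Assumption: row 0 is the query sequence and is always kept first.
    rest = list(range(1, len(msa)))
    rng.shuffle(rest)
    order = [0] + rest
    selected = order[:max_seq]
    # Assumption: max_extra_msa_seq caps how many leftover rows are kept.
    extra = order[max_seq:max_seq + max_extra_msa_seq]
    return {
        "msa": [msa[i] for i in selected],
        "extra_msa": [msa[i] for i in extra],
    }


msa = [[i, i, i] for i in range(6)]  # six toy sequences of length 3
out = sample_msa_sketch(msa, max_seq=3, max_extra_msa_seq=2, seed=0)
```

Passing the same seed reproduces the same split, and no row appears in both `out["msa"]` and `out["extra_msa"]`.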