openfold.data.data_transforms_multimer

Functions

build_extra_msa_feat(batch)

Expand extra_msa into one-hot form and concatenate it with the other extra MSA features.

create_msa_feat(batch)

Create and concatenate MSA features.

create_target_feat(batch)

Create the target features.

get_contiguous_crop_idx(protein, crop_size, ...)

get_interface_residues(positions, atom_mask, ...)

get_spatial_crop_idx(protein, crop_size, ...)

gumbel_argsort_sample_idx(logits[, generator])

Samples without replacement from a distribution given by 'logits'.

gumbel_max_sample(logits[, generator])

Samples from a probability distribution given by 'logits'.

gumbel_noise(shape, device[, eps, generator])

Generate Gumbel noise of the given shape.

make_masked_msa(batch, config, ...[, eps])

Create data for BERT on raw MSA.

make_msa_profile(batch)

Compute the MSA profile.

nearest_neighbor_clusters(batch[, ...])

Assign each extra MSA sequence to its nearest neighbor in sampled MSA.

randint(lower, upper, generator, device)

random_crop_to_size(protein, crop_size, ...)

Crop randomly to crop_size, or keep as is if shorter than that.

sample_msa(batch, max_seq, ...[, inf])

Sample MSA randomly; the remaining sequences are stored as extra_*.

build_extra_msa_feat(batch)

Expand extra_msa into one-hot form and concatenate it with the other extra MSA features.

This is done as late as possible because the one-hot extra MSA can be very large.

Parameters:
  • batch – a dictionary with the following keys:

    • ‘extra_msa’: [num_seq, num_res] MSA that wasn’t selected as a cluster centre. Note: this isn’t one-hotted.

    • ‘extra_deletion_matrix’: [num_seq, num_res] Number of deletions at given position.

  • num_extra_msa – Number of extra msa to use.

Returns:

Concatenated tensor of extra MSA features.
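The expansion described above can be sketched in plain Python on nested lists. This is an illustration under stated assumptions, not the module's implementation: the alphabet size `num_classes=23`, the `has_deletion` flag, and the arctan-squashed deletion value are assumptions chosen for the example, and `extra_msa_feat_sketch` is a hypothetical name.

```python
import math


def extra_msa_feat_sketch(extra_msa, extra_deletion_matrix, num_classes=23):
    """Return features of shape [num_seq][num_res][num_classes + 2]."""
    feats = []
    for seq, deletions in zip(extra_msa, extra_deletion_matrix):
        row = []
        for residue, n_del in zip(seq, deletions):
            # One-hot encode the integer residue index.
            one_hot = [0.0] * num_classes
            one_hot[residue] = 1.0
            # Assumed extra features: a deletion flag and a squashed count.
            has_deletion = 1.0 if n_del > 0 else 0.0
            deletion_value = math.atan(n_del / 3.0) * (2.0 / math.pi)
            # Concatenate along the last (feature) axis.
            row.append(one_hot + [has_deletion, deletion_value])
        feats.append(row)
    return feats


feats = extra_msa_feat_sketch([[0, 5], [22, 1]], [[0, 3], [0, 0]])
```

Each position grows from one integer to `num_classes + 2` floats, which is why deferring this step keeps memory use low for large extra MSAs.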

create_msa_feat(batch)

Create and concatenate MSA features.

create_target_feat(batch)

Create the target features.

get_contiguous_crop_idx(protein, crop_size, generator)

get_interface_residues(positions, atom_mask, asym_id, interface_threshold)

get_spatial_crop_idx(protein, crop_size, interface_threshold, generator)

gumbel_argsort_sample_idx(logits, generator=None)

Samples without replacement from a distribution given by ‘logits’.

This uses the Gumbel trick to implement the sampling in an efficient manner. For a distribution over k items this samples k times without replacement, so it effectively samples a random permutation, with the probabilities over permutations derived from the logprobs.

Parameters:

logits (Tensor) – Log-probabilities to sample from; the probabilities may be unnormalized.

Returns:

Indices of the sampled permutation.

Return type:

Tensor
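The permutation sampling described above can be sketched with the standard library alone. This is a minimal illustration, assuming one Gumbel(0, 1) perturbation per logit followed by a descending sort; `gumbel_argsort_sketch` is a hypothetical name and the function works on plain lists rather than tensors.

```python
import math
import random


def gumbel_argsort_sketch(logits, rng=None, eps=1e-6):
    """Return a random permutation of range(len(logits)), likely items first."""
    if rng is None:
        rng = random.Random()
    # Add independent Gumbel(0, 1) noise to every logit.
    perturbed = [
        logit - math.log(-math.log(rng.random() + eps) + eps)
        for logit in logits
    ]
    # Sorting by perturbed logit, highest first, draws all k items
    # without replacement.
    return sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)


order = gumbel_argsort_sketch([0.0, 1.0, 2.0, 3.0], rng=random.Random(0))
```

The first element of the result is distributed according to the softmax of the logits; each later element follows the softmax over the items not yet drawn.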

gumbel_max_sample(logits, generator=None)

Samples from a probability distribution given by ‘logits’.

This uses the Gumbel-max trick to implement the sampling in an efficient manner.

Parameters:

logits (Tensor) – Log-probabilities to sample from; the probabilities may be unnormalized.

Returns:

Sample from logprobs in one-hot form.

Return type:

Tensor
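A minimal sketch of the Gumbel-max trick, assuming plain Python lists in place of tensors: perturb each logit with Gumbel(0, 1) noise, take the argmax, and return it in one-hot form. `gumbel_max_sample_sketch` is a hypothetical name, not part of this module.

```python
import math
import random


def gumbel_max_sample_sketch(logits, rng=None, eps=1e-6):
    """Draw one index from softmax(logits) and return it as a one-hot list."""
    if rng is None:
        rng = random.Random()
    # Add independent Gumbel(0, 1) noise to every logit.
    perturbed = [
        logit - math.log(-math.log(rng.random() + eps) + eps)
        for logit in logits
    ]
    # The index of the largest perturbed logit is the sample.
    winner = max(range(len(logits)), key=lambda i: perturbed[i])
    return [1 if i == winner else 0 for i in range(len(logits))]


# Unnormalized log-probabilities are fine: only their differences matter.
one_hot = gumbel_max_sample_sketch([0.0, 1.0, 2.0], rng=random.Random(0))
```

No normalization step is needed because adding a constant to every logit leaves the argmax unchanged.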

gumbel_noise(shape, device, eps=1e-06, generator=None)

Generate Gumbel noise of the given shape.

This generates samples from Gumbel(0, 1).

Parameters:

shape – Shape of the noise to return.

Returns:

Gumbel noise of given shape.

Return type:

Tensor
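Gumbel(0, 1) samples can be obtained from uniform samples by inverse transform sampling, g = -log(-log(u)). The sketch below assumes that `eps` is added inside both logarithms to keep them finite; that placement, the flat list output, and the name `gumbel_noise_sketch` are assumptions made for illustration.

```python
import math
import random


def gumbel_noise_sketch(n, eps=1e-6, rng=None):
    """Return n independent Gumbel(0, 1) samples as a list of floats."""
    if rng is None:
        rng = random.Random()
    samples = []
    for _ in range(n):
        u = rng.random()  # uniform in [0, 1)
        # eps keeps both logarithms finite when u is very close to 0 or 1.
        samples.append(-math.log(-math.log(u + eps) + eps))
    return samples


noise = gumbel_noise_sketch(5, rng=random.Random(0))
```

Passing a seeded `random.Random` plays the role of the `generator` argument: the same seed reproduces the same noise.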

make_masked_msa(batch, config, replace_fraction, seed, eps=1e-06)

Create data for BERT on raw MSA.

make_msa_profile(batch)

Compute the MSA profile.
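As an illustration of what a profile is, the sketch below computes per-position residue frequencies over all sequences of a small integer-encoded MSA. It assumes every sequence has equal weight and ignores any masking the module may apply; `msa_profile_sketch` and `num_classes` are hypothetical names.

```python
def msa_profile_sketch(msa, num_classes):
    """Return per-position residue frequencies, shape [num_res][num_classes]."""
    num_seq, num_res = len(msa), len(msa[0])
    profile = [[0.0] * num_classes for _ in range(num_res)]
    for seq in msa:
        for pos, residue in enumerate(seq):
            # Each sequence contributes an equal share to its residue type.
            profile[pos][residue] += 1.0 / num_seq
    return profile


toy_msa = [[0, 1, 2], [0, 1, 1], [0, 2, 2], [3, 1, 2]]
profile = msa_profile_sketch(toy_msa, num_classes=4)
# profile[0] is [0.75, 0.0, 0.0, 0.25]
```

Every row of the result sums to 1, so each position carries a distribution over residue types.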

nearest_neighbor_clusters(batch, gap_agreement_weight=0.0)

Assign each extra MSA sequence to its nearest neighbor in sampled MSA.
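A minimal sketch of nearest-neighbor assignment, assuming similarity is the number of positions at which two sequences agree. It ignores `gap_agreement_weight` and any masks, so it is only an approximation of the idea; `nearest_cluster_sketch` is a hypothetical name.

```python
def nearest_cluster_sketch(sampled_msa, extra_msa):
    """For each extra sequence, return the index of the most similar sampled one."""

    def agreement(a, b):
        # Number of positions at which the two sequences carry the same residue.
        return sum(x == y for x, y in zip(a, b))

    return [
        max(range(len(sampled_msa)), key=lambda c: agreement(sampled_msa[c], seq))
        for seq in extra_msa
    ]


sampled = [[0, 1, 2, 3], [4, 4, 4, 4]]
extra = [[0, 1, 2, 4], [4, 4, 0, 4], [0, 4, 4, 4]]
assignment = nearest_cluster_sketch(sampled, extra)
# assignment is [0, 1, 1]
```

Ties go to the lowest cluster index in this sketch, since `max` returns the first maximal element.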

randint(lower, upper, generator, device)

random_crop_to_size(protein, crop_size, max_templates, shape_schema, spatial_crop_prob, interface_threshold, subsample_templates=False, seed=None)

Crop randomly to crop_size, or keep as is if shorter than that.
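The crop-or-keep rule can be sketched for the simplest case, a single contiguous window over one chain. Spatial cropping, multi-chain handling, and template subsampling are left out, and `random_crop_idx_sketch` is a hypothetical helper, not a function of this module.

```python
import random


def random_crop_idx_sketch(num_res, crop_size, seed=None):
    """Return the residue indices kept by a random contiguous crop."""
    if num_res <= crop_size:
        # Shorter than the crop size: keep everything as is.
        return list(range(num_res))
    rng = random.Random(seed)
    # randint is inclusive on both ends, so the window always fits.
    start = rng.randint(0, num_res - crop_size)
    return list(range(start, start + crop_size))


short = random_crop_idx_sketch(5, crop_size=10)
cropped = random_crop_idx_sketch(100, crop_size=16, seed=0)
```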

sample_msa(batch, max_seq, max_extra_msa_seq, seed, inf=1000000.0)

Sample MSA randomly; the remaining sequences are stored as extra_*.

Parameters:
  • batch – batch to sample msa from.

  • max_seq – number of sequences to sample.

Returns:

Protein with sampled msa.
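The split into sampled and extra sequences can be sketched as a shuffle followed by two slices. This is an illustration under assumptions: rows are sampled uniformly, the first row (the query sequence) is always kept in front, and the output keys `msa` and `extra_msa` are chosen for the example. `sample_msa_sketch` is a hypothetical name.

```python
import random


def sample_msa_sketch(msa, max_seq, max_extra_msa_seq, seed=None):
    """Split MSA rows into a random sample and the remaining 'extra' rows."""
    rng = random.Random(seed)
    rest = list(range(1, len(msa)))
    rng.shuffle(rest)
    order = [0] + rest  # assumption: row 0 (the query) always stays first
    selected = order[:max_seq]
    extra = order[max_seq:max_seq + max_extra_msa_seq]
    return {
        "msa": [msa[i] for i in selected],
        "extra_msa": [msa[i] for i in extra],
    }


toy_msa = [[i] * 3 for i in range(6)]  # six sequences of length three
out = sample_msa_sketch(toy_msa, max_seq=2, max_extra_msa_seq=3, seed=0)
```

No row appears in both outputs, and rows beyond `max_seq + max_extra_msa_seq` are dropped.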