Designing real novel proteins using deep graph neural networks

Alexey Strokach1,2, David Becerra2, Carles Corbi2, Albert Perez-Riba2, Philip M. Kim1,2,3

1 Department of Computer Science, University of Toronto.
2 Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto.
3 Department of Molecular Genetics, University of Toronto.


Introduction

Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. Despite substantial advances, precisely designing sequences that fold into a predetermined shape (the “protein design” problem) remains difficult. We show that a deep graph neural network, ProteinSolver, can solve protein design phrased as a constraint satisfaction problem (CSP). To sidestep the considerable issue of optimizing the network architecture, we first develop a network that can accurately solve the related and more straightforward problem of Sudoku puzzles. Recognizing that each protein design CSP has many solutions, we then train this network on millions of real protein sequences corresponding to thousands of protein structures. We show that our method rapidly designs novel protein sequences, and we perform a variety of in silico and in vitro validations suggesting that our designed proteins adopt the predetermined structures.

Source code: https://gitlab.com/ostrokach/proteinsolver.

Network architecture

Figure 1. ProteinSolver network architecture.
  • The inputs to the network are a set of node attributes and a set of edge attributes describing interactions between pairs of nodes.
  • The node and edge attributes are embedded in an $m$-dimensional space using linear transformations or a multi-layer perceptron.
  • The node and edge attributes are passed through $N$ residual edge convolution and aggregation (ETA) blocks:
    • In the convolution step, we update the edge attributes using a modified version of the edge convolution layer [2], which takes as input a concatenation of node and edge attributes and returns an update to the edge attributes. In our work, $g_{\mathbf{\Theta}}^{n}$ is a multi-layer perceptron, although other neural network architectures are possible.
    • In the aggregation step, we update node attributes using an aggregation over transformed edge attributes incident on every node. In our work, $h_{\mathbf{\Theta}}^{n}$ is a learned linear transformation, although other neural network architectures, including attention layers, are possible (see [3]).
  • By training the network to reconstruct masked or missing node and edge attributes, we can learn embeddings that generalize to a variety of domains.
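The convolution and aggregation steps above can be sketched in plain numpy. This is a minimal illustration, assuming a single-layer ReLU perceptron for $g_{\mathbf{\Theta}}^{n}$ (weights `W_g`, `b_g`) and a learned linear map `W_h` for $h_{\mathbf{\Theta}}^{n}$; the function and variable names are ours, and the residual connections and sum aggregation are simplifying assumptions rather than the exact published architecture:

```python
import numpy as np

def eta_block(x, e, edge_index, W_g, b_g, W_h):
    """One residual edge-convolution-and-aggregation (ETA) block, sketched.

    x: (n, m) node attributes; e: (E, m) edge attributes;
    edge_index: (2, E) source/destination node indices, one column per edge.
    """
    src, dst = edge_index
    # Convolution step: update each edge from the concatenation of its
    # endpoint node attributes and its current edge attributes (g_Theta).
    z = np.concatenate([x[src], x[dst], e], axis=1)   # (E, 3m)
    e_new = e + np.maximum(z @ W_g + b_g, 0.0)        # residual + ReLU MLP
    # Aggregation step: update each node by summing a linear transform
    # (h_Theta) of the updated attributes of its incident edges.
    agg = np.zeros_like(x)
    np.add.at(agg, dst, e_new @ W_h)                  # scatter-sum per node
    return x + agg, e_new                             # residual node update
```

Stacking $N$ such blocks, each with its own parameters, yields the residual tower shown in Figure 1.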

References

[1]. Park, Kyubyong. "Can Convolutional Neural Networks Crack Sudoku Puzzles?" (2018).

[2]. Wang, Yue, et al. "Dynamic Graph CNN for Learning on Point Clouds." (2019).

[3]. Ingraham, John, et al. "Generative Models for Graph-Based Protein Design." (2019).


Training the network to solve Sudoku puzzles

Figure 2. Training a ProteinSolver network to solve Sudoku puzzles. Node attributes encode the numbers provided in the starting Sudoku grid. Edge attributes encode the presence of a constraint between two nodes (i.e. that a given pair of nodes cannot be assigned the same number).
Datasets

The training dataset comprises 80 million computer-generated Sudoku puzzles. The validation dataset comprises 1,000 computer-generated Sudoku puzzles that are not present in the training dataset. The test dataset comprises 30 Sudoku puzzles extracted from https://1sudoku.com [1].

Training

The network is trained to solve Sudoku puzzles by minimizing the cross-entropy loss between network outputs and correct solutions (Figure 2).
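A minimal sketch of this loss, assuming the network emits per-cell logits over the nine digits (whether the loss is averaged over all cells or only over the initially unfilled ones is a training detail we do not reproduce here):

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy between per-cell network outputs and the solution.

    logits: (n_cells, 9) unnormalized scores; targets: (n_cells,) digits 0..8.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

With uniform (all-zero) logits over 9 classes, the loss equals log 9, the entropy of a random guess; training drives it toward zero.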

Evaluation

A trained network achieves over 85% accuracy on the validation dataset and over 95% accuracy on the test dataset when predictions are filled in incrementally, one most-confident cell at a time (Figure 3, Table 1).

Conclusion

Using Sudoku puzzles as a toy constraint satisfaction problem (CSP) allows us to quickly evaluate and optimize different neural network architectures for solving CSPs.

Figure 3. Evaluating the ability of a ProteinSolver network to solve Sudoku puzzles. (A) Training and validation accuracy of a ProteinSolver network as it is being trained to solve Sudoku puzzles. (B-C) Accuracy achieved by a trained network on the validation dataset, comprising 1000 Sudoku puzzles generated in the same way as the training dataset (B), and on the test dataset, comprising 30 Sudoku puzzles extracted from an online Sudoku puzzle provider (C). Predictions were made either in a single pass through the network (blue bars) or incrementally, running the network repeatedly and each time accepting only the single prediction in which the network is most confident (red bars).
Table 1. Comparing the accuracy achieved by the ProteinSolver network trained to solve Sudoku puzzles to existing methods. "(inc.)" columns report incremental predictions, in which the network is run repeatedly and only the single most confident prediction is accepted in each pass.

Method          Validation   Validation (inc.)   Test     Test (inc.)
CNN [1]         NA           NA                  NA       85.8%
ProteinSolver   72.2%        87.5%               83.4%    97.6%
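The incremental prediction scheme behind the "(inc.)" results — run the network repeatedly, each time accepting only the single prediction in which it is most confident — can be sketched generically. The `predict` interface and the placeholder handling below are our assumptions:

```python
import numpy as np

def incremental_decode(predict, values, missing):
    """Fill missing entries one at a time, most confident first.

    predict: callable mapping the current assignment (n,) to (n, n_classes)
             class probabilities; values: (n,) int array with placeholders
             at missing positions; missing: (n,) boolean mask.
    """
    values, missing = values.copy(), missing.copy()
    while missing.any():
        probs = predict(values)
        # Confidence of each still-missing position; filled ones are excluded.
        conf = np.where(missing, probs.max(axis=1), -np.inf)
        i = int(np.argmax(conf))            # most confident missing position
        values[i] = int(np.argmax(probs[i]))
        missing[i] = False                  # re-run the network with it fixed
    return values
```

Because every newly fixed value is fed back into the next pass, later predictions can condition on earlier ones, which is what lifts the "(inc.)" columns above the single-pass accuracies.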

Training the network to reconstruct protein sequences

Figure 4. Training a ProteinSolver network to reconstruct protein sequences. Node attributes encode the identities of individual amino acids. Edge attributes encode Euclidean distances between amino acids and the relative positions of those amino acids along the amino acid chain.
Datasets

The training dataset comprises 72 million amino acid sequences classified into 1,062 Gene3D domain families. The validation and test datasets comprise 10,000 sequences classified into 132 Gene3D domain families that are distinct from the families in the training dataset. We annotated each sequence with structural information from the closest structural template in the PDB.
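A minimal sketch of how node and edge inputs of the kind described in Figure 4 could be derived from a structural template's C-alpha coordinates. The 12 Å distance cutoff and the exact edge features here are illustrative assumptions, not the values used by ProteinSolver:

```python
import numpy as np

def protein_graph(ca_coords, cutoff=12.0):
    """Build edge inputs from C-alpha coordinates.

    ca_coords: (n, 3) coordinates in angstroms. Edges connect residue pairs
    closer than `cutoff`; each edge carries the Euclidean distance and the
    separation of the two residues along the amino acid chain.
    """
    n = len(ca_coords)
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)                  # (n, n) distances
    src, dst = np.where((dist < cutoff) & ~np.eye(n, dtype=bool))
    edge_attr = np.stack(
        [dist[src, dst], np.abs(src - dst).astype(float)], axis=1
    )
    return np.stack([src, dst]), edge_attr
```

The resulting `(edge_index, edge_attr)` pair plays the same role as the Sudoku constraint edges, except that the constraints here are soft geometric relationships rather than hard inequality constraints.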

Training

The network is trained to reconstruct protein sequences by marking approximately half of the amino acids in each input sequence as missing and minimizing the cross-entropy loss between network predictions and the identities of the missing amino acids (Figure 4).
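A minimal numpy sketch of this masked-reconstruction objective. The unknown-token encoding and the way the ~50% masking rate is drawn are our assumptions:

```python
import numpy as np

UNK = 20  # hypothetical "unknown" token appended to the 20 amino acid types

def mask_sequence(seq, frac=0.5, seed=0):
    """Mark roughly `frac` of the residues in `seq` (ints 0..19) as missing."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(seq)) < frac
    masked = seq.copy()
    masked[mask] = UNK
    return masked, mask

def masked_cross_entropy(logits, seq, mask):
    """Cross-entropy between predictions and the true identities of the
    masked amino acids only. logits: (n, 20); seq: (n,) in 0..19."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[mask, seq[mask]].mean())
```

Only the masked positions contribute to the loss, so the network must infer the missing identities from the surrounding sequence and the geometric edge attributes.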

Evaluation

A trained network is able to generate amino acid sequences, using solely geometric information from a homologous protein, with ~27% accuracy (Figure 5). Furthermore, a trained network assigns higher probability to amino acid variants that produce a more stable protein (Figure 6).

Conclusion

A ProteinSolver network trained to reconstruct protein sequences identifies important structural features and learns an embedding useful for various applications.

Figure 5. Evaluating the ability of a ProteinSolver network to reconstruct protein sequences. (A) Training and validation accuracy of a ProteinSolver network as it is being trained to reconstruct amino acid sequences. (B) Accuracy achieved by a trained network on the test dataset. Predictions were made either in a single pass through the network (blue bars) or incrementally, running the network repeatedly and each time accepting only the single prediction in which the network is most confident (red bars). (C) Sequence identity between generated sequences and the reference sequence as different fractions of amino acids are made available to the model.
Figure 6. Evaluating the accuracy of a ProteinSolver network using independent test datasets. (A) Correlations between predictions made by a trained ProteinSolver network and the changes in the Gibbs free energy of folding associated with mutations (ΔΔG). (B) Correlations between predictions made by a trained ProteinSolver network and changes in protein stability measured using a trypsin and chymotrypsin resistance assay.

Designing new protein sequences

We used a trained ProteinSolver network to generate sequences matching several predetermined geometries. Generated sequences were evaluated using an array of computational techniques, and circular dichroism spectra were experimentally obtained for several of those proteins (Figures 7 and 8). All lines of evidence suggest that the generated sequences fold into stable proteins with expected three-dimensional shapes.
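One simple way to triage a large pool of generated sequences, consistent with the residue-probability analysis in Figure 7B though not necessarily the exact procedure used here, is to rank candidates by the mean log-probability the network assigns to their residues:

```python
import numpy as np

def score_sequence(probs, seq):
    """Mean per-residue log-probability of a candidate sequence.

    probs: (n, 20) per-residue probabilities output by the network for the
    target geometry; seq: (n,) residue indices of the candidate. Higher
    scores indicate sequences the network considers more compatible with
    the geometric constraints.
    """
    return float(np.log(probs[np.arange(len(seq)), seq]).mean())
```

Candidates with the highest scores would then be passed to downstream checks such as homology modelling, secondary structure prediction, and molecular dynamics, as in Figures 7 and 8.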

Figure 7. Computational and experimental validation of sequences generated to match the architecture of a serum albumin. (A) Adjacency matrix of the reference structure provided to the network. (B) Correlation between residue probabilities and sequence identity to the reference structure for ~2 million generated sequences. (C) Distribution of Modeller MolPDF scores for homology models constructed for a subset of generated sequences. (D) Sequence logo showing the conservation of residues in the generated sequences. (E) Secondary structure logo showing predicted secondary structures for a subset of generated sequences. (F) Structure of the reference protein (white) overlaid with the structure of a model produced for one of the generated sequences using QUARK, a de novo structural modelling tool (blue). (G) Average residue fluctuation in 100 ns molecular dynamics simulations of the reference structure and two homology models of generated sequences. (H) Circular dichroism spectra of the reference protein (black) and a generated protein (red).
Figure 8. Computational and experimental validation of sequences generated to match the architecture of an alanine racemase. The descriptions of individual subplots are analogous to those in Figure 7.