Designing real novel proteins using deep graph neural networks

Alexey Strokach1,2, David Becerra2, Carles Corbi2, Albert Perez-Riba2, Philip M. Kim1,2,3

1 Department of Computer Science, University of Toronto. 2 Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto. 3 Department of Molecular Genetics, University of Toronto.

Introduction

Protein structure and function are determined by the arrangement of the linear amino acid sequence in 3D space. Despite substantial advances, precisely designing sequences that fold into a predetermined shape (the “protein design” problem) remains difficult. We show that a deep graph neural network, ProteinSolver, can solve protein design phrased as a constraint satisfaction problem (CSP). To sidestep the considerable issue of optimizing the network architecture, we first develop a network that accurately solves the related but straightforward problem of Sudoku puzzles. Recognizing that each protein design CSP has many solutions, we then train this network on millions of real protein sequences corresponding to thousands of protein structures. We show that our method rapidly designs novel protein sequences, and we perform a variety of in silico and in vitro validations suggesting that our designed proteins adopt the predetermined structures.

Source code: https://gitlab.com/ostrokach/proteinsolver.

Network architecture

• The inputs to the network are a set of node attributes and a set of edge attributes describing interactions between pairs of nodes.
• The node and edge attributes are embedded in an $m$-dimensional space using linear transformations or a multi-layer perceptron.
• The node and edge attributes are passed through $N$ residual edge convolution and aggregation (ETA) blocks:
• In the convolution step, we update the edge attributes using a modified version of the edge convolution layer [2], which takes as input a concatenation of node and edge attributes and returns an update to the edge attributes. In our work, $g_{\mathbf{\Theta}}^{n}$ is a multi-layer perceptron, although other neural network architectures are possible.
• In the aggregation step, we update node attributes using an aggregation over transformed edge attributes incident on every node. In our work, $h_{\mathbf{\Theta}}^{n}$ is a learned linear transformation, although other neural network architectures, including attention layers, are possible (see [3]).
• By training the network to reconstruct masked or missing node and edge attributes, we can learn embeddings that generalize to a variety of domains.
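The convolution and aggregation steps above can be sketched in plain numpy. In this sketch, the two-layer perceptron, mean aggregation, residual connections, and all parameter shapes are illustrative assumptions, not the published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    # Two-layer perceptron with ReLU, standing in for g_theta.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def eta_block(node_attr, edge_attr, edge_index, params):
    """One edge-convolution-and-aggregation step (residual form).

    node_attr:  (n_nodes, m) node embeddings
    edge_attr:  (n_edges, m) edge embeddings
    edge_index: (2, n_edges) source/target node indices
    """
    src, dst = edge_index
    # Convolution: update each edge from the concatenation [x_src, x_dst, e].
    edge_in = np.concatenate([node_attr[src], node_attr[dst], edge_attr], axis=1)
    edge_new = edge_attr + mlp(edge_in, *params["edge"])       # residual update
    # Aggregation: mean of linearly transformed (h_theta) incident edges.
    msg = edge_new @ params["node_W"]
    agg = np.zeros_like(node_attr)
    cnt = np.zeros((node_attr.shape[0], 1))
    np.add.at(agg, dst, msg)
    np.add.at(cnt, dst, 1.0)
    node_new = node_attr + agg / np.maximum(cnt, 1.0)          # residual update
    return node_new, edge_new

m = 8  # embedding dimension (illustrative)
params = {
    "edge": (rng.normal(size=(3 * m, 16)), np.zeros(16),
             rng.normal(size=(16, m)), np.zeros(m)),
    "node_W": rng.normal(size=(m, m)),
}
x = rng.normal(size=(5, m))                    # 5 nodes
ei = np.array([[0, 1, 2, 3], [1, 2, 3, 4]])    # 4 directed edges
e = rng.normal(size=(4, m))
x2, e2 = eta_block(x, e, ei, params)
```

Stacking N such blocks, each with its own parameters, yields the residual message-passing trunk described above.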

References

[1]. Park, Kyubyong. "Can Convolutional Neural Networks Crack Sudoku Puzzles?" (2018).

[2]. Wang, Yue, et al. "Dynamic graph cnn for learning on point clouds." (2019).

[3]. Ingraham, John, et al. "Generative Models for Graph-Based Protein Design." (2019).

Training the network to solve Sudoku puzzles

Datasets

The training dataset comprises 80 million computer-generated Sudoku puzzles. The validation dataset comprises 1000 computer-generated puzzles that are not present in the training dataset. The test dataset comprises 30 Sudoku puzzles extracted from https://1sudoku.com [1].
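A Sudoku puzzle maps naturally onto a graph: one node per cell, with an edge between every pair of cells constrained to hold different digits (same row, column, or 3×3 box). The encoding below is one natural choice; the exact input format used by the network may differ:

```python
def sudoku_edges():
    """Undirected edges between every pair of mutually constrained cells."""
    edges = set()
    for a in range(81):
        for b in range(a + 1, 81):
            ra, ca = divmod(a, 9)
            rb, cb = divmod(b, 9)
            same_row = ra == rb
            same_col = ca == cb
            same_box = (ra // 3, ca // 3) == (rb // 3, cb // 3)
            if same_row or same_col or same_box:
                edges.add((a, b))
    return sorted(edges)

edges = sudoku_edges()
# Each cell is constrained against 8 + 8 + 4 = 20 others, so 81 * 20 / 2 = 810 edges.
```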

Training

The network is trained to solve Sudoku puzzles by minimizing the cross-entropy loss between network outputs and correct solutions (Figure 2).
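A minimal numpy version of the loss is sketched below. Whether the loss covers all cells or only the initially-empty ones is an assumption here; the sketch restricts it to a boolean mask:

```python
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over positions where mask is True.

    logits:  (n_cells, 9) raw network outputs, one score per digit
    targets: (n_cells,)   correct digits, encoded 0-8
    mask:    (n_cells,)   True for the cells being predicted
    """
    z = logits - logits.max(axis=1, keepdims=True)           # numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(targets)), targets]
    return nll[mask].mean()

# Sanity check: uniform logits give a loss of ln 9 ~ 2.197 per cell.
logits = np.zeros((81, 9))
targets = np.zeros(81, dtype=int)
mask = np.ones(81, dtype=bool)
loss = masked_cross_entropy(logits, targets, mask)
```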

Evaluation

A trained network achieves over 85% accuracy on the validation dataset and over 95% accuracy on the test dataset when puzzles are solved incrementally (Figure 3; see also the “(inc.)” columns in Table 1).

Conclusion

Using Sudoku puzzles as a toy constraint satisfaction problem (CSP) allows us to quickly evaluate and optimize different neural network architectures for solving CSPs.

Table 1. Accuracy achieved by the ProteinSolver network trained to solve Sudoku puzzles, compared to an existing method. “(inc.)” columns report accuracy when puzzles are solved incrementally.

Method        | Validation | Validation (inc.) | Test  | Test (inc.)
--------------|------------|-------------------|-------|------------
CNN [1]       | NA         | NA                | NA    | 85.8%
ProteinSolver | 72.2%      | 87.5%             | 83.4% | 97.6%

Training the network to reconstruct protein sequences

Datasets

The training dataset comprises 72 million amino acid sequences classified into 1062 Gene3D domain families. The validation and test datasets comprise 10,000 sequences classified into 132 Gene3D domain families that are distinct from the families in the training dataset. We annotated each sequence with structural information using the closest structural template in the PDB.

Training

The network is trained to reconstruct protein sequences by marking approximately half of the amino acids in each input sequence as missing and minimizing the cross-entropy loss between network predictions and the identities of the missing amino acids (Figure 4).
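The masking step can be sketched as follows. The "?" mask token and the helper name are hypothetical, and the real pipeline operates on tokenized tensors rather than strings:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "?"  # placeholder mask token (illustrative choice)

def mask_sequence(seq, frac=0.5, seed=None):
    """Mark ~frac of the residues as missing; return the corrupted sequence
    plus the indices whose identities the network must reconstruct."""
    rng = random.Random(seed)
    n_mask = round(len(seq) * frac)
    idx = sorted(rng.sample(range(len(seq)), n_mask))
    corrupted = list(seq)
    for i in idx:
        corrupted[i] = MASK
    return "".join(corrupted), idx

corrupted, idx = mask_sequence("ACDEFGHIKLMNPQRSTVWY", frac=0.5, seed=0)
```

The cross-entropy loss is then computed only at the returned indices, exactly as for the empty Sudoku cells.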

Evaluation

A trained network is able to generate amino acid sequences, using solely geometric information from a homologous protein, with ~27% accuracy (Figure 5). Furthermore, a trained network assigns higher probabilities to amino acid variants that produce a more stable protein (Figure 6).
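The accuracy figure above can be read as per-position sequence recovery: the fraction of positions at which the generated residue matches the native one. A minimal implementation (the helper name is ours):

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed sequence matches the
    native sequence of equal length."""
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)
```

For reference, two unrelated sequences drawn from 20 amino acids would agree at roughly 5% of positions by chance, so ~27% recovery is well above random.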

Conclusion

A ProteinSolver network trained to reconstruct protein sequences identifies important structural features and learns an embedding useful for various applications.

Designing new protein sequences

We used a trained ProteinSolver network to generate sequences matching several predetermined geometries. Generated sequences were evaluated using an array of computational techniques, and circular dichroism spectra were experimentally obtained for several of those proteins (Figures 7 and 8). All lines of evidence suggest that the generated sequences fold into stable proteins with expected three-dimensional shapes.
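One way to realize sequence generation, consistent with the incremental solving used for Sudoku, is greedy decoding: repeatedly run the network and commit its single most confident prediction until the sequence is complete. The sketch below assumes this strategy and uses a toy stand-in for the trained network; the actual decoding procedure may differ:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design_sequence(predict_fn, length):
    """Greedy incremental design. `predict_fn` maps a partial sequence
    (list with None at undetermined positions) to per-position probability
    distributions over the 20 amino acids; it stands in for a trained network."""
    seq = [None] * length
    while None in seq:
        probs = predict_fn(seq)                       # (length, 20)
        empty = [i for i, s in enumerate(seq) if s is None]
        i = max(empty, key=lambda j: probs[j].max())  # most confident position
        seq[i] = AMINO_ACIDS[int(probs[i].argmax())]
    return "".join(seq)

def toy_predict(seq, rng=np.random.default_rng(0)):
    # Toy stand-in network: strongly prefers alanine, uniform noise elsewhere.
    p = rng.random((len(seq), 20))
    p[:, 0] += 1.0
    return p / p.sum(axis=1, keepdims=True)

designed = design_sequence(toy_predict, 12)
```

Because each step conditions on the residues already committed, the procedure can propagate constraints through the structure graph in the same way the Sudoku solver fills in cells.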