Protein structure and function are determined by the arrangement of the linear sequence of amino acids in 3D space. Despite substantial advances, precisely designing sequences that fold into a predetermined shape (the “protein design” problem) remains difficult. We show that a deep graph neural network, ProteinSolver, can solve protein design phrased as a constraint satisfaction problem (CSP). To sidestep the considerable issue of optimizing the network architecture, we first develop a network that is able to accurately solve the related and more straightforward problem of Sudoku puzzles. Recognizing that each protein design CSP has many solutions, we train this network on millions of real protein sequences corresponding to thousands of protein structures. We show that our method rapidly designs novel protein sequences, and we perform a variety of in silico and in vitro validations suggesting that our designed proteins adopt the predetermined structures.
Source code: https://gitlab.com/ostrokach/proteinsolver.
[1]. Park, Kyubyong. "Can Convolutional Neural Networks Crack Sudoku Puzzles?" (2018).
[2]. Wang, Yue, et al. "Dynamic graph cnn for learning on point clouds." (2019).
[3]. Ingraham, John, et al. "Generative Models for Graph-Based Protein Design." (2019).
The training dataset consists of 80 million computer-generated Sudoku puzzles. The validation dataset consists of 1000 computer-generated Sudoku puzzles that are not present in the training dataset. The test dataset consists of 30 Sudoku puzzles extracted from https://1sudoku.com [1].
The network is trained to solve Sudoku puzzles by minimizing the cross-entropy loss between network outputs and correct solutions (Figure 2).
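The loss described above can be illustrated with a minimal sketch. It assumes the network emits one vector of nine scores (logits) per cell and that the loss is averaged over cells; the function names `softmax` and `sudoku_cross_entropy` are hypothetical and are not taken from the ProteinSolver code base.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sudoku_cross_entropy(logits_per_cell, solution):
    """Mean cross-entropy between per-cell predictions and the true digits.

    logits_per_cell: one list of 9 scores per cell (assumed output shape).
    solution: the correct digit (1..9) for each cell.
    """
    losses = []
    for logits, digit in zip(logits_per_cell, solution):
        probs = softmax(logits)
        # Penalize low probability assigned to the correct digit.
        losses.append(-math.log(probs[digit - 1]))
    return sum(losses) / len(losses)
```

An untrained network that spreads its probability evenly over the nine digits would score ln(9) per cell under this definition; training pushes the value toward zero.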
A trained network achieves over 85% accuracy on the validation dataset and over 95% accuracy on the test dataset (Figure 3).
Using Sudoku puzzles as a toy constraint satisfaction problem (CSP) allows us to quickly evaluate and optimize different neural network architectures for solving CSPs.
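To make the CSP framing concrete, the sketch below encodes a Sudoku grid as a graph: each of the 81 cells is a node, and an edge joins two cells that share a row, column, or 3x3 box and therefore must hold different digits. This is one plausible encoding written for illustration; the helper names are hypothetical, and the graph representation actually used by ProteinSolver may differ.

```python
from itertools import combinations


def sudoku_edges():
    """Undirected edges between cells that constrain each other."""
    edges = set()
    for a, b in combinations(range(81), 2):
        row_a, col_a = divmod(a, 9)
        row_b, col_b = divmod(b, 9)
        same_box = (row_a // 3, col_a // 3) == (row_b // 3, col_b // 3)
        if row_a == row_b or col_a == col_b or same_box:
            edges.add((a, b))
    return edges


def is_valid_solution(grid, edges):
    """A flat 81-digit grid is valid if no edge joins two equal digits."""
    if len(grid) != 81 or any(d not in range(1, 10) for d in grid):
        return False
    return all(grid[a] != grid[b] for a, b in edges)
```

Every cell has 20 neighbors in this graph (8 in its row, 8 in its column, and 4 more in its box), which gives 810 undirected edges in total.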
Method | Validation | Validation (inc.) | Test | Test (inc.) |
---|---|---|---|---|
CNN [1] | NA | NA | NA | 85.8% |
ProteinSolver | 72.2% | 87.5% | 83.4% | 97.6% |
The training dataset consists of 72 million amino acid sequences classified into 1062 Gene3D domain families. The validation and test datasets consist of 10,000 sequences classified into 132 Gene3D domain families that are distinct from the families in the training dataset. We annotated each sequence with structural information using the closest structural template in the PDB.
The network is trained to reconstruct protein sequences by marking approximately half of the amino acids in each input sequence as missing and minimizing the cross-entropy loss between network predictions and the identities of the missing amino acids (Figure 4).
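The masking objective can be sketched as follows. The example assumes that each residue is masked independently with probability 0.5 (which masks approximately half of the sequence) and that the loss is averaged over masked positions only; both choices, along with the names `mask_sequence` and `masked_cross_entropy`, are illustrative assumptions rather than details confirmed by the source.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def mask_sequence(sequence, mask_fraction=0.5, rng=None):
    """Replace roughly `mask_fraction` of residues with None (missing)."""
    if rng is None:
        rng = random.Random()
    positions = [i for i in range(len(sequence)) if rng.random() < mask_fraction]
    hidden = set(positions)
    masked = [None if i in hidden else aa for i, aa in enumerate(sequence)]
    return masked, positions


def masked_cross_entropy(predicted_probs, sequence, positions):
    """Mean cross-entropy over the masked positions only.

    predicted_probs[i] maps each amino acid letter to its predicted
    probability at position i (assumed output format).
    """
    if not positions:
        return 0.0
    total = sum(-math.log(predicted_probs[i][sequence[i]]) for i in positions)
    return total / len(positions)
```

Restricting the loss to masked positions means the network is only rewarded for inferring residues it could not see in its input.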
A trained network is able to generate amino acid sequences, using solely geometric information from a homologous protein, with ~27% accuracy (Figure 5). Furthermore, a trained network assigns higher probability to amino acid variants that produce a more stable protein (Figure 6).
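The two quantities mentioned above can be computed as in the sketch below. It assumes that accuracy means the fraction of positions at which the generated residue matches the reference sequence, and that a variant is scored by the log-ratio of the probabilities the network assigns to the mutant and wild-type residues; the source does not spell out either definition, so both functions are hypothetical illustrations.

```python
import math


def sequence_recovery(designed, reference):
    """Fraction of positions where the designed residue matches the reference."""
    if len(designed) != len(reference):
        raise ValueError("sequences must have the same length")
    matches = sum(d == r for d, r in zip(designed, reference))
    return matches / len(reference)


def variant_log_odds(position_probs, wild_type, mutant):
    """Log-ratio of mutant to wild-type probability at one position.

    position_probs maps amino acid letters to predicted probabilities
    (assumed output format). Positive values mean the network prefers
    the mutant residue over the wild type.
    """
    return math.log(position_probs[mutant]) - math.log(position_probs[wild_type])
```

For example, `sequence_recovery("ACDE", "ACDF")` is 0.75, because three of the four positions agree.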
A ProteinSolver network trained to reconstruct protein sequences identifies important structural features and learns an embedding useful for various applications.
We used a trained ProteinSolver network to generate sequences matching several predetermined geometries. Generated sequences were evaluated using an array of computational techniques, and circular dichroism spectra were experimentally obtained for several of those proteins (Figures 7 and 8). All lines of evidence suggest that the generated sequences fold into stable proteins with expected three-dimensional shapes.