Methods in Protein Engineering

Directed Evolution


We mimic natural selection to develop proteins with novel and advanced properties. Through iterative rounds of mutation and screening or selection, we traverse fitness landscapes to find optimal proteins for user-defined goals.

Directed evolution circumvents our profound ignorance of how a protein’s sequence encodes its function by using iterative rounds of random mutation and artificial selection to discover new and useful proteins. Proteins can be tuned to adapt to new functions or environments by simple adaptive walks involving small numbers of mutations. Directed evolution studies have shown how rapidly some proteins can evolve under strong selection pressures and, because the entire ‘fossil record’ of evolutionary intermediates is available for detailed study, they have provided new insight into the relationship between sequence and function. Directed evolution has also shown how mutations that are functionally neutral can set the stage for further adaptation.
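The mutate-and-select loop described above can be sketched in a few lines. This is a toy illustration, not our laboratory protocol: `toy_fitness` is a hypothetical stand-in for an experimental screen (it simply counts matches to an arbitrary target sequence), and the parameters `rounds`, `library_size`, and `n_mutations` are assumptions chosen for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(seq, target):
    """Hypothetical stand-in for a lab screen: fraction of positions matching a target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def mutate(seq, rng, n_mutations=1):
    """Return a copy of seq carrying n_mutations random point substitutions."""
    residues = list(seq)
    for pos in rng.sample(range(len(residues)), n_mutations):
        residues[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != residues[pos]])
    return "".join(residues)

def directed_evolution(parent, fitness, rounds=10, library_size=50, seed=0):
    """Adaptive walk: each round, screen a mutant library and keep the best variant.

    Returns the full lineage of (sequence, fitness) pairs, one per round,
    i.e. the 'fossil record' of evolutionary intermediates.
    """
    rng = random.Random(seed)
    lineage = [(parent, fitness(parent))]
    for _ in range(rounds):
        current, current_fit = lineage[-1]
        # Mutation step: build a library of single mutants of the current parent.
        library = [mutate(current, rng) for _ in range(library_size)]
        # Screening step: score every variant and pick the top performer.
        best_fit, best = max((fitness(v), v) for v in library)
        # Selection step: only an improved variant becomes the next parent.
        if best_fit > current_fit:
            lineage.append((best, best_fit))
        else:
            lineage.append((current, current_fit))
    return lineage

if __name__ == "__main__":
    target = "MKTAYIAKQR"  # arbitrary illustrative target
    for seq, fit in directed_evolution("A" * 10, lambda s: toy_fitness(s, target)):
        print(seq, round(fit, 2))
```

Because the whole lineage is returned, every intermediate along the walk remains available for inspection, mirroring the point that the evolutionary record is preserved for study.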

Machine Learning

We use data-driven, statistical algorithms to engineer proteins for unrivaled, novel properties.


While directed evolution is a powerful tool for obtaining desired protein properties, it can be resource-intensive. Screening and selection ignore information from all but the highest-performing variants. Machine-learning methods allow us to efficiently explore sequence space by learning patterns from each round of evolution to guide subsequent rounds.

We have applied Gaussian process regression and classification to optimize properties in two highly divergent systems: cytochrome P450s and channelrhodopsins. The properties we engineered are difficult to improve by directed evolution alone because they are difficult to screen. By using Gaussian processes to model the fitness landscape, we navigated the landscape efficiently and discovered improved variants. Using this method, we improved thermostability, ligand binding, and activity in P450s. More recently, we designed a library of channelrhodopsins that is enriched in variants that localize correctly to the plasma membrane in mammalian cells. We are continuing to explore new machine-learning methods and applications for statistically directed evolution.
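To make the idea of modeling a fitness landscape with a Gaussian process concrete, here is a minimal sketch in plain Python. It is not our published implementation: the Hamming-distance kernel, the `length_scale` and `noise` values, and the upper-confidence-bound ranking in `rank_candidates` are all assumptions chosen for illustration.

```python
import math

def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def kernel(a, b, length_scale=2.0):
    """Assumed kernel: similarity decays exponentially with Hamming distance."""
    return math.exp(-hamming(a, b) / length_scale)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_fit(train_seqs, train_y, noise=1e-3):
    """Precompute the quantities needed for GP regression predictions."""
    mean_y = sum(train_y) / len(train_y)
    n = len(train_seqs)
    K = [[kernel(train_seqs[i], train_seqs[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, [y - mean_y for y in train_y])
    return {"seqs": list(train_seqs), "K": K, "alpha": alpha, "mean_y": mean_y}

def gp_predict(model, seq):
    """Posterior mean and variance of the modeled property for one sequence."""
    k_star = [kernel(seq, s) for s in model["seqs"]]
    mean = model["mean_y"] + sum(k * a for k, a in zip(k_star, model["alpha"]))
    v = solve(model["K"], k_star)
    var = max(kernel(seq, seq) - sum(k * vi for k, vi in zip(k_star, v)), 0.0)
    return mean, var

def rank_candidates(model, candidates, beta=1.0):
    """Rank untested sequences by an upper confidence bound (an assumed criterion)."""
    def ucb(seq):
        mean, var = gp_predict(model, seq)
        return mean + beta * math.sqrt(var)
    return sorted(candidates, key=ucb, reverse=True)
```

Unlike a screen that keeps only the top variant, the model is trained on every measured sequence, including poor performers, and its predicted variance indicates where the landscape is still unexplored.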


Crystallography

Crystallography enables structural characterization of proteins, providing molecular insights and guiding protein design.

We use X-ray crystallography to structurally characterize the proteins we have engineered. We can visualize protein-subunit interfaces involved in activity regulation, active site organization of our enzymes, and substrate and cofactor binding sites. Visualizing our advanced protein variants at the molecular level tells the story behind beneficial mutations. These crystal structures provide the foundation of our protein design efforts.

Check out our structures!


Structure-Guided Recombination

We have developed structure-guided recombination methods to create novel, highly functional protein diversity. 

We are trying to understand the benefits of recombination (sex) in evolution. We also want to understand how to use it efficiently to make new proteins with new features and functions. Sex in the test tube is not limited to two parents, nor to sequences from the same species. We can recombine 32 parents, or sequences from monkeys and worms. We want to understand the rules for molecular sex: how to do it, what it can make, and what we can learn from it. We have observed, for example, that sex in the test tube is an innovation-generating machine.

Homologous recombination is remarkably efficient at searching sequence space for functional proteins. It has a good chance of creating functional proteins for two reasons: homologous substitutions are, on average, less disruptive than random substitutions, and swapping blocks of sequence among related proteins is itself conservative. Chimeric proteins inherit the best and worst residues the parents have to offer, in new combinations that are not observed in nature. This leads to functional innovation.

We have developed computational tools that use protein structure information to design chimeric proteins and libraries of such proteins. These libraries are extremely diverse, with members that differ by tens or even hundreds of mutations while still maintaining a high proportion of sequences that fold and function. These chimeric proteins can be more stable than any of their parents. They can also catalyze reactions better than their parents, or even reactions their parents do not catalyze. We have also discovered that recombination leads to simplified (additive) sequence-function relationships that can be exploited to predict useful new sequences based on data from a small sampling of chimeras.

SCHEMA Recombination

Homologous recombination means swapping pieces of protein (blocks) between a set of homologs (parental proteins). The goal of site-directed SCHEMA recombination is to simultaneously maximize the mutation level of the chimeras and the probability that the chimeric proteins will fold and function. We do this by minimizing the number of structural contacts that are disrupted when portions of sequence are inherited from different parent proteins. Using SCHEMA, we have made functional chimeras from parents sharing as little as 30% sequence identity. Guided by structural information, we have designed and constructed recombination libraries of a variety of proteins, including beta-lactamases, arginases, cytochrome P450s, GH48 cellulases, GH6 cellulases, GH7 cellulases, and channelrhodopsins.
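The two quantities being traded off, structural disruption and mutation level, can be sketched as follows. This is an illustrative simplification rather than our design software: it assumes the parents are already aligned to equal length, that `contacts` is a precomputed list of residue-index pairs in contact in a reference structure, and that a contact counts as disrupted when the residue pair in the chimera does not occur at those positions in any parent. The names `block_of` and `block_parents` are hypothetical.

```python
def build_chimera(parents, block_of, block_parents):
    """Assemble a chimera: residue i is taken from parent block_parents[block_of[i]]."""
    return "".join(parents[block_parents[block_of[i]]][i] for i in range(len(block_of)))

def schema_disruption(parents, contacts, block_of, block_parents):
    """Count structural contacts disrupted by recombination.

    parents       -- aligned parent sequences of equal length
    contacts      -- (i, j) residue-index pairs in contact in the structure
    block_of      -- block index for each residue position
    block_parents -- parent index chosen for each block
    """
    chimera = build_chimera(parents, block_of, block_parents)
    broken = 0
    for i, j in contacts:
        # Contacts within sequence inherited from one parent are preserved.
        if block_parents[block_of[i]] == block_parents[block_of[j]]:
            continue
        # Otherwise the contact survives only if some parent has the same pair.
        pair = (chimera[i], chimera[j])
        if all((p[i], p[j]) != pair for p in parents):
            broken += 1
    return broken

def mutation_level(parents, chimera):
    """Number of substitutions separating the chimera from its closest parent."""
    return min(sum(a != b for a, b in zip(chimera, p)) for p in parents)

if __name__ == "__main__":
    parents = ["MKLVAG", "MRLIAS"]          # toy aligned parents
    contacts = [(1, 3), (0, 4), (2, 5), (1, 2)]
    block_of = [0, 0, 0, 1, 1, 1]           # two blocks of three residues
    chimera = build_chimera(parents, block_of, [0, 1])
    print(chimera, schema_disruption(parents, contacts, block_of, [0, 1]),
          mutation_level(parents, chimera))
```

Because `block_of` assigns a block to each residue individually, the same functions also score designs whose blocks are not contiguous in the primary sequence. Choosing block boundaries then amounts to searching over `block_of` for assignments that keep the disruption low while keeping the mutation level high.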

We have discovered that the recombination fitness landscape has a large additive component, which enables us to use simple linear regression models built from small data sets to predict highly stable chimera sequences. Homologous recombination thus gives us the opportunity to create and study a large number of functional enzymes whose properties vary significantly. With empirical models, we can accurately predict some of these properties and use these predictions to search for improved enzymes. We can also identify the sequence basis for variations in function.
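An additive model of this kind can be sketched as a linear regression in which each block inherited from each parent contributes a fixed amount to the measured property. The code below is a minimal illustration under stated assumptions, not our analysis pipeline: parent 0 is arbitrarily used as the reference, a tiny `ridge` term is added only to keep the normal equations solvable, and any measurements passed in are hypothetical.

```python
def encode(block_parents, n_parents):
    """Features: an intercept plus one indicator per (block, non-reference parent)."""
    features = [1.0]
    for parent in block_parents:
        features.extend(1.0 if parent == p else 0.0 for p in range(1, n_parents))
    return features

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_additive_model(chimeras, measurements, n_parents, ridge=1e-8):
    """Least-squares fit of block contributions via the normal equations."""
    X = [encode(c, n_parents) for c in chimeras]
    d = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) + (ridge if a == b else 0.0)
            for b in range(d)] for a in range(d)]
    Xty = [sum(row[a] * y for row, y in zip(X, measurements)) for a in range(d)]
    return solve(XtX, Xty)

def predict(weights, block_parents, n_parents):
    """Predicted property of a chimera described by its per-block parent choices."""
    return sum(w * f for w, f in zip(weights, encode(block_parents, n_parents)))
```

Here each chimera is described by a list such as `[0, 1, 1]`, giving the parent chosen for each block. The number of parameters grows with the number of blocks times the number of parents, while the number of possible chimeras grows exponentially with the number of blocks, which is why a small sample can be enough to predict across the whole library when the landscape is largely additive.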

Non-Contiguous Recombination


We have extended our recombination design tools to include libraries where the blocks are not necessarily contiguous in the primary sequence. Although not contiguous along the polypeptide chain, the blocks are contiguous in the folded 3-D structure of the protein. Non-contiguous recombination further reduces structural disruption, because important contacts between residues that are not adjacent in the protein chain can be preserved. We expect that this will allow us to design chimeras and chimera libraries using more distantly related parent proteins, further increasing the diversity of chimera progeny.