Roland L Dunbrack, Jr, PhD
Office Phone: 215-728-2434
Statistical Analysis of Protein Structures
We have several ongoing projects in structural bioinformatics, whose purpose is twofold: understanding the determinants of the structures and dynamics of proteins and protein complexes; and providing information that is useful for improving structure prediction.
1. Backbone-Dependent Rotamer Library
We are continuing to develop our backbone-dependent rotamer library, which expresses the frequency of side-chain rotamers as a function of the backbone dihedral angles phi and psi. The most recent publicly available version of the rotamer library was developed using Bayesian statistical analysis of a set of 850 protein chains. Our rotamer libraries are used in most side-chain conformation prediction programs as well as most protein design methods.
More recently, we have been developing a new backbone-dependent rotamer library whose rotamer probabilities, mean dihedral angles, and variances vary smoothly with the backbone dihedral angles. This rotamer library has been developed with nonparametric statistics, using kernel density estimates and kernel regressions. It is designed primarily for use in programs that require smoothly varying functions for local minimization, such as Rosetta. It will be available in the near future (Shapovalov and Dunbrack, in preparation).
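As a rough illustration of how kernel weighting can give rotamer probabilities that vary smoothly with phi and psi, the sketch below computes kernel-weighted rotamer frequencies at a query backbone conformation. It is not the method of the library itself: the periodic von Mises-style kernel, the concentration parameter `kappa`, the function names, and the rotamer labels are all assumptions chosen for illustration.

```python
import math

def periodic_kernel(delta_deg, kappa):
    """Von Mises-style kernel on an angle difference in degrees (assumed form).

    Left unnormalized because the normalizer cancels in the ratio below.
    """
    return math.exp(kappa * (math.cos(math.radians(delta_deg)) - 1.0))

def rotamer_probabilities(phi, psi, observations, kappa=20.0):
    """Kernel-weighted rotamer frequencies at the query (phi, psi).

    observations: list of (phi_deg, psi_deg, rotamer_label) tuples.
    Returns a dict mapping rotamer label -> probability (sums to 1).
    """
    weights = {}
    for obs_phi, obs_psi, rotamer in observations:
        w = periodic_kernel(phi - obs_phi, kappa) * periodic_kernel(psi - obs_psi, kappa)
        weights[rotamer] = weights.get(rotamer, 0.0) + w
    total = sum(weights.values())
    if total == 0.0:
        return {}
    return {rotamer: w / total for rotamer, w in weights.items()}

# Hypothetical observations, for illustration only.
observations = [(-60.0, -40.0, "g+"), (-60.0, -40.0, "t"), (-120.0, 130.0, "g-")]
probs = rotamer_probabilities(-60.0, -40.0, observations)
```

Because the kernel depends only on the cosine of the angle difference, the estimate is continuous across the -180/180 boundary, which a simple binned histogram does not guarantee.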
2. Neighbor-Dependent Ramachandran Maps
Statistics of protein backbone conformations have been studied for over 40 years. While many studies have been presented, only a handful of distributions are publicly available for use in structure validation and structure prediction methods. The available distributions differ in a number of important ways, which determine their usefulness for various purposes. These include: 1) input data size and criteria for structure inclusion (resolution, R-factor, etc.); 2) filtering of suspect conformations and outliers using B-factors or other features; 3) secondary structure of input data (e.g., whether helix and sheet are included; whether beta turns are included); 4) the method used for determining probability densities, ranging from simple histograms to modern nonparametric density estimation; and 5) whether they include nearest-neighbor effects on the distribution of conformations in different regions of the Ramachandran map. We have developed Ramachandran probability distributions for residues in protein loops from a high-resolution data set with filtering based on calculated electron densities. Maps for all 20 amino acids (with cis and trans proline treated separately) have been determined, as well as 420 left-neighbor and 420 right-neighbor dependent maps. The neighbor-independent and neighbor-dependent maps have been estimated using Bayesian nonparametric statistical analysis based on the Dirichlet process. In particular, we used hierarchical Dirichlet process priors, which allow sharing of information between maps for a particular residue type and different neighbor residue types. The resulting maps were tested in a loop modeling benchmark with the program Rosetta, and were shown to improve protein loop conformation prediction significantly. The distributions will be made available at http://dunbrack.fccc.edu/nmhrcm upon publication (Ting et al., submitted).
3. Protein Complexes
Much of the software and results from molecular modeling have focused on the prediction of the structure of a single protein molecule. Proteins act on other molecules, including other proteins, DNA, and ligands or substrates, and our recent focus has been on developing methods for predicting these kinds of structures. The rapidly growing number of Protein Data Bank (PDB) entries with increasing complexity and diversity provides rich information for structure prediction and modeling of protein interactions.
Many proteins function as homo-oligomers and are regulated via their oligomeric state. For some proteins, the stoichiometry of homo-oligomeric states under various conditions has been studied using gel filtration or analytical ultracentrifugation experiments. The interfaces involved in these assemblies may be identified using cross-linking and mass spectrometry, solution-state NMR, and other experiments. However, for most proteins, the actual interfaces that are involved in oligomerization are inferred from X-ray crystallographic structures using assumptions about interface surface areas and physical properties. Examination of interfaces across different Protein Data Bank (PDB) entries in a protein family reveals several important features. First, similarities in space group, asymmetric unit size, and cell dimensions and angles (within 1%) do not guarantee that two crystals are actually the same crystal form, containing similar relative orientations and interactions within the crystal. Conversely, two crystals in different space groups may be quite similar in terms of all the interfaces within each crystal. Second, NMR structures and an existing benchmark of PDB crystallographic entries consisting of 126 dimers as well as larger structures and 132 monomers were used to determine whether the existence or lack of common interfaces across multiple crystal forms can be used to predict whether a protein is an oligomer or not. Monomeric proteins tend to have common interfaces across only a minority of crystal forms, whereas higher-order structures exhibit common interfaces across a majority of available crystal forms. The data can be used to estimate the probability that an interface is biological if two or more crystal forms are available.
Finally, the Protein Interfaces, Surfaces, and Assemblies (PISA) database available from the European Bioinformatics Institute identifies interfaces observed in many crystal forms more consistently than the PDB or the European Bioinformatics Institute's Protein Quaternary Server (PQS). The PDB, in particular, is missing highly likely biological interfaces in its biological unit files for about 10% of PDB entries.
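The common-interface idea can be sketched in a few lines: for each interface, count the fraction of available crystal forms in which it appears. This is only an illustrative sketch. It assumes that crystal forms have already been identified and that similar interfaces have already been clustered and given shared IDs (both are the hard parts, and are not shown), and the simple majority cutoff stands in for the probability estimate described above. All names and IDs are hypothetical.

```python
from collections import Counter

def interface_fractions(crystal_forms):
    """Fraction of crystal forms in which each interface cluster is observed.

    crystal_forms: dict mapping crystal-form ID -> set of interface cluster IDs.
    """
    n_forms = len(crystal_forms)
    if n_forms == 0:
        return {}
    counts = Counter()
    for interfaces in crystal_forms.values():
        counts.update(set(interfaces))  # count each interface once per form
    return {iface: n / n_forms for iface, n in counts.items()}

def majority_interfaces(crystal_forms):
    """Interfaces seen in more than half of the crystal forms (assumed cutoff)."""
    fractions = interface_fractions(crystal_forms)
    return sorted(iface for iface, frac in fractions.items() if frac > 0.5)

# Hypothetical family with four crystal forms and three interface clusters.
forms = {
    "CF1": {"A", "B"},
    "CF2": {"A"},
    "CF3": {"A", "C"},
    "CF4": {"C"},
}
fractions = interface_fractions(forms)
common = majority_interfaces(forms)
```

In this toy example, interface A occurs in three of four crystal forms and is the only candidate for a biological interface under the majority rule.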
Programs for Protein Structure Prediction
We have developed several programs for protein structure prediction, and made these publicly available via our website.
Determination of side-chain conformations is an important step in protein structure prediction and protein design. Many such methods have been presented, although only a small number are in widespread use. SCWRL is one such method, and the SCWRL3 program (2003) has remained popular because of its speed, accuracy, and ease of use for the purpose of homology modeling. However, higher accuracy at comparable speed is desirable. This has been achieved in a new program, SCWRL4, through: (1) a new backbone-dependent rotamer library based on kernel density estimates (described above); (2) averaging over samples of conformations about the positions in the rotamer library; (3) a fast anisotropic hydrogen bonding function; (4) a short-range, soft van der Waals atom–atom interaction potential; (5) fast collision detection using k-discrete oriented polytopes; (6) a tree decomposition algorithm to solve the combinatorial problem; and (7) optimization of all parameters by determining the interaction graph within the crystal environment using symmetry operators of the crystallographic space group. Accuracies as a function of electron density of the side chains demonstrate that side chains with higher electron density are easier to predict than those with low electron density and presumed conformational disorder. For a testing set of 379 proteins, 86% of chi1 angles and 75% of chi1+chi2 angles are predicted correctly within 40° of the X-ray positions. Among side chains with higher electron density (25–100th percentile), these numbers rise to 89% and 80%. The new program maintains its simple command-line interface, designed for homology modeling, and is now available as a dynamically linked library for incorporation into other software programs.
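The accuracy measure quoted above can be made concrete with a short sketch. It assumes that a chi1 prediction counts as correct when it lies within the 40° tolerance of the X-ray value, and that chi1+chi2 is counted as correct only when both angles are within tolerance. The function names are hypothetical, and special cases such as symmetric side chains (where chi2 and chi2 + 180° are equivalent) are not handled.

```python
def angle_diff(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def chi_accuracy(predicted, observed, tolerance=40.0):
    """Fraction of residues with chi1, and with chi1 and chi2, within tolerance.

    predicted, observed: equal-length lists of (chi1, chi2) tuples in degrees;
    chi2 is None for side chains that have no chi2 angle.
    Returns (chi1_fraction, chi1_chi2_fraction); a fraction is None when
    there are no residues to score for it.
    """
    chi1_ok = chi1_total = chi12_ok = chi12_total = 0
    for (p1, p2), (o1, o2) in zip(predicted, observed):
        ok1 = angle_diff(p1, o1) <= tolerance
        chi1_total += 1
        chi1_ok += ok1
        if p2 is not None and o2 is not None:
            chi12_total += 1
            chi12_ok += ok1 and angle_diff(p2, o2) <= tolerance
    chi1_frac = chi1_ok / chi1_total if chi1_total else None
    chi12_frac = chi12_ok / chi12_total if chi12_total else None
    return chi1_frac, chi12_frac

# Hypothetical predictions and X-ray values for three residues.
predicted = [(-60.0, 180.0), (175.0, 60.0), (60.0, None)]
observed = [(-65.0, 170.0), (-170.0, -60.0), (-60.0, None)]
chi1_frac, chi12_frac = chi_accuracy(predicted, observed)
```

The periodic difference matters: 175° and -170° differ by 15°, not 345°, so the second residue's chi1 is counted as correct.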
Ongoing work in SCWRL4 is focused on two efforts. First, we have developed a method for predicting multiple side-chain conformations for each residue, and compared these predictions with electron density calculations (Shapovalov and Dunbrack, Proteins 2007). Second, we are incorporating protein design algorithms into SCWRL4 to achieve fast and hopefully useful design capability.
MolIDE 1.6 (Molecular Interactive Development Environment) provides a graphical interface for basic comparative (homology) modeling using SCWRL and other programs. MolIDE takes an input target sequence and uses PSI-BLAST to identify and align templates for comparative modeling of the target. The sequence alignment to any template can be manually modified in a graphical window that shows the target–template alignment and visualizes the alignment on the template structure. MolIDE builds the model of the target structure from the template backbone, using SCWRL to predict side-chain conformations and a loop-modeling program to build insertion–deletion regions over user-selected sequence segments.
We are currently working on a new version of MolIDE (MolIDE2) that will enable modeling of biologically relevant protein complexes, including homo-oligomers, heterodimers, and larger structures. The input can be more than one protein sequence, and the first step is to determine the protein domain content of the input sequences using PFAM. The program then searches the PDB for any structure with one or more of the PFAM families present in the query sequences, presenting the results with the largest overlap to the queries first. For instance, if the query consists of two proteins, protein complexes with homologues to both proteins will be presented first. With a few clicks, it is then possible to find all the biologically relevant complexes that can be built from the available templates, and to perform homology modeling from these complexes.
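The ordering of search results described above might look roughly like the following sketch, which ranks candidate template structures by how many query sequences share at least one domain family with them. This is not MolIDE2 code: the data structures, function name, and family and entry IDs are placeholders, and the domain assignments are assumed to have been computed already.

```python
def rank_templates(query_families, template_families):
    """Rank template entries by the number of query sequences they cover.

    query_families: list of sets of family IDs, one set per query sequence.
    template_families: dict mapping entry ID -> set of family IDs in that entry.
    Entries covering no query sequence are dropped; ties are broken by entry ID.
    """
    ranked = []
    for entry_id, families in template_families.items():
        covered = sum(1 for query in query_families if query & families)
        if covered:
            ranked.append((covered, entry_id))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [entry_id for _, entry_id in ranked]

# Placeholder IDs: two query proteins, four candidate structures.
queries = [{"famA"}, {"famB"}]
templates = {
    "entry1": {"famA"},
    "entry2": {"famA", "famB"},
    "entry3": {"famC"},
    "entry4": {"famB"},
}
order = rank_templates(queries, templates)
```

Here the structure containing homologues of both query proteins is listed first, followed by those that cover only one of them.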