COMPUTATIONAL STRUCTURAL BIOLOGY: STRUCTURE PREDICTION, ANALYSIS, AND BIOINFORMATICS



ROLAND L. DUNBRACK, Jr., Ph.D.,
Associate Member

JONATHAN W. ARTHUR, Ph.D., Postdoctoral Associate (from December 1998)
J. MICHAEL SAUDER, Ph.D., Postdoctoral Associate (from October 1998)
MATTHEW SAZINSKY, Student Assistant, Haverford College, Haverford PA (from June 1998)
JOHN BREYER, Student Assistant, Pennsylvania State University, Abington PA (May to September 1998)

The focus of our research is computational structural biology, including homology modeling, fold recognition, molecular dynamics simulations, statistical analysis of the Protein Data Bank (PDB), and bioinformatics. We are interested in understanding those factors that govern protein folding to give a stable structure, and the structural similarities and differences among evolutionarily related proteins. We are developing rigorously tested prediction methods, and are interested in applying these methods to problems in various areas of biology; these areas include the structural analysis of disease-causing mutations in human proteins, particularly the products of cancer-susceptibility genes.

PROTEIN FOLD RECOGNITION BY DETECTION OF REMOTE HOMOLOGUES. DUNBRACK, SAUDER, BREYER, SAZINSKY

Producing a prediction or model of a protein structure from its sequence requires a number of steps. The first step is the identification of a protein related by evolution whose structure is known. The three-dimensional coordinates of most protein structures that have been experimentally determined are available in the PDB, and may be accessed on the World Wide Web. The second step is the alignment of the query sequence (the protein of unknown structure) to the sequence and structure of the protein in the PDB identified in Step 1 (the parent structure). In practice, an initial alignment is produced during the identification step, and this subsequent alignment step comprises any manual or automatic refinements. The third step involves modeling of any insertions or deletions between the two sequences by adding or removing residues from the parent structure. The fourth step involves modeling of sidechain conformations for the query sequence on the template backbone produced from Step 3. In practice, Steps 3 and 4 may be performed together, since the loop conformation chosen must be compatible with reasonable sidechain conformations for the amino acids in the loop and those amino acids nearby. In previous work, we developed sidechain prediction methods (Step 4) for modeling sidechains on both self backbones and homologous protein backbones (Dunbrack and Karplus, J. Mol. Biol. 230:543, 1993; Bower et al., J. Mol. Biol. 267:1268, 1997). Our software program, SCWRL (Sidechains With a Rotamer Library), is currently licensed to over 220 laboratories around the world.

In our recent work, we have been developing methods to identify remote homologues in the PDB to sequences of unknown structure (Step 1). At times this identification is quite easy. If the proteins share significant sequence identity, usually greater than 30%, a simple BLAST (Basic Local Alignment Search Tool) search or other pairwise sequence comparison method suffices. If the sequence identity lies below the 30% level, however, more advanced computational methods are necessary to detect remote homologues and to judge the statistical significance of the match. The methods we are developing are based on exploiting relationships among many related sequences so that a connection between very distantly related sequences can be established; the relationships employed are illustrated in Figure 1.
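The 30% figure above refers to sequence identity measured over an alignment. As a minimal sketch, assuming identity is counted over aligned columns in which neither sequence has a gap (other conventions for the denominator exist), the calculation could look like the following. The function names and the toy sequences are illustrative and do not come from the group's software.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment,
    use '-' for gaps, and therefore have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip insertion/deletion columns
        aligned += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0


def needs_remote_homology_search(identity: float, threshold: float = 30.0) -> bool:
    """True when identity falls below the (adjustable) 30% rule of thumb."""
    return identity < threshold


if __name__ == "__main__":
    # Toy alignment, not real protein data.
    ident = percent_identity("ACDE-F", "ACQEGF")
    print(ident, needs_remote_homology_search(ident))
```

The threshold is left as a parameter because the 30% level is a rule of thumb, not a sharp boundary.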


FIGURE 1. Schematic relationships between a query sequence Q of unknown structure, its relatives (Q1, Q2, etc.), a profile built from an alignment of its relatives (Q-profile), and the sequence of a protein whose structure is known (P), its relatives (P1, P2, etc.), and a profile built from an alignment of its relatives (P-profile). The distance between two proteins/profiles represents schematically the evolutionary distance between the sequences at those points.

We start with the query sequence (Q), which we can compare directly with sequences in the PDB, and perhaps find a PDB sequence (P) that is related by homology (as determined by BLAST significance scores). If this does not succeed, our next step is to find relatives, Q1, Q2, Q3, etc., of Q in the GenBank non-redundant protein sequence database using the Position-Specific Iterated (PSI)-BLAST program developed by Altschul and colleagues (Altschul et al., Nucleic Acids Res. 25:3389, 1997). Any of these sequences can be compared to sequences in the PDB in hopes of locating a sequence, P, which is related to Q via the intermediate sequence, Qn. We are using the transitivity principle: if A is similar to B, and B is similar to C, then A is similar to C. PSI-BLAST can also be used to build a profile (Q-profile) of the protein family that contains information on sequence variation in the family at each position in the query sequence; a profile records the frequency of each of the 20 amino acids at each position along the sequence. This profile can be used to find more remote homologues of Q (i.e., Q7, Q8, etc.) that are more similar to the profile than they are to the initial query sequence. This process is iterative, since at each step new sequences can be added to previously found sequences to create a new profile. After several steps, this profile can be used to search a database of PDB sequences, which we derive ourselves from the PDB entry files, to find P. We can also compare the Q, Q-profile, and the Qn sequences to a database of all GenBank sequence relatives of PDB sequences (P1, P2, P3, etc.) that we have pre-compiled using PSI-BLAST. In sum, anything on the left-hand side of Figure 1 can be compared to anything on the right-hand side of Figure 1 to establish a set of links between Q and some sequence in the PDB, P.
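To make the idea of a profile concrete, the sketch below tabulates per-position amino acid frequencies from a small multiple alignment. It is a deliberate simplification: PSI-BLAST itself builds a position-specific scoring matrix with sequence weighting and pseudocounts, none of which is modeled here. The function name and the toy alignment are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs):
    """Return one dict per alignment column mapping amino acid -> frequency.

    Assumptions: all sequences are rows of one multiple alignment (equal
    length), gaps are '-', and gaps or unknown characters are ignored
    when computing frequencies.
    """
    if not aligned_seqs:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")

    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


if __name__ == "__main__":
    # Toy alignment of a query and three relatives (not real data).
    profile = build_profile(["ACD", "ACE", "A-D", "GCD"])
    for position, column in enumerate(profile, start=1):
        observed = {aa: f for aa, f in column.items() if f > 0}
        print(position, observed)
```

A sequence can then be scored against such a table position by position, which is what allows relatives that resemble the family as a whole, but not the original query, to be found.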

We have tested several of the possible schemes that can be derived from Figure 1 on the Structural Classification of Proteins (SCOP) database developed by Murzin and colleagues (Murzin et al., J. Mol. Biol. 247:536, 1995). The SCOP database links together proteins in the PDB by their structure and function to establish remote evolutionary relationships between different proteins. Proteins classified in the same "superfamily" are considered related by evolution, even though their sequence identities may be as low as 10%. In our tests, for proteins that share less than 40% sequence identity in the SCOP database, directly comparing Q with all the possible sequences, P, in the PDB detected only 6% of the possible relationships. Comparing Q-profile with P detected 14% of the possible relationships. Comparing Q-profile with all relatives of PDB sequences, a database of all Pn, detected 22% of all such relationships. Comparing all relatives of Q (i.e., Qn) against P detected 40% of such relationships, and therefore appeared to be the most successful strategy.
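The benchmark described above amounts to counting how many known superfamily relationships a given search scheme recovers. The sketch below illustrates one such scheme, linking Q to a PDB sequence through an intermediate relative Qn, together with the detection-rate calculation. The data structures (a dictionary of significant hits, sets of sequence identifiers) and all names are assumptions made for illustration; they are not the group's actual pipeline.

```python
def pdb_links_via_intermediates(query, pdb_ids, hits):
    """Return PDB sequences reachable from the query directly or in two hops.

    hits: dict mapping a sequence id to the set of ids it matches with a
    significant score (hypothetical precomputed search results).
    pdb_ids: set of ids that belong to sequences of known structure.
    """
    direct = hits.get(query, set())
    found = set(direct & pdb_ids)          # Q matches P directly
    for intermediate in direct - pdb_ids:  # Q matches Qn, Qn matches P
        found |= hits.get(intermediate, set()) & pdb_ids
    return found


def detection_rate(true_pairs, detected_pairs):
    """Percent of known related pairs that the scheme detected."""
    if not true_pairs:
        return 0.0
    return 100.0 * len(true_pairs & detected_pairs) / len(true_pairs)


if __name__ == "__main__":
    # Toy data: Q reaches P only through its relative Q1.
    hits = {"Q": {"Q1", "Q2"}, "Q1": {"P"}, "Q2": set()}
    linked = pdb_links_via_intermediates("Q", {"P", "P9"}, hits)
    detected = {("Q", p) for p in linked}
    truth = {("Q", "P"), ("Q", "P9")}
    print(linked, detection_rate(truth, detected))
```

In a real evaluation the set of true pairs would come from a structural classification such as SCOP, and false positives would need to be tracked alongside the detection rate.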

ALIGNING SEQUENCES TO HOMOLOGOUS PROTEIN STRUCTURES. DUNBRACK, ARTHUR, SAUDER

To produce a good model for a protein of unknown structure from a related protein in the PDB, we must align Q to the sequence and structure of the PDB protein; this sequence-structure alignment process is Step 2 in homology modeling. First, we measure how well our identification methods perform in terms of the quality of the sequence-structure alignments. Then, we can set about to improve these alignments using a variety of techniques.

The standard for good alignment of two sequences is the alignment that can be derived from the best superposition of the structures of the proteins. Such an alignment will usually overlap helices and sheet strands, leaving the many insertions and deletions in coil segments unaligned. We compared the abilities of pairwise (Q versus P) and profilewise (Q-profile versus P) sequence alignments to reproduce structure alignments. To our surprise, the two methods had broadly similar abilities to align sequences when compared with the structural alignments. For proteins with less than 30% identity, roughly 70% of residues are aligned correctly. However, the profilewise method produces significantly longer alignments, compared to the pairwise method. That is, a model based on the profilewise results would comprise a larger proportion of the query sequence than a pairwise model. In future work, we plan to alter the functionality of PSI-BLAST to include structural information in the alignments and compare the quality of these alignments with the profilewise benchmarks already obtained.
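One way to quantify how well a sequence alignment reproduces a structure-derived alignment is to compare the residue pairs each one matches. The sketch below assumes that "aligned correctly" means a query residue is paired with the same parent residue in both alignments, and that the percentage is taken over the pairs in the sequence alignment; the text does not specify either choice, so both are assumptions. Function names and sequences are illustrative.

```python
def aligned_pairs(gapped_query: str, gapped_parent: str):
    """Convert a gapped pairwise alignment into a set of (query_idx, parent_idx)."""
    if len(gapped_query) != len(gapped_parent):
        raise ValueError("aligned sequences must have equal length")
    pairs = set()
    qi = pi = 0  # running residue indices in the ungapped sequences
    for a, b in zip(gapped_query, gapped_parent):
        if a != "-" and b != "-":
            pairs.add((qi, pi))
        if a != "-":
            qi += 1
        if b != "-":
            pi += 1
    return pairs


def percent_correctly_aligned(sequence_alignment, structure_alignment) -> float:
    """Percent of sequence-alignment pairs also present in the structural alignment."""
    test = aligned_pairs(*sequence_alignment)
    reference = aligned_pairs(*structure_alignment)
    if not test:
        return 0.0
    return 100.0 * len(test & reference) / len(test)


if __name__ == "__main__":
    # Toy example: the sequence alignment misplaces a gap relative to the reference.
    structure_aln = ("ACDEF", "ACGEF")
    sequence_aln = ("ACDEF-", "AC-GEF")
    print(percent_correctly_aligned(sequence_aln, structure_aln))
```

Because the denominator is the length of the sequence alignment, this measure alone does not reward longer alignments; coverage of the query would be reported separately.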

PROTEIN STRUCTURES IN GENOMES. DUNBRACK, SAUDER, SAZINSKY

We used our protein fold identification methods to assign all possible sequences to different folds within the PDB for the protein sequences in four genomes: M. genitalium, E. coli, S. cerevisiae, and C. elegans. We were able to establish homologues for 37% of the Mycoplasma sequences, 28% of the E. coli sequences, 30% of the yeast genome, and 25% of the C. elegans genome. The mean sequence identity over these four genomes (i.e., between sequences in the genome and their nearest PDB relatives) was 27% and the median sequence identity was 22%. From these results, it is clear that the kind of modeling by homology that will most often be required in biological studies will be in the realm of fairly low sequence identity, 10-30%, and with numerous gaps and insertions in the sequence.

We have focused on the recently completed genome of C. elegans, since it is the first animal genome to be completed and the largest genome yet sequenced. The genome encodes 19,099 protein sequences. In Figure 2, we show the top four folds found in the genome. These four represent 5% of the sequences in the genome. The top 8, 28, and 57 folds represent 10%, 20%, and 25% of the genome.
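Coverage figures of this kind can be obtained by ranking folds by how many genome sequences are assigned to them and accumulating the counts. The following is a minimal sketch, assuming one fold label per assigned sequence and a known total number of sequences in the genome; the fold labels and counts in the example are placeholders, not results from the C. elegans analysis.

```python
from collections import Counter


def cumulative_fold_coverage(fold_assignments, genome_size):
    """Rank folds by frequency and report cumulative percent of the genome.

    fold_assignments: iterable with one fold label per assigned sequence.
    genome_size: total number of protein sequences in the genome,
    including those with no fold assignment.
    Returns a list of (rank, fold, cumulative_percent) tuples.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    counts = Counter(fold_assignments)
    coverage = []
    running = 0
    for rank, (fold, n) in enumerate(counts.most_common(), start=1):
        running += n
        coverage.append((rank, fold, 100.0 * running / genome_size))
    return coverage


if __name__ == "__main__":
    # Placeholder data: 10 assigned sequences in a 20-sequence "genome".
    assignments = ["fold_a"] * 5 + ["fold_b"] * 3 + ["fold_c"] * 2
    for rank, fold, pct in cumulative_fold_coverage(assignments, 20):
        print(rank, fold, pct)
```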



FIGURE 2. The four most common folds in the C. elegans genome as determined by sequence homology to proteins in the PDB.

STRUCTURE PREDICTION OF PROTEINS STUDIED IN OTHER FOX CHASE RESEARCH GROUPS. DUNBRACK, ARTHUR, SAUDER in collaboration with KRUGER,§ COLEMAN,§ GOLEMIS,§ JAFFE,§ YEUNG,§ KRUH,§ TAYLOR,§ BELLACOSA,§ YAN,§ GODWIN,§ BURCH,§ MOSS,§ WIEST,§ FENG,§ PATRIOTIS,§ HENSKE,§ STRICH§

We have used a number of protein structure prediction methods on proteins of interest to researchers at Fox Chase. We first attempted to find homologous proteins within the PDB, using our sequence-based methods described above. In cases where no homologue was found, we performed other calculations such as secondary structure prediction, coiled-coil region prediction, and trans-membrane segment prediction. These methods give us at least some idea of what the protein looks like, and may help a researcher to decide which residues might be important for function and, therefore, further experimental study. The three examples given below illustrate the usefulness of our methods.

Cystathionine β-synthase. Dr. Kruger studies the enzyme cystathionine β-synthase (CBS), a pyridoxal phosphate binding enzyme that catalyzes the synthesis of cystathionine from serine and homocysteine. High levels of homocysteine in patients with mutations in this enzyme have been linked to heart disease. Sequence searches of the GenBank protein sequence database using BLAST and PSI-BLAST reveal a statistically significant homology (19% sequence identity) between a large portion of the sequence of CBS and the beta chain of tryptophan synthase from Salmonella typhimurium. Of the inherited mutations that reduce the activity of the enzyme, a mutation that completely abrogates enzyme activity, G307S, is very close to the active site of the enzyme.

Cdc6. Cell division cycle (Cdc) 6, studied by Dr. Coleman, binds to DNA in the process of chromosome duplication prior to mitosis. Genetic analysis in yeast and Xenopus has indicated that Cdc6 homologues are essential proteins involved in an early stage of S-phase, that Cdc6 cannot bind DNA in the absence of the origin recognition complex, and that an essential component of the replication machinery, the minichromosome maintenance family (MCMs), cannot bind DNA in the absence of Cdc6. Our fold identification methods detected a weak similarity to the structure of the E. coli clamp-loader domain. This protein is stimulated by nonhydrolyzable ATP, while association with bacterial homologues of the MCMs requires ATP hydrolysis. By analogy, it appears that Cdc6 may load the MCM proteins onto DNA in a similar manner. Knowing the structure has allowed Dr. Coleman to engineer expression of the clamp-loader domain of Cdc6, in the absence of the remaining N-terminal and C-terminal domains of Cdc6.

Dim1. The S. pombe protein, Dim1 (defective entry into mitosis 1), was isolated by Berry and Gould (J. Cell Biol. 137:1337, 1997). Dim1 is a regulator of cell cycle progression, with a critical function in regulation of the mitotic spindle at the G2/M boundary. Dr. Golemis began work on the human homologue of Dim1 when she isolated this gene as an interacting partner for a mitotic-spindle associated form of human enhancer of filamentation 1 (HEF1). Our identification methods strongly predicted a fold for Dim1 similar to that of thioredoxin. The validity of the structure prediction is being tested by the determination of the structure of Dim1 by NMR in Dr. Roder's laboratory, and preliminary data on the secondary structure of Dim1 are highly consistent with the thioredoxin fold.

Other proteins. We have studied a number of proteins in collaboration with other research groups at Fox Chase, including porphobilinogen synthase (JAFFE), CEL1 endonuclease (YEUNG), the Arg oncogene (KRUH), the hepatitis delta antigen (TAYLOR), MED1 (BELLACOSA), FFA-1 (YAN), BRCA1 and SNF5 (GODWIN), retrotransposon-encoded proteins (BURCH), lin14 and lin46 from C. elegans (MOSS), the pre-TCR protein (WIEST), touvlo (FENG), TPL kinase (PATRIOTIS), tuberous-sclerosis associated proteins (HENSKE), and UME3 peptides (STRICH).

PREDICTIONS FOR THE THIRD MEETING ON THE CRITICAL ASSESSMENT OF PROTEIN STRUCTURE PREDICTION. DUNBRACK, BREYER

Published methods for protein structure prediction frequently suffer from being tested on sets of known protein structures that are too small. It is easy to choose, even unconsciously, test cases that behave better than the average test case. To counter this, we have generally chosen to test our methods on very large sets of proteins. Another way to counter this is to set up blind test cases, where the predictor does not know the answer beforehand. This is precisely the purpose of the Critical Assessment of Protein Structure Prediction (CASP) series of meetings, organized by John Moult (University of Maryland). Over the spring and summer of 1998, the sequences of 45 proteins whose structures were being experimentally determined were posted on the CASP3 website. Research groups from around the world were invited to make predictions of the structures and send these predictions to the CASP3 organizers by a certain deadline. After this deadline, the experimental structures were released. Experts in the field assessed the predictions, and predictors who had done well were asked to speak at a meeting of all predictors in early December 1998. Based on our results, the Dunbrack group was asked to make a presentation at the meeting. Figure 3 shows our results compared to the results of other research groups in terms of the success of our sequence-structure alignments and our sidechain conformation predictions. In 8 of 11 cases, our alignment prediction is in the top segment of predictions (Figure 3 left). In 3 cases, we are in the middle of the pack (T0055, T0076, T0082). For the sidechain predictions, in 9 of 10 predictions our results were at the top or very near the top prediction (Figure 3 right). The T0070 prediction is in the middle of the pack. In this case, no one achieved a sidechain prediction accuracy above 50%, which is astonishingly low. It should be noted that T0076 is an NMR structure and T0070 is described as "poorly refined" by the crystallographers. In both cases, it is likely that the sidechains are poorly placed, which is supported by the poor predictions of all participants in CASP3.

FIGURE 3. Results from the CASP3 predictions. Results from other research groups around the world are shown in open hexagons. Our results are shown in filled hexagons. Left. Comparison of percentage correct sequence-structure alignments for 11 CASP3 target proteins studied. Right. Comparison of percentage-correct sidechain χ1 rotamer predictions for 10 targets studied. In most cases, our results are significantly better than the median prediction.
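The sidechain comparison rests on deciding, residue by residue, whether a predicted χ1 dihedral angle matches the experimental one. The sketch below assumes a prediction counts as correct when it lies within 40 degrees of the experimental value, a commonly used tolerance; the criterion actually applied by the CASP3 assessors is not stated here, so the tolerance is left as a parameter. Function names and angles are illustrative.

```python
def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def chi1_percent_correct(predicted, experimental, tolerance: float = 40.0) -> float:
    """Percent of residues whose predicted chi1 is within `tolerance` degrees.

    Assumption: both lists hold chi1 angles in degrees for the same
    residues in the same order; the 40-degree default is a convention,
    not a value taken from the report.
    """
    if len(predicted) != len(experimental):
        raise ValueError("angle lists must have equal length")
    if not predicted:
        return 0.0
    correct = sum(
        1
        for p, e in zip(predicted, experimental)
        if angular_difference(p, e) <= tolerance
    )
    return 100.0 * correct / len(predicted)


if __name__ == "__main__":
    # Toy angles in degrees; note that 180 and -175 differ by only 5 degrees.
    predicted = [-60.0, 180.0, 60.0, -170.0]
    experimental = [-65.0, -175.0, 180.0, 175.0]
    print(chi1_percent_correct(predicted, experimental))
```

Handling the wrap-around at ±180 degrees matters in practice, since the trans rotamer sits exactly at that boundary.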

PUBLICATIONS

MACKERELL, A.D., Jr., BASHFORD, D., BELLOTT, M., DUNBRACK, R.L., Jr., EVANSECK, J., FIELD, M.J., FISCHER, S., GAO, J., GUO, H., HA, S., JOSEPH-McCARTHY, D., KUCHNIR, L., KUCZERA, K., LAU, F.T.K., MATTOS, C., MICHNICK, S., NGO, T., NGUYEN, D.T., PRODHOM, B., REIHER, W.E., ROUX, B., SCHLENKRICH, M., SMITH, J., STOTE, R., STRAUB, J., WATANABE, M., WIÓRKIEWICZ-KUCZERA, J., YIN, D., KARPLUS, M. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102:3586-3616, 1998.

WILLIAMS, M., LYU, M.-S., YANG, Y.-L., LIN, E.P., DUNBRACK, R.L., Jr., BIRREN, B., CUNNINGHAM, J., HUNTER, K. Ier5, a novel member of the slow-kinetics immediate-early genes. Genomics (in press).

§   Fox Chase researcher

Illustrations or unpublished data in these reports should not be used without permission of the author.


Fox Chase Cancer Center Scientific Report 1998