Alejandro Panjkovich & Xavier Daura
November 29, 2013
This documentation explains how to use the PARS web server, describes its input and results/output sections and includes a brief tutorial. A benchmark evaluating the performance of PARS on a set of 102 allosteric proteins is included as well.
Once the structure is uploaded to the server, a set of parameters is available and presented at the `Job parameters' page:
Huang and Schroeder, 2006] and these sites will then be analyzed in terms of their structural conservation and their potential to affect protein flexibility, because both measures have been found to be relevant to the characterization of protein allosteric sites [Panjkovich and Daura, 2012]. An experimentally solved protein structure may include small-molecule ligands, and you can query these positions as well by simply selecting the ligands in the `Job parameters' form. By default, all ligands are pre-selected. Normal mode analysis is executed for each ligand-protein complex, meaning that total job execution time will increase with the number of included ligands.
Huang and Schroeder, 2006] by default. You may turn off this option by unchecking the corresponding checkbox if, for example, you just want to focus on binding sites which are already occupied by cocrystallized ligands. Turning off this option will shorten execution time considerably.
Panjkovich and Daura, 2010]. Additionally, some of the evolutionary queries involved in this process are also used to predict active-site residues [Mistry et al., 2007]. If you already know the results of this analysis for your protein of interest, or if you are only interested in the dynamic aspects of the analysis, you may turn off the estimation of structural conservation.
Panjkovich and Daura, 2012], we found that around a third of the allosteric sites were not identified correctly as ligand-binding sites by LIGSITEcs [Huang and Schroeder, 2006]. This is understandable because allosteric sites, when compared to active sites or other primary ligand-binding sites, tend to have more planar shapes, which makes them harder to identify. Besides LIGSITEcs, there are multiple programs available that predict ligand-binding sites on protein structures, such as FPOCKET [Guilloux et al., 2009] or CONCAVITY [Capra et al., 2009], among many others. In general, these programs display similar performance, meaning that well-defined pockets are easily detected while difficult cases are often missed, as seen in large-scale comparisons [Chen et al., 2011, Schmidtke et al., 2010]. To partially overcome this limitation, you may use PARS to query alternative sites that are not already occupied by ligands or predicted by LIGSITEcs (e.g. sites predicted by your program of choice). To do so, you can mark those particular sites by including dummy molecules as HET/HETATM entries in the structure file before uploading it. They will appear in the ligands-selection table once the structure is uploaded.
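As an illustration of marking custom sites, the following minimal Python sketch appends one dummy HETATM record per site to a PDB file using the standard fixed-column PDB layout. The function names, the residue name `DUM`, the chain identifier `X`, the atom name `C1` and the carbon element are all assumptions chosen for this example; PARS does not document required values for dummy molecules, so adapt them as needed.

```python
def hetatm_line(serial, x, y, z, res_name="DUM", chain="X",
                res_seq=1, atom_name="C1", element="C"):
    """Format one HETATM record in fixed-column PDB layout (78 characters).

    The default names (DUM, X, C1, C) are hypothetical placeholders.
    """
    # Atom names of one-letter elements start in column 14, hence the leading space.
    name_field = f" {atom_name:<3s}" if len(atom_name) < 4 else atom_name
    return (
        f"HETATM{serial:>5d} {name_field} {res_name:>3s} {chain:1s}{res_seq:>4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2s}"
    )


def add_dummy_sites(pdb_text, centers):
    """Return pdb_text with one dummy HETATM per (x, y, z) center.

    New records are placed before a trailing END record if one exists.
    Assumes well-formed ATOM/HETATM lines with serials in columns 7-11.
    """
    lines = pdb_text.splitlines()
    serials = [int(l[6:11]) for l in lines if l.startswith(("ATOM", "HETATM"))]
    next_serial = max(serials, default=0) + 1
    new = [
        hetatm_line(next_serial + i, *center, res_seq=i + 1)
        for i, center in enumerate(centers)
    ]
    end_idx = next((i for i, l in enumerate(lines) if l.strip() == "END"), len(lines))
    return "\n".join(lines[:end_idx] + new + lines[end_idx:]) + "\n"
```

For example, `add_dummy_sites(open("protein.pdb").read(), [(12.5, -3.25, 40.0)])` would return the file contents with a single dummy site added at those coordinates (the file name is hypothetical).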
Eswar et al., 2008]. In order to use this feature, you will need to provide your own MODELLER License key, which is freely available for academics at the following URL: http://salilab.org/modeller/registration.html
Huang and Schroeder, 2006] are identified using CAV as hetID and Z as chain identifier. They are numbered according to volume in decreasing order, meaning that the largest cavity is number 1, cavity number 2 is the second largest, and so on. Cavities and ligands in the table are ranked by their potential as putative allosteric sites, based on the features described below. The higher the rank, the higher the chance that the site is an allosteric site.
The first column of values displays a p-value for that cavity, which is used to evaluate whether an observed change in overall protein flexibility upon ligand binding is statistically significant [Panjkovich and Daura, 2012]. Values equal to or below 0.05 are considered significant, i.e., the site is considered to have potential regulatory activity, and the number is highlighted in the table. The second column corresponds to the structural conservation estimated for that cavity. Structural conservation values correspond to the percentage of representative protein structures of a Pfam family in which a pocket was identified at the same position. For example, a structural conservation of 100% would mean that the pocket is found in all structures of that protein family [Panjkovich and Daura, 2010]. We have selected an arbitrary threshold of 50% to consider cavities as structurally conserved. For some protein families there is not enough structural information to perform the estimation. In those cases the structural conservation column is not displayed.
The results table can be downloaded in plain text format through a link provided at the bottom of the page.
Mistry et al., 2007]. If active site residues are predicted, they will be shown in white `sticks' representation on both the Jmol and PyMOL visualizations. Additionally, predicted cavities and/or cocrystallized ligands that are near such catalytic residues (average distance less than 9 Å) will be colored in white as well. Because such sites may bind substrates and/or orthosteric modulators, they are marked with the keyword CAT in the results table and penalized in the ranking to distinguish them from allosteric sites. This distinction improves the correct identification of experimentally known allosteric sites (see Benchmark section 6).
If the job involved the generation of homology models, another link will appear for you to download the details of that procedure, such as sequence alignments and evaluation scores generated by MODELLER [Eswar et al., 2008].
Panjkovich and Daura, 2012]. If you are interested in visualizing the actual normal modes of vibration that have been calculated for the submitted structure, please click on the `Compare normal modes visually' link available under `Further options' at the main results webpage.
The link will lead you to a new webpage where two Jmol displays are presented next to each other. Using the drop-down menus at the bottom of the Jmol windows, you can select which normal mode you want to display (the vibration will be animated by Jmol). You can activate the `NMA springs' option to observe the harmonic springs that are used to connect alpha carbons during the normal mode analysis.
In some cases there may be a clearly observable difference between apo and bound normal modes if, for example, the protein movement is restricted by the presence of the ligand, but this is not always straightforward to note visually. Furthermore, normal modes may switch order through different calculations (e.g. the normal mode 7 for an apo structure may correspond to the motion described by normal mode 8 on one of the ligand-bound structures). PARS overcomes this difficulty by transforming the lowest frequency normal modes (up to normal mode 20) into B-factors. Thus, it may be easier to understand the particular effect a site is exerting on protein flexibility upon ligand binding by looking at the B-factors that are derived from the NMA by PARS. You can do so through the utility that is described in the next section.
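For orientation, a standard textbook relation between coarse-grained normal modes and isotropic B-factors is sketched below. It is not taken from the PARS implementation, whose exact expression is not given in this documentation. The symbols are assumptions introduced here: $\mathbf{a}_{ik}$ is the displacement vector of alpha carbon $i$ in normal mode $k$, $\lambda_k$ is the corresponding eigenvalue of the Hessian (unit masses assumed), and the sum is assumed to start at mode 7 because the first six modes of an isolated molecule are rigid-body motions.

```latex
B_i = \frac{8\pi^2}{3}\,\langle \Delta r_i^2 \rangle ,
\qquad
\langle \Delta r_i^2 \rangle \approx k_B T \sum_{k=7}^{20} \frac{\lvert \mathbf{a}_{ik} \rvert^2}{\lambda_k}
```

Because each term is weighted by $1/\lambda_k$, the sum depends only on the set of modes included and not on their numbering, which is why a swap in mode order between the apo and bound calculations does not affect the derived B-factors.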
A new page will load displaying the results table as a guide. The protein structure is loaded in Jmol with a new coloring scheme. Below the Jmol display, a menu is available for you to select among different visualizations. The first option shows the protein structure and all the sites analyzed. Sites will be colored yellow for a non-significant effect on protein flexibility and red if a significant effect was detected by PARS. Next, NMA-derived B-factors for the apo or unbound protein structure can be visualized in temperature coloring (blue: low, red: high). The rest of the options correspond to the protein structure colored according to the difference between apo NMA-derived B-factors and the NMA-derived B-factors when a ligand is placed at the indicated site, with warmer colors corresponding to larger differences. This is helpful to understand which regions are affected by particular sites upon ligand binding, according to the NMA.
The second column in the table contains structural conservation values as percentages. Values equal to or higher than 50% are considered relevant, meaning that the site has been conserved during the evolution of this protein family. In our example with PDB entry 1CET, cavities CAV_2_Z, CAV_2_Z, CAV_4_Z and CAV_8_Z show relevant structural conservation values.
On the right, and if Java is working properly in your browser, the protein structure of L-lactate dehydrogenase is displayed along with the predicted cavities. A color scheme is applied to reflect the different results obtained for each site. Cocrystallized ligands (in full-atom ball-and-stick representation) and predicted cavities (displayed as dummy spheres at their geometric centers) are colored yellow by default.
In our example, CAV_4_Z is colored cyan because it is structurally conserved, but the PARS analysis did not consider this site relevant in terms of altering protein flexibility upon ligand binding. CAV_1_Z and CAV_8_Z have been colored white, because they are very close to predicted catalytic residues. Catalytic residues are also colored white and displayed using sticks representation.
CAV_2_Z is colored red because it shows relevant values for both structural conservation and flexibility (it is ranked first in the results table), meaning that it is the best candidate to be a regulatory or allosteric site. In fact, this is a correct prediction because the cavity matches the position of the known allosteric site of this protein.
2. PARS was developed to work on protein structures. However, if no structure is available for your protein of interest, the server will take the protein sequence as input and it will attempt to generate a structural model using the MODELLER software [Eswar et al., 2008]. This can only be done if appropriate structural templates are found, which is not always the case.
Once the structure is uploaded, or modeled by homology, the PARS algorithm is applied to characterize putative binding pockets regarding their potential as regulatory or allosteric sites. The details of the PARS algorithm are explained in the next section.
The procedure is as follows: (1) Initially the user uploads a protein-structure file (PDB format) and selects which chains and ligands should be considered for the calculations. (2) Once the job is submitted, the protein surface is analyzed to predict putative ligand-binding sites using the LIGSITEcs program [Huang and Schroeder, 2006]. (3) At each predicted position, a simplified representation of a small molecule is placed to simulate the presence of a ligand. We use an octahedron representation where the ligand's presence is simulated by a dummy atom positioned at the geometric center and six extra dummy atoms located at 4 Å distance from the center on both sides of each axis (i.e. forming the vertices of a regular octahedron). Alternatively, upon submission, the prediction of putative ligand-binding sites can be turned off if you are interested in scanning only sites which are already occupied by a ligand in the protein structure; this results in faster execution times. (4) Normal mode analysis (NMA) is carried out for the apo structure (without ligands) using a set of available programs [Tama et al., 2000, Delarue and Sanejouand, 2002]. (5) For each binding site, an NMA is executed for the protein-ligand complex. (6) Normal modes are transformed into B-factors and the Wilcoxon-Mann-Whitney test is applied to check for a significant difference in overall B-factors between the apo and ligand-bound states of the protein. If a significant difference is found (Wilcoxon-Mann-Whitney test p-value equal to or less than 0.05), the binding site is marked as potentially allosteric.
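Steps (3) and (6) can be illustrated with a short Python sketch. It is not the PARS implementation: the function names are hypothetical, the octahedron is assumed to be aligned with the coordinate axes, and the test shown is a two-sided Mann-Whitney test using the normal approximation with tie correction and no continuity correction, which may differ in detail from the statistical routine used by the server.

```python
import math


def octahedron_ligand(center, radius=4.0):
    """Return 7 dummy-atom positions: the center plus six octahedron vertices.

    Vertices lie at +/- radius (in Angstrom) along each coordinate axis.
    """
    atoms = [tuple(center)]
    for axis in range(3):
        for sign in (1.0, -1.0):
            p = list(center)
            p[axis] += sign * radius
            atoms.append(tuple(p))
    return atoms


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney p-value via the normal approximation."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))


def potentially_allosteric(apo_bfactors, bound_bfactors, alpha=0.05):
    """Apply the p-value threshold of 0.05 stated in the text."""
    return mann_whitney_p(apo_bfactors, bound_bfactors) <= alpha
```

Here `apo_bfactors` and `bound_bfactors` would be the per-residue NMA-derived B-factors of the apo and ligand-bound structures, respectively.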
Moreover, if enough structural data is available for the protein family, the structural conservation of each pocket is also measured.
We have previously described how allosteric sites may be structurally conserved within a protein family [Panjkovich and Daura, 2010] and that the incorporation of this measure improves the capacity of the method based on dynamics to correctly identify allosteric sites [Panjkovich and Daura, 2012].
Additionally, to complement the characterization of your protein of interest, PARS will attempt to predict active-site residues using an existing methodology [Mistry et al., 2007].
Tsai et al., 2009]. In addition, our method is based on a coarse-grained normal mode analysis, which is a harmonic approximation to flexibility based on a simplified representation of the protein molecule (i.e. alpha carbons only) [Tama et al., 2000]. Moreover, as mentioned in detail in the section regarding custom ligands and binding sites, the approach has limitations as to how often allosteric sites are identified correctly as ligand-binding sites at the initial steps of the procedure, independently of the background software used. Despite these limitations, the method can be used successfully to detect allosteric sites as illustrated by the benchmark described in the next section.
Panjkovich and Daura, 2012] to include the latest 2013 AlloSteric Database (ASD) entries [Huang et al., 2011]. Briefly, protein structures of at least 3.0 Å resolution that included the cocrystallized allosteric ligand were clustered at 30% sequence identity to avoid overrepresentation, yielding a manually curated selection of 102 structures.
1 shows the results for the different criteria and their combination (FS). Moreover, cavities are predicted by LIGSITEcs in descending volume or size, meaning that cavity number 1 is the largest, cavity 2 is the second largest and so on. As described previously, the order of the cavity may be important, given that usually the largest cavities match biologically relevant ligand-binding sites [Panjkovich and Daura, 2012]. This is why benchmark results are presented in three different groups, one includes all predicted cavities (up to 8 per structure, first four rows in Table 1), c123 includes only the three largest cavities (i.e. the default approach by LIGSITEcs) and finally c1 considers only the largest cavity (last four rows in Table 1). Cavities that are too close to predicted catalytic residues [Mistry et al., 2007] and may correspond to the active site are discarded.
Results in rows 1, 5 and 9 of Table 1 illustrate how difficult it is to identify the allosteric site based only on LIGSITEcs predictions. Incorporating structural conservation (S) or the flexibility criterion (F) increases accuracy, specificity and positive predictive value (PPV) in all cases, but sensitivity decreases. When looking at the first four rows of Table 1 (default behaviour of PARS), the combination of F and S criteria (FS, row 4) allows PPV to triple when compared to plain LIGSITEcs (row 1). Accuracy goes from 0.12 to 0.88 and specificity rises to 0.96 from the original 0.03 at the expense of sensitivity, which falls to 0.22. Numbers for the c123 and c1 sets follow the same trend, i.e. accuracy, specificity and PPV increase at the cost of sensitivity. This means that even if some allosteric sites may be missed by PARS, the chance that a positive prediction will be correct is very high if the site is detected by LIGSITEcs.
1 includes 27 proteins for which none of the 8 (total of 216) predicted cavities match the allosteric site. Identifying allosteric sites as ligand-binding sites is the initial step of PARS, required to further evaluate cavities in terms of structural conservation and flexibility effects. This is not trivial, as explained above (see Limitations, section 5.1).
A ligand-binding site or cavity which is not predicted by LIGSITEcs may nevertheless be occupied by a cocrystallized ligand or identified previously by the user using other software (e.g. FPOCKET [Guilloux et al., 2009] or CONCAVITY [Capra et al., 2009]). These cavities may also be selected (cocrystallized compounds) or introduced (cavities detected by different software) by the user for inclusion in the analysis, as explained in the section on custom ligands (1.5).
To illustrate the performance of PARS in such cases (i.e. the allosteric site is included among the set of cavities that will be evaluated), the 27 cases where no LIGSITEcs prediction matched the allosteric site were filtered out from the benchmark and the results are displayed in Table 2. As expected, PPV increases, reaching 0.79 in the best case.
Results on this table refer to the subset of 75 proteins for which LIGSITEcs was able to predict a ligand-binding pocket in the position of the allosteric site. TP: true positive; TN: true negative; FP: false positive; FN: false negative; PPV: positive predictive value. Sensitivity: TP/(TP+FN); specificity: TN/(TN+FP); accuracy: (TP+TN)/(TP+FN+TN+FP); PPV or precision: TP/(TP+FP). F corresponds to sets including a change in flexibility as selection criterion; S corresponds to sets including high structural conservation as selection criterion; c123 refers to sets considering only the three largest pockets predicted by LIGSITEcs; c1 refers to sets considering only the largest predicted pocket.
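The metric definitions given in the caption can be written as a small Python helper. The function name is hypothetical and the counts in the usage note are invented for illustration only; they are not benchmark results.

```python
def benchmark_metrics(tp, tn, fp, fn):
    """Compute the four metrics defined in the table caption.

    A metric is None when its denominator is zero.
    """
    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),          # TP/(TP+FN)
        "specificity": ratio(tn, tn + fp),          # TN/(TN+FP)
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "ppv": ratio(tp, tp + fp),                  # precision
    }
```

With illustrative counts such as `benchmark_metrics(tp=20, tn=500, fp=10, fn=70)`, sensitivity is low (20/90) while specificity is high (500/510), which is the kind of trade-off discussed for the combined F and S criteria.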