A comprehensive tool for rapid and accurate prediction of kinase-specific phosphorylation sites in the human proteome

The generic workflow of Quokka is shown in Fig. 1A. Predicting kinase-specific phosphorylation sites involves four steps. First, users provide Quokka with the proteins of interest in FASTA format. An ‘Example’ link is provided to illustrate the acceptable input format. To guarantee prediction efficiency, users can submit no more than 1,000 sequences at a time. Second, users select a specific kinase family for phosphorylation prediction. In total, Quokka provides 45 serine/threonine and 22 tyrosine kinase families. Third, users select the scoring function or model for Quokka to employ. Based on our experimental results (refer to the Supplementary Results), the logistic regression models performed best among all scoring functions. However, the sequence scoring functions usually return prediction results more rapidly than the logistic regression models. The choice of scoring function is therefore at the users’ discretion, based on their computational requirements. In the last step, users choose the number of top-ranking predicted phosphorylation sites (N = 1, 3, 5, 10 or 20) to be displayed on the result webpage. An example of a typical input for Quokka is shown in Fig. 1B.
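The input requirements described above can be checked locally before submission. The following is a minimal sketch in Python, not part of Quokka itself: the function names (`parse_fasta`, `check_submission`, `candidate_sites`) are hypothetical, and only the 1,000-sequence limit comes from the text. The helper that lists candidate residues assumes the usual convention that serine/threonine kinases act on S and T and tyrosine kinases act on Y.

```python
MAX_SEQUENCES = 1000  # submission limit stated in the text


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_submission(text, limit=MAX_SEQUENCES):
    """Return the parsed records, or raise if the batch is empty or too large."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > limit:
        raise ValueError(f"{len(records)} sequences exceed the limit of {limit}")
    return records


def candidate_sites(sequence, residues="STY"):
    """Return (1-based position, residue) pairs for potential phosphosites."""
    return [(i + 1, aa) for i, aa in enumerate(sequence) if aa in residues]
```

For example, `check_submission(open("proteins.fasta").read())` (with a hypothetical file name) would raise an error for a batch of more than 1,000 sequences, which would then need to be split into several submissions.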



Fig. 1. (A) The overall workflow of Quokka (using the logistic regression model as an example); (B) the web interface of Quokka for sequence submission; and (C) the prediction results of the Quokka webserver for two proteins (UniProt IDs: P46527 and O60343) for the AGC/AKT kinase family.



A strength of Quokka is that it returns prediction results rapidly, thereby facilitating high-throughput sequence prediction. Our tests suggest that, on average, it takes eight minutes to process 1,000 protein sequences and return the prediction results. This efficiency arises because, unlike most other machine-learning based methods, Quokka does not calculate high-dimensional feature sets for the input. After Quokka completes the prediction, the outcomes for all submitted sequences are returned on the result webpage. As shown in Fig. 1C, the result tables report the prediction scores for each protein. To present the prediction results comprehensively, each table provides six sortable columns: ‘Rank’, ‘Position’, ‘Site’, ‘Motif’, ‘Score’ and ‘Kinase family’. All prediction results can be easily exported to widely used file formats, including CSV, Excel spreadsheet and PDF.
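The ranking and export step can be illustrated with a short sketch in Python. This is not Quokka's implementation: the function names (`top_sites`, `to_csv`) and the input dictionary keys are hypothetical, and the sketch assumes that a higher score indicates a more likely phosphorylation site. Only the column names and the permitted values of N are taken from the text.

```python
import csv
import io

ALLOWED_TOP_N = (1, 3, 5, 10, 20)  # values of N offered in the text
COLUMNS = ["Rank", "Position", "Site", "Motif", "Score", "Kinase family"]


def top_sites(scored_sites, n, kinase_family):
    """Rank scored sites (highest score first, an assumption) and keep the top n.

    scored_sites: list of dicts with the hypothetical keys
    'position', 'site', 'motif' and 'score'.
    """
    if n not in ALLOWED_TOP_N:
        raise ValueError(f"N must be one of {ALLOWED_TOP_N}")
    ranked = sorted(scored_sites, key=lambda s: s["score"], reverse=True)[:n]
    return [
        {
            "Rank": rank,
            "Position": s["position"],
            "Site": s["site"],
            "Motif": s["motif"],
            "Score": s["score"],
            "Kinase family": kinase_family,
        }
        for rank, s in enumerate(ranked, start=1)
    ]


def to_csv(rows):
    """Serialise ranked rows to CSV text using the six result-table columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `to_csv(top_sites(sites, 5, "AGC/AKT"))` on a list of scored sites would produce a table with the same six columns as the result webpage, limited to the five highest-scoring sites.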