XGboost-Ampy: Identification of AMPylation Protein Function Prediction Using Machine Learning

Zar Nawab Khan Swati; Ali Ghulam; Muhammad Sohail; Jawad Usman Arshed; Rahu Sikander; Muhammad Shahid Malik; Nauman Khan

doi:10.21015/vtcs.v10i2.1347

Authors

Zar Nawab Khan Swati, Department of Computer Sciences, Karakoram International University, Gilgit-Baltistan
Ali Ghulam, Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
Muhammad Sohail, Department of Computer Science, Sarhad University of Science and Information Technology, Mardan Campus, Khyber Pakhtunkhwa, Pakistan
Jawad Usman Arshed, Department of Computer Science, University of Baltistan, Skardu
Rahu Sikander, Department of Computer Science, University of Baltistan, Skardu, Pakistan
Muhammad Shahid Malik, Department of Computer Sciences, Karakoram International University, Gilgit-Baltistan, Pakistan
Nauman Khan, COMSATS University, Abbottabad, Pakistan

DOI:

https://doi.org/10.21015/vtcs.v10i2.1347

Abstract

AMPylation is an emerging post-translational modification in which adenosine monophosphate (AMP) is covalently attached, through a phosphodiester bond, to the hydroxyl group of a threonine, serine, or tyrosine side chain in a peptide; the reaction is catalyzed by AMPylating enzymes (AMPylators). We trained the models on AMPylation peptide sequence data from bacteria, eukaryotes, and archaea. We then compared several feature extraction methods and their combinations, together with several classification algorithms, to obtain more accurate prediction models. To limit the loss of sequence information, the pseudo amino acid composition (PseAAC) feature is used to construct a fixed-size descriptor vector. The basic feature set is obtained from a second feature extraction method. These features were derived from the protein sequence and from the evolutionary information encoded by the BLOSUM62 amino acid substitution matrix. The eXtreme Gradient Boosting (XGBoost) technique was used to build a new model, which was then compared with the most popular machine learning models. In this research, we propose a framework for AMPylation identification, XGBoost-Ampy, that uses the XGBoost algorithm and sequence-derived features. XGBoost-Ampy achieves an accuracy of 86.7%, a sensitivity of 76.1%, a specificity of 97.5%, and a Matthews correlation coefficient (MCC) of 0.753 for predicting AMPylation sites. XGBoost-Ampy, the first machine learning model developed for this problem, has shown promise and may help address it.
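
The four evaluation measures reported in the abstract (accuracy, sensitivity, specificity, and MCC) are all functions of a binary confusion matrix. The following is a minimal Python sketch of their standard definitions, not the authors' evaluation code. The function name `binary_metrics` and the counts in the usage example are hypothetical and chosen only for illustration; they are not taken from the study's dataset.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. AMPylation) sites predicted correctly/incorrectly.
    tn/fp: negative sites predicted correctly/incorrectly.
    """

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate
        "specificity": ratio(tn, tn + fp),  # true negative rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only (not the paper's data).
    metrics = binary_metrics(tp=80, fn=20, tn=90, fp=10)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is commonly reported alongside sensitivity and specificity when the two classes are predicted with very different error rates.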
