Discrimination of SARS-COV2 virus protein strain of three major affected countries: USA, China, and Germany

Khalid Allehaibi


In this paper, we discuss the discrimination of SARS-CoV-2 viruses associated with three major affected countries: the USA, China, and Germany. Discrimination can reveal the mutations that arise as the virus is transmitted and spreads, since mutations in its protein structure introduce small changes in the Spike protein. To investigate mutation in SARS-CoV-2, we downloaded the protein strains associated with the USA, China, and Germany from UniProtKB using an advanced search on SARS-CoV-2, country name, and protein name: the accessory proteins (7b, 6, ORF3a, 10, and 8), Envelope small membrane protein, Nucleoprotein, Membrane protein, Spike glycoprotein, 3C-like proteinase, and 2'-O-methyltransferase. After retrieving the protein sequences, we transform them from their biological form into an equivalent numerical form using statistical moments. Classification algorithms such as Random Forest and SVM are then trained on these features and used for classification. Finally, performance is evaluated using K-fold cross-validation, independent testing, self-consistency, and jackknife testing. Every test yields a result above 97%, which shows clear discrimination among the protein strains of the three countries and indicates strong mutation in SARS-CoV-2 sequences.
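
The feature-extraction and validation steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state which moments are used or how residues are encoded, so the alphabetical 1..20 residue encoding, the choice of raw and central moments up to order 3, and the helper names `encode`, `moment_features`, and `k_fold_indices` are all assumptions made for illustration. The classifier itself (Random Forest or SVM) would come from a machine-learning library and is not shown.

```python
# Assumption: the 20 standard amino acids, coded 1..20 in alphabetical
# order of their one-letter symbols. The paper's encoding may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to integer codes, skipping unknown residues."""
    return [AA_INDEX[aa] for aa in sequence.upper() if aa in AA_INDEX]


def raw_moment(values, order):
    """Raw moment of the given order: mean of v**order."""
    return sum(v ** order for v in values) / len(values)


def central_moment(values, order):
    """Central moment of the given order: mean of (v - mean)**order."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** order for v in values) / len(values)


def moment_features(sequence, max_order=3):
    """Fixed-length feature vector: raw moments 1..max_order,
    followed by central moments 2..max_order."""
    values = encode(sequence)
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    raw = [raw_moment(values, k) for k in range(1, max_order + 1)]
    central = [central_moment(values, k) for k in range(2, max_order + 1)]
    return raw + central


def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for K-fold cross-validation.
    Setting k == n_samples gives jackknife (leave-one-out) testing."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


if __name__ == "__main__":
    # Hypothetical short peptide, used only to show the call pattern.
    print(moment_features("MFVFLVLLPLVSSQ"))
    for train, test in k_fold_indices(6, 3):
        print(train, test)
```

Each sequence, whatever its length, maps to a vector of the same size, which is what allows sequences from different countries to be fed to one classifier. In each fold, a model would be fitted on the `train` rows and scored on the `test` rows.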

DOI: http://dx.doi.org/10.21015/vtcs.v9i1.1000



Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.