Computational Identification of Lungs Cancer Causing Genes by Machine Learning (ML) Classifiers

Muntaha Saleem, Muhammad Sohaib Akram, Seher Ansar Khawaja


Mortality is rising worldwide among both men and women because of the increasing incidence of lung cancer. Lung cancer is a dangerous disease that usually begins with the unrestrained growth of abnormal cells in the lungs. Early detection has long been a major challenge, and many detection techniques have been introduced over time to address it. In the last decade, many machine learning (ML) classifiers have been developed and adopted for lung cancer detection. In this study, we used six ML classifiers, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), AdaBoost, Convolutional Neural Network (CNN), XGBoost, and Naïve Bayes, to detect lung cancer-causing genes. We collected the dataset from the publicly available IntOGen browser. It consists of 2193 genes and includes both tumor and non-tumor genes. To determine which classifier detects lung cancer and lung cancer-causing genes most accurately, we compared the six classifiers and found that CNN performed best, with 86 percent accuracy.
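
The comparison described in the abstract, training several classifiers on the same gene data and ranking them by accuracy, can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic (the function `make_synthetic_genes`, its five numeric features, and the class shift are hypothetical stand-ins for a real gene feature table), only two of the six classifiers (KNN and Gaussian Naïve Bayes) are implemented, and the 80/20 hold-out split is an assumption, since the abstract does not state the evaluation protocol.

```python
import math
import random
from collections import Counter


def make_synthetic_genes(n=200, n_features=5, shift=2.0, seed=0):
    """Hypothetical stand-in data: label 1 = tumor gene, 0 = non-tumor gene."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes
        # Tumor-class features are drawn around `shift`, non-tumor around 0.
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training points closest to x (Euclidean)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def fit_naive_bayes(train_X, train_y):
    """Estimate a log prior and per-feature mean/variance for each class."""
    model = {}
    for label in sorted(set(train_y)):
        rows = [x for x, lbl in zip(train_X, train_y) if lbl == label]
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mean, var))
        model[label] = (math.log(len(rows) / len(train_y)), stats)
    return model


def nb_predict(model, x):
    """Return the class with the highest Gaussian log posterior for x."""
    def log_posterior(label):
        log_prior, stats = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(x, stats)
        )
    return max(model, key=log_posterior)


def compare_classifiers(seed=0):
    """Train both classifiers on 80% of the data; score them on the rest."""
    X, y = make_synthetic_genes(seed=seed)
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    cut = int(0.8 * len(order))
    train_X = [X[i] for i in order[:cut]]
    train_y = [y[i] for i in order[:cut]]
    test_X = [X[i] for i in order[cut:]]
    test_y = [y[i] for i in order[cut:]]

    nb_model = fit_naive_bayes(train_X, train_y)
    predictions = {
        "KNN": [knn_predict(train_X, train_y, x) for x in test_X],
        "NaiveBayes": [nb_predict(nb_model, x) for x in test_X],
    }
    # Accuracy = fraction of held-out genes whose label was predicted correctly.
    return {
        name: sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
        for name, preds in predictions.items()
    }


if __name__ == "__main__":
    for name, acc in compare_classifiers().items():
        print(f"{name}: held-out accuracy = {acc:.2f}")
```

Scoring every classifier on the same held-out genes is what makes the accuracies comparable. With real data, the remaining classifiers would typically come from established libraries rather than hand-written code.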





This work is licensed under a Creative Commons Attribution 3.0 License.