Advancing NLP for Shahmukhi Punjabi: Word Embedding and Text Classification with a Novel Dataset

Authors

M. Shabbir, S. F. Bhatti, R. S. Ahmed Larik, A. O. Panhwar, and Muhammad Saif

DOI:

https://doi.org/10.21015/vtcs.v13i1.2093

Abstract

Punjabi is one of the most widely spoken languages in the world; in Pakistan alone, roughly 80 million people speak it, mainly in the province of Punjab. Despite this large speaker base, the language remains under-served by research. This work focuses on Shahmukhi Punjabi, the script variant common in Pakistan and across South Asia, for which no adequately sized, supervised dataset has been established and for which word embedding and text classification remain largely unexplored. This paper introduces a purpose-built dataset for Shahmukhi Punjabi and applies established NLP techniques, Word2Vec and fastText word embeddings, to capture semantic relations within the language. In addition, we evaluate six distinct classification models on four text categories: News, Ghazal, Dohra, and Poetry. The strong performance of the Naive Bayes classifier relative to the other models lays the groundwork for future research and applications in natural language processing for Punjabi. The study encourages further exploration, the development of solutions tailored to linguistic diversity in digital environments, and the application of deep learning models.
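To illustrate the word-embedding side of the pipeline: Word2Vec's skip-gram variant learns vectors by predicting the words surrounding each center word. The sketch below shows only the first step of that process, extracting (center, context) training pairs from a tokenized sentence. The romanized tokens are placeholders, not taken from the paper's Shahmukhi dataset.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a symmetric window,
    as a skip-gram model (Word2Vec) would generate them for training."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["punjabi", "shahmukhi", "news", "corpus"]
print(skipgram_pairs(sentence, window=1))
```

In practice one would train on such pairs with a library implementation (e.g. gensim's `Word2Vec`); this sketch only makes the windowing mechanism concrete.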
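On the classification side, the paper's best-performing model is multinomial Naive Bayes. The paper does not publish its code, so the following is a minimal from-scratch sketch of the technique with Laplace smoothing; the toy English documents and the two labels are placeholders, not the paper's News/Ghazal/Dohra/Poetry data.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}  # per-class word counts
        self.vocab = set()
        for doc, c in zip(docs, labels):
            for tok in doc.split():
                self.counts[c][tok] += 1
                self.vocab.add(tok)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}

    def predict(self, doc):
        V = len(self.vocab)

        def log_score(c):
            s = math.log(self.priors[c])
            for tok in doc.split():
                # add-one (Laplace) smoothing handles unseen words
                s += math.log((self.counts[c][tok] + 1) / (self.totals[c] + V))
            return s

        return max(self.classes, key=log_score)

docs = ["election vote minister", "moon love verse",
        "vote parliament", "verse rhyme love"]
labels = ["news", "poetry", "news", "poetry"]
nb = NaiveBayes()
nb.fit(docs, labels)
print(nb.predict("minister vote"))  # -> "news"
```

A real experiment would use a vetted implementation (e.g. scikit-learn's `MultinomialNB`) over the full dataset; the sketch only shows why the model is a natural baseline for category-level text classification.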

References

E. Hasan, M. M. Iqbal, et al., "An online Punjabi Shahmukhi lexical resource," Sci. Int. Lahore, 2015.

V. Goyal and G. S. Lehal, "Comparative study of Hindi and Punjabi language scripts," Nepal. Linguist., 2008.

M. A. Hashmi, M. A. Mahmood, and M. I. Mahmood, "Analysis of lexico-semantic relations of Punjabi Shahmukhi nouns: A corpus-based study," Int. J. Engl. Linguist., vol. 13, 2019.

H. Singh, "Analyzing the Punjabi Language Stemmers: A Critical Approach," in ISIC, 2021, pp. 250-258. [Online]. Available: https://ceur-ws.org/Vol-2786/Paper33.pdf

M. F. Arslan, M. A. Mahmood, M. Shoaib, S. Idrees, and Z. Tariq, "Morphological description of nouns in Shahmukhi Punjabi: A corpus-based study," J. Posit. Sch. Psychol., pp. 1259–1269, 2023.

M. T. Ahmad et al., "Named Entity Recognition and Classification for Punjabi Shahmukhi," ACM Trans. Asian Low-Resour. Lang. Inf. Process., vol. 19, no. 4, pp. 1–13, Jul. 2020. doi: https://doi.org/10.1145/3383306

A. L. Bowden, "Punjabi tonemics and the Gurmukhi script: A preliminary study," Brigham Young University, [Online]. Available: https://some-link.com, Accessed: Jan. 23, 2024.

S. J. Johnson, M. R. Murty, and I. Navakanth, "A detailed review on word embedding techniques with emphasis on word2vec," Multimed. Tools Appl., Oct. 2023. doi: https://doi.org/10.1007/s11042-023-17007-z

S. Selva Birunda and R. Kanniga Devi, "A Review on Word Embedding Techniques for Text Classification," in Innovative Data Communication Technologies and Application, J. S. Raj, A. M. Iliyasu, R. Bestak, and Z. A. Baig, Eds., Lecture Notes on Data Engineering and Communications Technologies, vol. 59, Singapore: Springer Singapore, 2021, pp. 267–281.

V. P. Singh and P. Kumar, "Sense disambiguation for Punjabi language using supervised machine learning techniques," Sādhanā, vol. 44, no. 11, p. 226, Nov. 2019. doi: https://doi.org/10.1007/s12046-019-1206-x

Y. Li and T. Yang, "Word Embedding for Understanding Natural Language: A Survey," in Guide to Big Data Applications, S. Srinivasan, Ed., Studies in Big Data, vol. 26, Cham: Springer International Publishing, 2019, pp. 83–104.

S. S. Sahu, D. Dutta, S. Pal, and I. Rasheed, "Effect of Stopwords and Stemming Techniques in Urdu IR," SN Comput. Sci., vol. 4, no. 5, p. 547, Jul. 2023. doi: 10.1007/s42979-023-01953-4

E. Loper and S. Bird, "NLTK: The Natural Language Toolkit," arXiv, May 17, 2002. [Online]. Available: http://arxiv.org/abs/cs/0205028. Accessed: Jan. 23, 2024.

S. Takase and S. Kobayashi, "All word embeddings from one embedding," Adv. Neural Inf. Process. Syst., vol. 33, pp. 3775–3785, 2020.

S. S. Sahu and S. Pal, "Effect of stopwords in Indian language IR," Sādhanā, vol. 47, no. 1, p. 17, Mar. 2022. doi: 10.1007/s12046-021-01731-z

H. Singh, "GPStemmer—A Gurmukhi Punjabi Stemmer," in Advances in Data and Information Sciences, S. Tiwari, M. C. Trivedi, M. L. Kolhe, K. K. Mishra, and B. K. Singh, Eds., Lecture Notes in Networks and Systems, vol. 318, Singapore: Springer Singapore, 2022, pp. 493–503.

D. Hadžiosmanović, L. Simionato, D. Bolzoni, E. Zambon, and S. Etalle, "N-Gram against the Machine: On the Feasibility of the N-Gram Network Analysis for Binary Protocols," in Research in Attacks, Intrusions, and Defenses, D. Balzarotti, S. J. Stolfo, and M. Cova, Eds., Lecture Notes in Computer Science, vol. 7462, Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 354–373.

A. Sak, "Using cosine similarity classifier for NLP analysis in construction field texts," in AIP Conference Proceedings, AIP Publishing, 2023. [Online]. Available: https://pubs.aip.org/aip/acp/article/2791/1/040010/2905402. Accessed: Jan. 23, 2024.

A. Bražinskas, S. Havrylov, and I. Titov, "Embedding Words as Distributions with a Bayesian Skip-gram Model," arXiv, Jun. 10, 2018. [Online]. Available: http://arxiv.org/abs/1711.11027. Accessed: Jan. 23, 2024.

Z. Xiong, Q. Shen, Y. Xiong, Y. Wang, and W. Li, "New Generation Model of Word Vector Representation Based on CBOW or Skip-Gram," Comput. Mater. Contin., vol. 60, no. 1, 2019.

P. Preethi Krishna and A. Sharada, "Word Embeddings- Skip Gram Model," in Studies in Big Data, vol. 26, V. K. Gunjan, V. Garcia Diaz, M. Cardona, V. K. Solanki, and K. V. N. Sunitha, Eds., Singapore: Springer Singapore, 2020, pp. 133–139.

Y. Wang, L. Cui, and Y. Zhang, "Improving skip-gram embeddings using BERT," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1318–1328, 2021.

P. Mehndiratta and D. Soni, "Identification of sarcasm using word embeddings and hyperparameters tuning," J. Discrete Math. Sci. Cryptogr., vol. 22, no. 4, pp. 465–489, May 2019. doi: 10.1080/09720529.2019.1637152.

T. Menon, "Empirical analysis of CBOW and skip gram NLP models," 2020.

W. B. Cavnar and J. M. Trenkle, "N-gram-based text categorization," in Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, 1994, p. 14.

A. Basile, G. Dwyer, M. Medvedeva, J. Rawee, H. Haagsma, and M. Nissim, "N-GrAM: New Groningen Author-profiling Model," arXiv, Jul. 12, 2019.

T. P. Adewumi, F. Liwicki, and M. Liwicki, "Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks," arXiv, Apr. 17, 2021.

S. Pattnaik and A. K. Nayak, "Summarization of Odia text document using cosine similarity and clustering," in 2019 International Conference on Applied Machine Learning (ICAML), IEEE, 2019, pp. 143–146.

W. Chen et al., "A comparative study of logistic model tree, random forest, and classification and regression tree models for spatial prediction of landslide susceptibility," Catena, vol. 151, pp. 147–160, 2017.

K. Park, J. S. Hong, and W. Kim, "A Methodology Combining Cosine Similarity with Classifier for Text Classification," Appl. Artif. Intell., vol. 34, no. 5, pp. 396–411, Apr. 2020. doi: 10.1080/08839514.2020.1723868.

W. Loh, "Fifty Years of Classification and Regression Trees," Int. Stat. Rev., vol. 82, no. 3, pp. 329–348, Dec. 2014. doi: 10.1111/insr.12019.

D. Alberg, M. Last, and A. Kandel, "Knowledge discovery in data streams with regression tree methods," WIREs Data Min. Knowl. Discov., vol. 2, no. 1, pp. 69–78, Jan. 2012. doi: 10.1002/widm.51.

V. Gregg, "Word frequency, recognition and recall," 1976.

Published

2025-04-04

How to Cite

Shabbir, M., Bhatti, S. F., Ahmed Larik, R. S., Panhwar, A. O., & Muhammad Saif. (2025). Advancing NLP for Shahmukhi Punjabi: Word Embedding and Text Classification with a Novel Dataset. VAWKUM Transactions on Computer Sciences, 13(1), 20–39. https://doi.org/10.21015/vtcs.v13i1.2093