A Two-Stage Noisy Pre-training and Fine-tuning Pipeline for Low-Resource Named Entity Recognition in Shahmukhi Punjabi
DOI: https://doi.org/10.21015/vtse.v14i2.2371
Abstract
Named Entity Recognition (NER) for low-resource languages remains a critical challenge in natural language processing, particularly for scripts with limited annotated corpora. This paper addresses this challenge for Shahmukhi Punjabi, an underrepresented Perso-Arabic script used by millions in Pakistan. We propose a two-stage training pipeline that leverages a large-scale machine-labeled corpus generated by a Bagging-CRF ensemble to warm-start multilingual transformer models before fine-tuning on a small, gold-standard human-annotated dataset. We evaluate five state-of-the-art multilingual transformers (mBERT, XLM-R, mmBERT, RemBERT, and mDeBERTa-V3) under two experimental settings: (A) direct supervised fine-tuning on the human-labeled dataset, and (B) the proposed two-stage pipeline. The human-labeled dataset comprises 979 sentences and 25,221 tokens, while the larger machine-labeled corpus contains 16,586 sentences and 336,502 tokens; both datasets cover 13 entity types. Experimental results demonstrate consistent improvements across all five models; mmBERT and RemBERT achieve the highest weighted F1 scores of 0.85 and 0.86, respectively. The most striking gains are observed for mDeBERTa-V3 (+0.21 F1, 39.6% relative) and XLM-R (+0.20 F1, 33.3% relative), demonstrating that the two-stage pipeline provides the greatest benefit to models with limited baseline performance on low-resource scripts. These results validate the effectiveness of noisy domain adaptation as a data augmentation strategy for low-resource NER in morphologically rich, right-to-left scripts.
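As a minimal arithmetic sketch (not part of the paper), the abstract's absolute and relative F1 gains can be cross-checked against each other: since relative gain = delta / baseline, the implied single-stage baseline F1 for each model follows directly from the two reported numbers.

```python
def implied_baseline(delta: float, relative: float) -> float:
    """Recover the baseline F1 implied by an absolute gain (delta)
    and a relative gain (delta / baseline)."""
    return delta / relative

# mDeBERTa-V3: +0.21 absolute at 39.6% relative -> baseline ~0.53
print(round(implied_baseline(0.21, 0.396), 2))  # 0.53
# XLM-R: +0.20 absolute at 33.3% relative -> baseline ~0.60
print(round(implied_baseline(0.20, 0.333), 2))  # 0.6
```

Both implied baselines are internally consistent with the reported gains, supporting the claim that the pipeline helps most where single-stage fine-tuning starts from a weaker score.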
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC BY) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
This work is licensed under a Creative Commons Attribution License CC BY