A Two-Stage Noisy Pre-training and Fine-tuning Pipeline for Low-Resource Named Entity Recognition in Shahmukhi Punjabi
DOI: https://doi.org/10.21015/vtse.v14i2.2371
Abstract
Named Entity Recognition (NER) for low-resource languages remains a critical challenge in natural language processing, particularly for scripts with limited annotated corpora. This paper addresses this challenge for Shahmukhi Punjabi, an underrepresented Perso-Arabic script used by millions in Pakistan. We propose a two-stage training pipeline that leverages a large-scale machine-labeled corpus generated by a Bagging-CRF ensemble to warm-start multilingual transformer models before fine-tuning on a small, gold-standard human-annotated dataset. We evaluate five state-of-the-art multilingual transformers (mBERT, XLM-R, mmBERT, RemBERT, and mDeBERTa-V3) under two experimental settings: (A) direct supervised fine-tuning on the human-labeled dataset, and (B) the proposed two-stage pipeline. The human-labeled dataset comprises 979 sentences and 25,221 tokens, while the larger machine-labeled corpus contains 16,586 sentences and 336,502 tokens; both datasets cover 13 entity types. Experimental results demonstrate consistent improvements across all five models; mmBERT and RemBERT achieve the highest weighted F1 scores of 0.85 and 0.86, respectively. The most striking gains are observed for mDeBERTa-V3 (+0.21 F1, 39.6% relative) and XLM-R (+0.20 F1, 33.3% relative), demonstrating that the two-stage pipeline provides the greatest benefit to models with limited baseline performance on low-resource scripts. These results validate the effectiveness of noisy domain adaptation as a data augmentation strategy for low-resource NER in morphologically rich, right-to-left scripts.
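As a minimal arithmetic sketch (not part of the paper), the abstract's absolute and relative F1 gains can be cross-checked against each other: since relative gain = delta / baseline, the implied single-stage baseline F1 for each model follows directly from the two reported numbers.

```python
def implied_baseline(delta: float, relative: float) -> float:
    """Recover the baseline F1 implied by an absolute gain (delta)
    and a relative gain (delta / baseline)."""
    return delta / relative

# mDeBERTa-V3: +0.21 absolute at 39.6% relative -> baseline ~0.53
print(round(implied_baseline(0.21, 0.396), 2))  # 0.53
# XLM-R: +0.20 absolute at 33.3% relative -> baseline ~0.60
print(round(implied_baseline(0.20, 0.333), 2))  # 0.6
```

Both implied baselines are internally consistent with the reported gains, supporting the claim that the pipeline helps most where single-stage fine-tuning starts from a weaker score.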
License
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License (CC BY) that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
This work is licensed under a Creative Commons Attribution License CC BY