A Decision Tree Based Approach for Pashto Coreference Resolution: The Case of Person Name Aliases

Authors

DOI:

https://doi.org/10.21015/vtse.v13i2.2143

Abstract

Coreference resolution is an important problem in fields such as natural language understanding, natural language generation, named entity recognition, text summarization, and anaphora resolution. Determining whether two proper nouns are aliases of each other (i.e., alias identification) is a classification problem: a binary classifier is needed that returns “Yes” if the two input nouns are aliases and “No” otherwise. This paper proposes a binary decision-tree-based classifier, augmented with a cosine similarity measure, for identifying personal name aliases in Pashto. The classifier is trained on alias records containing feature vectors. A total of 10,000 proper-noun pairs were extracted from the Pashto corpus and a collection of crawled Pashto text, and their features were recorded, yielding 10,000 example records with 12 attributes each. The dataset covers different genres of the corpus, e.g., novels, dramas, news, sports, letters, and essays, and contains 5,000 positive instances (class “Yes”) and 5,000 negative instances (class “No”). The records are split into a training part and a testing part in a 7:3 ratio. The 7,000 training examples are used to induce the decision tree, which is built with RapidMiner, a data mining tool. First-order logic rules are then derived from the decision tree and transformed into an algorithm implemented in Python. The rules are evaluated on the testing part, which contains 3,000 labeled examples. Of these, 2,794 are classified correctly, an accuracy of approximately 93% (2,794/3,000 ≈ 93.1%). An error analysis of the roughly 7% of examples that were misclassified is performed to guide future improvements to the system.
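To illustrate the pipeline summarized in the abstract, the following minimal Python sketch shows how a cosine similarity score could feed a rule-based alias decision, and how the reported accuracy follows from the stated counts. It is not the authors' implementation: the character-bigram representation, the shared-token rule, the 0.5 threshold, the example names, and all function names are assumptions made purely for illustration, since the abstract does not list the 12 attributes or the rules induced from the tree.

```python
from collections import Counter
from math import sqrt


def char_bigrams(name):
    """Count character bigrams of a name, padded with boundary markers.

    The bigram representation is an assumption; the abstract does not say
    which vectors the cosine similarity is computed over.
    """
    padded = f"^{name}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(name_a, name_b):
    """Cosine similarity between the bigram count vectors of two names."""
    va, vb = char_bigrams(name_a), char_bigrams(name_b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_alias(name_a, name_b, threshold=0.5):
    """Hypothetical two-rule classifier standing in for the induced rules.

    Rule 1 (assumed): names sharing a whole token are aliases.
    Rule 2 (assumed): otherwise fall back to a cosine similarity threshold.
    """
    if set(name_a.split()) & set(name_b.split()):
        return "Yes"
    return "Yes" if cosine_similarity(name_a, name_b) >= threshold else "No"


def accuracy(correct, total):
    """Fraction of test examples classified correctly."""
    return correct / total


if __name__ == "__main__":
    # Hypothetical transliterated names, not taken from the dataset.
    print(is_alias("Ahmad Khan", "Khan"))
    # Counts reported in the abstract: 2,794 correct out of 3,000.
    print(round(accuracy(2794, 3000), 4))  # 0.9313
```

In a setup like this, the similarity score would be only one of several attributes in each record, and the thresholds would come from the induced tree rather than being fixed by hand.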

Published

2025-06-06

How to Cite

Zuhra, F. T., Ali, H., & Naz, S. (2025). A Decision Tree Based Approach for Pashto Coreference Resolution: The Case of Person Name Aliases. VFAST Transactions on Software Engineering, 13(2), 161–169. https://doi.org/10.21015/vtse.v13i2.2143

Section

Articles