W-rank: A keyphrase extraction method for webpage based on linguistics and DOM-base features

Authors

  • Himat Shah, School of Computing, University of Eastern Finland, Joensuu, Finland
  • Dr. Shafique Ahmed, Department of CS&IT, Benazir Bhutto Shaheed University Lyari, Karachi, Sindh, Pakistan
  • Anwar Ali Sathio, Department of CS&IT, Benazir Bhutto Shaheed University, Karachi, Sindh, Pakistan
  • Dr. Asadullah Burdi, Institute of Mathematics and Computer Science (IMCS), University of Sindh, Jamshoro, Sindh, Pakistan

DOI:

https://doi.org/10.21015/vtcs.v11i1.1493

Abstract

This paper addresses the problem of automatic keyphrase extraction from webpage text. Our method, which we call W-rank, is unsupervised. First, we extract the text of a webpage and tokenize it into three candidate lists: unigrams, bigrams, and noun phrases. We then assign a score to each candidate based on its appearance in linguistic and DOM-based feature sets. In the final step, we rank the candidates by score, select the top 5 keyphrases from each list, and combine them into the final set of keyphrases for the webpage. By using linguistic features, we place greater emphasis on the relevance of the keyphrases to the page content. We compare our method with other methods using precision, recall, and F-score. The experimental results show that W-rank improves on our previous method, D-rank, and outperforms other state-of-the-art methods.
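The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the score used here (the number of feature sets in which a candidate appears) is an assumption, as are the function names and the example feature-set names such as "title" and "h1". Noun-phrase candidates are assumed to come from an external POS tagger and are passed in as a ready-made list.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple


def unigrams(text: str) -> List[str]:
    """Lowercased word tokens in order of first appearance, without duplicates."""
    return list(dict.fromkeys(re.findall(r"[a-z0-9]+", text.lower())))


def bigrams(text: str) -> List[str]:
    """Adjacent word pairs in order of first appearance, without duplicates."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return list(dict.fromkeys(f"{a} {b}" for a, b in zip(words, words[1:])))


def score(candidate: str, feature_sets: Dict[str, Set[str]]) -> int:
    """Assumed score: how many feature sets contain the candidate."""
    return sum(candidate in members for members in feature_sets.values())


def top_k(candidates: Iterable[str],
          feature_sets: Dict[str, Set[str]],
          k: int = 5) -> List[str]:
    """Rank by descending score; sorted() is stable, so ties keep input order."""
    return sorted(candidates, key=lambda c: -score(c, feature_sets))[:k]


def combine(*ranked_lists: List[str]) -> List[str]:
    """Merge the per-list winners into one list, dropping duplicates."""
    return list(dict.fromkeys(kp for lst in ranked_lists for kp in lst))


def precision_recall_f(extracted: Iterable[str],
                       gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F-score of extracted keyphrases against a gold set."""
    extracted_set, gold_set = set(extracted), set(gold)
    hits = len(extracted_set & gold_set)
    p = hits / len(extracted_set) if extracted_set else 0.0
    r = hits / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

In use, `top_k` would be called once per candidate list (unigrams, bigrams, noun phrases) with `k=5`, and the three results passed to `combine`; the merged list can then be compared against human-assigned keyphrases with `precision_recall_f`.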

Published

2023-05-30

How to Cite

Shah, H., Ahmed, D. S., Sathio, A. A., & Burdi, D. A. (2023). W-rank: A keyphrase extraction method for webpage based on linguistics and DOM-base features. VAWKUM Transactions on Computer Sciences, 11(1), 217–228. https://doi.org/10.21015/vtcs.v11i1.1493