W-rank: A keyphrase extraction method for webpage based on linguistics and DOM-base features
DOI: https://doi.org/10.21015/vtcs.v11i1.1493
Abstract
This paper addresses the problem of automatic keyphrase extraction from webpage text. Our method, which we call W-rank, is unsupervised. First, we extract the text of a webpage and tokenize it into three candidate lists: unigrams, bigrams, and noun phrases. We then assign a score to each candidate based on its appearance in linguistic and DOM-based feature sets. In the final step, we rank the candidates by score, select the top 5 keyphrases from each list, and combine them into the final set of keyphrases for the webpage. We focus on the relevance of the keyphrases to the page content by using linguistic features. We compare our method with other methods using precision, recall, and F-score. The experimental results show that W-rank improves on the performance of our previous method, D-rank, and outperforms other state-of-the-art methods.
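The pipeline described in the abstract (extract text, build candidate lists, score by DOM position, keep the top 5 per list) can be illustrated with a minimal sketch. This is not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the small `STOPWORDS` list standing in for a linguistic filter, and all function names are assumptions made for illustration. The noun-phrase list is omitted because it would require a part-of-speech tagger.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical DOM weights (not taken from the paper): text inside
# these tags contributes more to a candidate's score.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0
SKIP_TAGS = {"script", "style"}
# Tiny illustrative stopword list; a stand-in for real linguistic features.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "for", "is", "on", "with"}


class WeightedTextParser(HTMLParser):
    """Collects (text, weight) segments, weighting text by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags
        self.segments = []   # list of (text, weight)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip() or any(t in SKIP_TAGS for t in self.stack):
            return
        weight = max([TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack],
                     default=DEFAULT_WEIGHT)
        self.segments.append((data, weight))


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z][a-z0-9-]*", text.lower())


def score_candidates(html):
    """Return weighted scores for the unigram and bigram candidate lists."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    unigrams, bigrams = Counter(), Counter()
    for text, weight in parser.segments:
        tokens = tokenize(text)
        for tok in tokens:
            if tok not in STOPWORDS:
                unigrams[tok] += weight
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                bigrams[f"{first} {second}"] += weight
    return unigrams, bigrams


def top_keyphrases(html, k=5):
    """Take the top-k candidates from each list and combine them."""
    unigrams, bigrams = score_candidates(html)
    ranked = [phrase for phrase, _ in unigrams.most_common(k)]
    ranked += [phrase for phrase, _ in bigrams.most_common(k)]
    return ranked
```

Calling `top_keyphrases(page_html)` returns at most `2 * k` phrases here; a fuller version would add a third list of noun phrases and replace the fixed weights with the feature sets defined in the paper.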
License
This work is licensed under a Creative Commons Attribution License CC BY