Web Page Ranking Based on Text Content and Link Information Using Data Mining Techniques
Abstract
Thanks to the rapid expansion of the Internet, anyone can now access a vast array of information online. However, as the volume of web content continues to grow exponentially, search engines face challenges in delivering relevant results. Early search engines relied primarily on the words or phrases found within web pages to index and rank them. While this approach had its merits, it often produced irrelevant or inaccurate results. To address this issue, more advanced search engines began incorporating the hyperlink structure of web pages to help determine their relevance. This method improved retrieval accuracy to some extent, but it remained limited because it did not consider the actual content of web pages. The objective of this work is to enhance Web Information Retrieval methods by leveraging three key components: text content analysis, link analysis, and log file analysis. By integrating insights from these data sources, the goal is to achieve a more accurate and effective ranking of relevant web pages in the retrieved document set, ultimately enhancing the user experience and delivering more precise search results. The proposed system was tested with both multi-word and single-word queries, and the results were evaluated using relative recall, precision, and F-measure. When compared to Google’s PageRank algorithm, the proposed system demonstrated superior performance, achieving an 81% mean average precision, 56% average relative recall, and a 66% F-measure.
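The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy page identifiers, and the pooled definition of relative recall (relevant pages retrieved by one system divided by the relevant pages retrieved by all compared systems together) are assumptions made for illustration. The sketch also assumes the F-measure is the balanced harmonic mean of precision and recall; under that assumption, the reported 81% precision and 56% relative recall give an F-measure of about 66%, which agrees with the figure quoted above.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved pages that are relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    return sum(1 for page in retrieved if page in relevant) / len(retrieved)


def relative_recall(retrieved, relevant, pooled_relevant_count):
    """Relevant pages retrieved by one system, divided by the number of
    relevant pages retrieved by all compared systems together.
    (Assumed definition; the abstract does not spell it out.)"""
    if pooled_relevant_count == 0:
        return 0.0
    hits = len(set(retrieved) & set(relevant))
    return hits / pooled_relevant_count


def f_measure(p, r):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy example with hypothetical page identifiers.
retrieved_pages = ["a", "b", "c", "d"]
relevant_pages = {"a", "c", "x"}
p = precision(retrieved_pages, relevant_pages)            # 2 of 4 retrieved are relevant
r = relative_recall(retrieved_pages, relevant_pages, 4)   # 2 of 4 pooled relevant pages
print(p, r, f_measure(p, r))

# Consistency check against the figures reported in the abstract.
print(round(f_measure(0.81, 0.56), 2))  # 0.66
```

Note that the abstract reports a mean average precision, which averages precision over ranked positions and over queries; the set-based `precision` above is a simpler stand-in.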
Copyright (c) 2024 Esraa Q. Naamha, Matheel E. Abdulmunim
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Authors who choose to publish their work with Aro agree to the following terms:
- Authors retain the copyright to their work and grant the journal the right of first publication. The work is simultaneously licensed under a Creative Commons Attribution License [CC BY-NC-SA 4.0]. This license allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors have the freedom to enter into separate agreements for the non-exclusive distribution of the journal's published version of the work. This includes options such as posting it to an institutional repository or publishing it in a book, as long as proper acknowledgement is given to its initial publication in this journal.
- Authors are encouraged to share and post their work online, including in institutional repositories or on their personal websites, both prior to and during the submission process. This practice can lead to productive exchanges and increase the visibility and citation of the published work.
By agreeing to these terms, authors acknowledge the importance of open access and the benefits it brings to the scholarly community.