Automatic Query Expansion based Document Retrieval System using Hyper Tuned Graph Enhanced Bi-directional Encoder Representation from Transformers with Manifold Ranking Algorithm

D. Y. B. Priyadarshini; S. Aquter Babu

doi:10.62760/iteecs.4.4.2025.164

Authors

D. Y. B. Priyadarshini, Department of Computer Science and Technology, Dravidian University, Kuppam – 517425, India, https://orcid.org/0009-0000-3930-9206
S. Aquter Babu, Department of Computer Science and Technology, Dravidian University, Kuppam – 517425, India, https://orcid.org/0009-0005-1415-5411

Keywords:

Eccentricity-Based Keyword Extraction, Artificial Gorilla Troops Optimization Algorithm, HT-GEBERT, Automatic Query Expansion, Manifold Ranking, Pseudo Adversarial Embedding

Abstract

In document retrieval systems, automatic query expansion (AQE) adds relevant terms to a user's query to improve search results. AQE improves recall, but it can also introduce irrelevant terms, increase computational complexity, and slow retrieval. Machine learning and deep learning models for AQE-based document retrieval can be computationally expensive and slow; they also need large datasets and careful tuning, which is resource-intensive. The proposed AQE model uses HT-GEBERT to generate contextual query expansions and manifold ranking to order terms by relevance. This combination improves document retrieval accuracy and reduces query drift. First, the corpus of user queries is augmented with Pseudo Adversarial Embedding (PAE) to increase data variety for robust model training. To keep the model's input consistent, the augmented text and response are pre-processed with tokenization, lemmatization, acronym expansion, stop word removal, hyperlink removal, and spell correction. Next, Eccentricity-Based Keyword Extraction (EKE) extracts key phrases. The Hyper-Tuned Graph Enhanced Bidirectional Encoder Representation from Transformers (HT-GEBERT) model then vectorizes the words, with its hyperparameters optimized by the Artificial Gorilla Troops Optimization Algorithm. Finally, a ranking-based query expansion approach re-ranks the candidate phrases with a manifold ranking algorithm, and the text is split into pieces whose relevance is assessed by cosine similarity to retrieve the document. On dataset 1, the proposed method achieves 94% accuracy, a 94.5% positive predictive value (PPV), and a 5.5% false discovery rate (FDR). Overall, the proposed method combines query expansion, embedding, and optimization for document retrieval.
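
The final two steps of the abstract (manifold ranking of candidate terms, then cosine similarity over text pieces) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard closed-form manifold ranking, f = (I - alpha * S)^-1 y, over a Gaussian affinity graph, and it replaces HT-GEBERT outputs with hand-made 2-D toy vectors. The function names, the parameters `alpha` and `sigma`, the terms, and the documents are all hypothetical.

```python
import numpy as np


def manifold_rank(embeddings, query_idx, alpha=0.9, sigma=0.5):
    """Score all terms by manifold ranking from one query term.

    Assumed formulation (not taken from the paper): the closed form
    f = (I - alpha * S)^-1 y, with S the symmetrically normalized
    Gaussian affinity matrix built from the term embeddings.
    """
    X = np.asarray(embeddings, dtype=float)
    # Pairwise squared Euclidean distances between term vectors.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops in the affinity graph
    degree = W.sum(axis=1)
    S = W / np.sqrt(np.outer(degree, degree))
    y = np.zeros(len(X))
    y[query_idx] = 1.0  # the query term is the only seed
    return np.linalg.solve(np.eye(len(X)) - alpha * S, y)


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def best_chunk_score(query_vec, chunk_vecs):
    """Score a document by its most similar piece (chunk)."""
    return max(cosine(query_vec, chunk) for chunk in chunk_vecs)


# Toy term embeddings (hypothetical stand-ins for HT-GEBERT vectors).
terms = ["heart", "cardiac", "artery", "guitar"]
term_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]])

# Rank candidate terms against the query term "heart".
scores = manifold_rank(term_vecs, query_idx=terms.index("heart"))
ranked = [terms[i] for i in np.argsort(-scores)]

# Keep the two best-ranked terms other than the query itself.
expansion_terms = [t for t in ranked if t != "heart"][:2]

# Represent the expanded query as the mean of its term vectors.
expanded_idx = [terms.index(t) for t in ["heart"] + expansion_terms]
expanded_query_vec = term_vecs[expanded_idx].mean(axis=0)

# Each document is a list of chunk embeddings (again toy values).
documents = {
    "doc_a": np.array([[0.95, 0.1], [0.2, 0.9]]),
    "doc_b": np.array([[0.1, 1.0], [0.0, 0.8]]),
}
doc_scores = {
    name: best_chunk_score(expanded_query_vec, chunks)
    for name, chunks in documents.items()
}
best_doc = max(doc_scores, key=doc_scores.get)
```

On these toy vectors the expansion keeps "cardiac" and "artery", drops the unrelated "guitar", and `doc_a` scores highest. Taking the maximum over chunks is one plausible way to aggregate piece-level similarity; the abstract does not say which aggregation the authors use.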
