HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain’s specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.
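
The level-wise loss described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a standard supervised contrastive loss at each hierarchy level and a hand-picked weight per level standing in for the hierarchical penalties. The function names (`supcon_level_loss`, `hierarchical_alignment_loss`), the `level_weights` values, and the temperature are hypothetical and chosen only for illustration.

```python
import numpy as np


def supcon_level_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss over one hierarchy level (assumed form)."""
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)

    # Log-softmax of each anchor's similarities over all other samples.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Positives: other samples sharing the anchor's label at this level.
    pos = (labels[:, None] == labels[None, :]) & not_self
    pos_count = pos.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())


def hierarchical_alignment_loss(z, label_paths, level_weights, temperature=0.1):
    """Weighted sum of per-level contrastive losses (illustrative only).

    label_paths[i] lists the cluster id of sample i at each depth, root first.
    level_weights are hypothetical stand-ins for the hierarchical penalties.
    """
    total = 0.0
    for level, weight in enumerate(level_weights):
        # A node at a given depth is identified by its full path prefix.
        mapping = {}
        ids = np.array(
            [mapping.setdefault(tuple(p[: level + 1]), len(mapping)) for p in label_paths]
        )
        total += weight * supcon_level_loss(z, ids, temperature)
    return total


# Toy usage: four 2-D embeddings forming two tight groups.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_paths = [[0, 0], [0, 0], [1, 0], [1, 0]]
misaligned_paths = [[0, 0], [1, 0], [0, 0], [1, 0]]
weights = (0.7, 0.3)  # assumed: shallower level weighted more heavily

aligned = hierarchical_alignment_loss(embeddings, aligned_paths, weights)
misaligned = hierarchical_alignment_loss(embeddings, misaligned_paths, weights)
```

In this toy setting the loss is lower when the label hierarchy agrees with the geometry of the embeddings, which is the behaviour a fine-tuning objective of this kind would push an embedding model toward.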

Publication
In the 13th International Conference on Learning Representations, Workshop on Scaling Self-Improving Foundation Models without Human Supervision (ICLR 2025 SSI-FM)

Keywords:

Contrastive Learning, Hierarchical Labels, Retrieval-Augmented Generation, Embedding Models, Document Clustering

Citation:

Bhattarai, M., Barron, R., Eren, M.E., Vu, M., Grantcharov, V., Boureima, I., Stanev, V., Matuszek, C., Valtchinov, V., Rasmussen, K. and Alexandrov, B. HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning. Under review at the ICLR ’25 SSI-FM Workshop: 13th International Conference on Learning Representations, Workshop on Scaling Self-Improving Foundation Models without Human Supervision, Apr. 21, 2025, Singapore. 10 pages.

BibTeX:

@article{bhattarai2024heal,
  title={HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning},
  author={Bhattarai, Manish and Barron, Ryan and Eren, Maksim and Vu, Minh and Grantcharov, Vesselin and Boureima, Ismael and Stanev, Valentin and Matuszek, Cynthia and Valtchinov, Vladimir and Rasmussen, Kim and others},
  journal={arXiv preprint arXiv:2412.04661},
  year={2024}
}
Maksim E. Eren
Scientist

My research interests lie at the intersection of the machine learning and cybersecurity disciplines, with a concentration in tensor decomposition.