Skip to content

FizzySoap/Search-Engine-IR-System

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Custom Information Retrieval and Search Engine Architecture Boolean and Ranked Retrieval with BST-Based Reverse Indexing Project Overview

This project is a high-performance Search Engine implemented in Java, engineered from the ground up to handle unstructured text data. To demonstrate a deep understanding of data structures, this system avoids the Java Collections Framework in favor of custom-built Abstract Data Types (ADTs), providing low-level control over memory and algorithmic efficiency. Core Engineering Features

  1. Hybrid Indexing Architecture

The system utilizes a "Reverse Index" (Inverted Index) to map terms to their occurrences across the dataset. I implemented two distinct strategies to evaluate performance:

InvertedIndexBST: Organizes index nodes into a Binary Search Tree, reducing term lookup complexity to O(log u), where u is the number of unique words.

InvertedIndexNode: A specialized container that pairs a unique term with its document frequency and a dynamic list of Document IDs.
  1. Algorithmic Query Optimizations

    Two-Pointer Boolean Logic: For AND and OR queries, the processor uses a two-pointer traversal method. This avoids nested loops, achieving a linear time complexity of O(n + m), where n and m are the lengths of the document lists.

    Ranked Retrieval (TF-Scoring): A relevance-ranking engine based on Term Frequency. It calculates how often terms appear in documents and orders results via a custom descending sort algorithm.

    Coalesced Chaining: Advanced collision resolution implemented for hash-based structures to maintain high data locality and minimize lookup overhead.

  2. Pre-processing Pipeline

    Text Normalization: Includes a TextProcessor for regex-based cleaning, including punctuation removal and case-insensitivity.

    Stop-word Filtering: Integrates a StopWordLoader that handles a custom list of 500+ noise words to optimize the index size and improve search precision.

Technical Specifications

Lookup Complexity: O(log u) using BST.

Query Complexity: O(n + m) for intersections and unions using the Two-Pointer approach.

Data Structures: Custom-built Lists, BSTs implementations for complete control over the execution environment.

Usage and Customization

The engine is a generalized framework designed for experimentation with different datasets:

Custom Dataset: Replace dataset.csv with your own data (Format: ID, Content).

Custom Filtering: Modify stop.txt to update the noise-reduction pipeline.

User Interface & Interaction

Hybrid Interaction Model: The system utilizes a Java Swing JFileChooser for intuitive dataset selection, paired with a high-performance Command Line Interface (CLI) for iterative query execution and performance profiling.
Screenshot 2026-03-31 154742 Screenshot 2026-03-31 154707

About

A custom-built Information Retrieval system featuring Boolean and Ranked retrieval with BST-based inverted indexing.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors