GitHub - FizzySoap/Search-Engine-IR-System: A custom-built Information Retrieval system featuring Boolean and Ranked retrieval with BST-based inverted indexing.

Custom Information Retrieval and Search Engine Architecture Boolean and Ranked Retrieval with BST-Based Reverse Indexing Project Overview

This project is a high-performance Search Engine implemented in Java, engineered from the ground up to handle unstructured text data. To demonstrate a deep understanding of data structures, this system avoids the Java Collections Framework in favor of custom-built Abstract Data Types (ADTs), providing low-level control over memory and algorithmic efficiency. Core Engineering Features

Hybrid Indexing Architecture

The system utilizes a "Reverse Index" (Inverted Index) to map terms to their occurrences across the dataset. I implemented two distinct strategies to evaluate performance:

InvertedIndexBST: Organizes index nodes into a Binary Search Tree, reducing term lookup complexity to O(log u), where u is the number of unique words.

InvertedIndexNode: A specialized container that pairs a unique term with its document frequency and a dynamic list of Document IDs.

Algorithmic Query Optimizations

Two-Pointer Boolean Logic: For AND and OR queries, the processor uses a two-pointer traversal method. This avoids nested loops, achieving a linear time complexity of O(n + m), where n and m are the lengths of the document lists.

Ranked Retrieval (TF-Scoring): A relevance-ranking engine based on Term Frequency. It calculates how often terms appear in documents and orders results via a custom descending sort algorithm.

Coalesced Chaining: Advanced collision resolution implemented for hash-based structures to maintain high data locality and minimize lookup overhead.
Pre-processing Pipeline

Text Normalization: Includes a TextProcessor for regex-based cleaning, including punctuation removal and case-insensitivity.

Stop-word Filtering: Integrates a StopWordLoader that handles a custom list of 500+ noise words to optimize the index size and improve search precision.

Technical Specifications

Lookup Complexity: O(log u) using BST.

Query Complexity: O(n + m) for intersections and unions using the Two-Pointer approach.

Data Structures: Custom-built Lists, BSTs implementations for complete control over the execution environment.

Usage and Customization

The engine is a generalized framework designed for experimentation with different datasets:

Custom Dataset: Replace dataset.csv with your own data (Format: ID, Content).

Custom Filtering: Modify stop.txt to update the noise-reduction pipeline.

User Interface & Interaction

Hybrid Interaction Model: The system utilizes a Java Swing JFileChooser for intuitive dataset selection, paired with a high-performance Command Line Interface (CLI) for iterative query execution and performance profiling.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
build		build
dist		dist
docs		docs
nbproject		nbproject
src/searchengine		src/searchengine
README.md		README.md
build.xml		build.xml
dataset.csv		dataset.csv
manifest.mf		manifest.mf
stop.txt		stop.txt
test cases.txt		test cases.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages