AlgoCrawl

AlgoCrawl is an advanced non-headless web crawler built with Playwright and TypeScript designed for comprehensive website exploration. Using a dual-depth approach, it not only follows navigational links like a traditional crawler but also simulates user interactions to capture dynamically loaded content.

Overview

AlgoCrawl uses a dual-depth approach to explore web applications:

Page Depth:
Every navigable page (reached via links or form submissions) is analyzed. The crawler compiles interactive elements (forms, links, buttons, etc.) to create a baseline for exploration.
Dynamic Depth:
Within each page, AlgoCrawl simulates user interactions by clicking elements and submitting forms. It monitors for dynamic DOM changes (e.g., AJAX responses or HTML5 updates) and, upon detecting any, recursively explores the updated state.

This approach makes AlgoCrawl ideal for security assessments, performance testing, and in-depth website analysis.

Features

Dual-Depth Crawling:
Combines traditional page crawling with in-page dynamic interaction testing.
Playwright Automation:
Utilizes Playwright for efficient, headless browser automation.
Optimized Element Interaction:
Supports multiple click strategies (standard, JavaScript-triggered, and position-based) to handle various types of interactive elements.
Customizable Configuration:
Easily modify settings such as depth limits, timeouts, proxy configurations, and interaction toggles using JSON.
Error Resilience & Proxy Support:
Comprehensive error handling and proxy configuration to navigate complex web environments.

Installation

Prerequisites

Node.js (v18 or later)

Steps

Clone the Repository:

git clone https://github.com/yourusername/AlgoCrawl.git
cd AlgoCrawl

Install Dependencies:
```
npm install
```

Usage

To start the crawler, run:

npm start

For development mode with auto-reloading:

npm run dev

By default, the crawler uses the configuration in config/default.json. To specify a custom configuration file, pass its path as a command-line argument:

npm start path/to/yourConfig.json

Configuration

Customize the crawler’s behavior via the config/default.json file. Key parameters include:

startUrl: The initial URL to begin crawling.
allowedDomains: A list of domains that the crawler is allowed to explore.
maxDepth & maxPagesPerDomain: Limits to prevent endless crawling.
headless & timeout: Configure the browser’s execution mode and network timeout settings.
interactWithForms & interactWithElements: Toggle dynamic interaction features.
Proxy Settings: Configure proxy details and HTTPS error handling.

Current Status & Progress Summary

Core Functionality

✅ Basic crawler setup with Playwright.
✅ Configurable options via JSON.
✅ Page analysis and detection of interactive elements.
✅ Smart element interaction using multiple click strategies.
✅ AJAX request handling and monitoring.
✅ Proxy support with SSL/HTTPS error management.

Recent Improvements

AJAX Handling

Smart tracking of AJAX requests and responses.
Dynamic network activity monitoring.
Detailed error and timing logs.
Automatic retry mechanisms for failed requests.

Interaction System

Multiple click strategies:
- Standard click.
- JavaScript-triggered click.
- Position-based click.
Efficient waiting mechanisms using network idle and response-based progression.
Post-interaction analysis to detect new dynamic elements.

Performance Optimizations

Removal of unnecessary fixed delays.
Enhanced URL normalization and deduplication.
Optimized request queue monitoring and smart timeout handling.

Error Handling

Comprehensive error capture and logging.
Detailed monitoring of connection errors and response statuses.
Improved error recovery mechanisms.

Current Limitations

Single-page crawling (depth management is pending).
Basic form interaction.
Limited JavaScript event simulation.

Project Description

AlgoCrawl is built around a dual-depth crawling strategy:

Overall Concept:
- Page Depth:
  Each navigable page (via links or forms) is treated as a separate unit with a baseline analysis.
- Dynamic Depth:
  Simulates user interactions on each page. Upon detecting dynamic DOM changes (via AJAX, HTML5 updates, etc.), the crawler reanalyzes the new state recursively.
Crawl Process:
- Initial Page Load:
  Opens the target URL in a headless browser, waits for the page to settle, and compiles interactive elements.
- Processing Forms & Links:
  Submits forms and follows in-scope navigational links, queuing new pages for further exploration.
- Simulated In-Page Interactions:
  Clicks every interactive element, waits for any DOM changes, and recursively explores new dynamic states.
Control and Management:
- Tracking and Limits:
  Maintains a record of visited states (using normalized snapshots or unique identifiers) to prevent redundant processing.
- Scoping & Filtering:
  Defines criteria to determine which pages or states are in-scope, avoiding redundant or out-of-scope interactions.
Error Handling and Logging:
- Implements robust timeout, retry, and state restoration strategies.
- Logs every interaction and outcome for auditing and debugging.
Integration & Extensibility:
- Modular design separates the crawling engine from dynamic interaction logic.
- Easily extensible for future additions such as vulnerability scanning and deeper content analysis.

Roadmap & To Do

Future Enhancements

Depth-First Crawling:
Implement improved URL queuing and state management for deeper exploration.
Advanced Form Interaction:
Enhance form handling with more sophisticated submission and validation techniques.
JavaScript Event Simulation:
Improve simulation of dynamic JavaScript events beyond basic click interactions.
Concurrent Page Handling:
Further optimize multi-threading with a refined worker pool management system.
Enhanced Error Recovery:
Refine error tracking, recovery, and state restoration mechanisms.

Known Issues & To Do

Contributing

Contributions are welcome! To contribute:

Fork the repository.
Create a new branch for your feature or bug fix.
Ensure your code is modular, well-documented, and follows project guidelines.
Open a pull request for review.

For more details, please see our Contribution Guidelines (if available).

License

AlgoCrawl is licensed under the MIT License. See the LICENSE file for details.

Happy Crawling!

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github		.github
config		config
src		src
.cursorignore		.cursorignore
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
CURRENT_STATUS.md		CURRENT_STATUS.md
LICENSE		LICENSE
ProjectDescription.md		ProjectDescription.md
README.md		README.md
SECURITY.md		SECURITY.md
package-lock.json		package-lock.json
package.json		package.json
roadmap.md		roadmap.md
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AlgoCrawl

Table of Contents

Overview

Features

Installation

Prerequisites

Steps

Usage

Configuration

Current Status & Progress Summary

Core Functionality

Recent Improvements

AJAX Handling

Interaction System

Performance Optimizations

Error Handling

Current Limitations

Project Description

Roadmap & To Do

Future Enhancements

Known Issues & To Do

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AlgoCrawl

Table of Contents

Overview

Features

Installation

Prerequisites

Steps

Usage

Configuration

Current Status & Progress Summary

Core Functionality

Recent Improvements

AJAX Handling

Interaction System

Performance Optimizations

Error Handling

Current Limitations

Project Description

Roadmap & To Do

Future Enhancements

Known Issues & To Do

Contributing

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages