Skip to content

Latest commit

 

History

History
173 lines (117 loc) · 4.27 KB

File metadata and controls

173 lines (117 loc) · 4.27 KB
title Getting started
description Parse your first PDF with Parxy, explore the unified Document model, and learn how to extract plain text and Markdown from any supported driver.

Getting Started with Parxy

Parxy is a unified Python interface for document parsing.

This tutorial will guide you step-by-step through:

  1. Parsing your first document with Parxy
  2. Loading a PDF and extracting its text
  3. Understanding the unified Document model returned by all parsers

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Install and use Parxy as a Python library
  • Parse documents with a single function call
  • Access structured text through Parxy's unified data model
  • Convert parsed documents to plain text or Markdown

Installation

Install Parxy from PyPI (or your development package):

pip install parxy

You can also install optional parser backends depending on your needs (e.g. PyMuPDF, Unstructured, LlamaParse):

pip install parxy[llama]

Step 1 — Parse Your First Document

Let's start by parsing a simple PDF.

The easiest way is to use the Parxy.parse() method, which automatically selects the default parser (usually pymupdf).

from parxy_core.facade.parxy import Parxy

# Parse a document from a local file path
doc = Parxy.parse("samples/example.pdf")

# Print basic information
print(f"Pages: {len(doc.pages)}")
print(f"Title: {doc.metadata.title}")

You can also specify a parser explicitly:

doc = Parxy.parse("samples/example.pdf", driver_name=Parxy.PYMUPDF)

Or even pass an in-memory file:

import io

with open("samples/example.pdf", "rb") as f:
    pdf_bytes = io.BytesIO(f.read())

doc = Parxy.parse(pdf_bytes)

Each parser requires a configurations that can be specified through enviroment variables. Refers to config.py for details.

Step 2 — Extract Text

Once parsed, the returned object is a Document model — a structured representation of your file.

You can access its text content in different ways:

Get all text as a single string

text = doc.text()
print(text[:500])  # print first 500 characters

Convert the document to Markdown

markdown = doc.markdown()
print(markdown[:500])

This method preserves headings, paragraphs, and lists (when identified by the parser).

Step 3 — Explore the Unified Document Model

Every parser in Parxy returns the same structure, built with Pydantic:

Document
 ├── Metadata
 ├── Page[]
 │   ├── TextBlock[]
 │   │   ├── Line[]
 │   │   │   ├── Span[]
 │   │   │   │   ├── Character[]
 │   │   │   │   └── ...
 │   ├── ImageBlock[]
 │   └── TableBlock[]
 └── Outline[]

Example:

page = doc.pages[0]
first_block = page.blocks[0]

print(first_block.text)
print(first_block.bbox)
print(first_block.category)

What Happens Under the Hood

When you call:

doc = Parxy.parse("file.pdf")

Parxy performs the following steps:

  1. Initializes a singleton DriverFactory
  2. Selects the appropriate driver (e.g. PyMuPDF)
  3. Invokes the driver's .parse() method
  4. Returns a normalized Document object with consistent structure

This means you can switch parsers (e.g., from PyMuPDF to LlamaParse) without changing how you handle the output.

Summary

In this tutorial you:

  • Installed and imported Parxy
  • Parsed a document with a single line of code
  • Extracted text and Markdown
  • Explored the unified document model

You're now ready to try more advanced use cases, such as:

Tip

If your parsed text seems incomplete or misaligned, try a different driver:

doc = Parxy.parse("file.pdf", driver_name=Parxy.UNSTRUCTURED_LIBRARY)

Each backend may specialize in different document types.