API Reference

Core Functionality

pdfbeaver.process(pdf: Pdf, options: ProcessingOptions | None = None, registry: HandlerRegistry | None = None, pages: None | int | Page | List[int | Page] = None, page: None | int | Page = None) → None

High-level entry point to modify PDF content.

Parameters:
  • pdf – The pikepdf.Pdf object to process.

  • options – Configuration options.

  • registry – The HandlerRegistry to use. Defaults to the global default_registry.

  • pages – The pages to process. Can be a single integer (0-indexed), a single Page object, a list of integers/Pages, or None (processes all pages).

  • page – Alias for pages (kept for backward compatibility).

Raises:

TypeError – If pdf is not a pikepdf object or pages contains invalid types.
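
The accepted shapes of the pages argument can be illustrated with a small normalizer. This is a minimal sketch, not pdfbeaver's implementation: normalize_pages is a hypothetical helper, it handles only integer indices (Page objects are left out so the example needs no pikepdf), and rejecting booleans is an assumption.

```python
from typing import List, Union


def normalize_pages(pages: Union[None, int, List[int]], page_count: int) -> List[int]:
    """Hypothetical helper: turn a `pages` argument into a list of 0-based indices."""
    if pages is None:
        # None selects every page in the document.
        return list(range(page_count))
    if isinstance(pages, int) and not isinstance(pages, bool):
        # A single index becomes a one-element list.
        return [pages]
    if isinstance(pages, list):
        for item in pages:
            if not isinstance(item, int) or isinstance(item, bool):
                raise TypeError(f"invalid page selector: {item!r}")
        return list(pages)
    raise TypeError(f"invalid pages argument: {pages!r}")


print(normalize_pages(None, 3))
print(normalize_pages(1, 3))
print(normalize_pages([0, 2], 3))
```

Anything that is neither None, an integer, nor a list of integers raises TypeError in this sketch, mirroring the Raises entry above.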

pdfbeaver.modify_page(pdf: Pdf, page: Page, handler: HandlerRegistry, options: ProcessingOptions | None = None) → None

Modifies a PDF page and (optionally) its Form XObjects in-place.

This function parses the page’s content stream, tracks the graphics and text state, and applies the user-defined logic from the handler registry.

Parameters:
  • pdf – The owning pikepdf.Pdf document. Required to create new stream objects when writing back modified content.

  • page – The pikepdf.Page to modify.

  • handler – A HandlerRegistry instance containing the registered operator callbacks.

  • options – Configuration options. If None, defaults are used.

Returns:

The page is modified in-place.

Return type:

None

class pdfbeaver.ProcessingOptions(optimize: bool = True, recurse_xobjects: bool = True, tracker_class: Type[StateTracker] = StateTracker, tracker_args: Tuple = <factory>, tracker_kwargs: Dict[str, Any] = <factory>, visited_streams: Set[int] = <factory>)

Configuration options for the stream modification process.

optimize: bool = True

If True, runs a peephole optimizer on the output stream to remove dead stores and consolidate arithmetic (e.g., combining absolute text matrices into relative moves). Defaults to True.
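
To give a feel for what a peephole pass does, the sketch below merges adjacent Td (relative text move) operators into one. It only illustrates the general technique: merge_td and the (operands, operator) tuple layout are hypothetical, and the rules pdfbeaver's optimizer actually applies are not documented here.

```python
from typing import Any, List, Tuple

Instruction = Tuple[List[Any], str]  # (operands, operator); layout assumed for this sketch


def merge_td(instructions: List[Instruction]) -> List[Instruction]:
    """Collapse runs of consecutive Td operators into a single Td."""
    out: List[Instruction] = []
    for operands, op in instructions:
        if op == "Td" and out and out[-1][1] == "Td":
            # Two translations in a row compose to the sum of their offsets.
            prev_operands, _ = out.pop()
            merged = [prev_operands[0] + operands[0], prev_operands[1] + operands[1]]
            out.append((merged, "Td"))
        else:
            out.append((list(operands), op))
    return out


stream = [([1, 2], "Td"), ([3, 4], "Td"), (["Hi"], "Tj")]
print(merge_td(stream))
```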

recurse_xobjects: bool = True

If True, recursively descends into and modifies Form XObjects found in the page resources. Defaults to True.

tracker_args: Tuple

Positional arguments passed to the tracker_class constructor.

tracker_class

alias of StateTracker

tracker_kwargs: Dict[str, Any]

Keyword arguments passed to the tracker_class constructor.
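
The three tracker fields work together: the class is instantiated with the stored positional and keyword arguments. The sketch below uses stand-in classes (DemoTracker and DemoOptions are hypothetical, not part of pdfbeaver) and assumes the usual tracker_class(*tracker_args, **tracker_kwargs) construction pattern.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple, Type


class DemoTracker:
    """Stand-in for a StateTracker subclass; not part of pdfbeaver."""

    def __init__(self, label: str, verbose: bool = False) -> None:
        self.label = label
        self.verbose = verbose


@dataclass
class DemoOptions:
    """Stand-in mirroring the three tracker fields of ProcessingOptions."""

    tracker_class: Type[Any] = DemoTracker
    tracker_args: Tuple = field(default_factory=tuple)
    tracker_kwargs: Dict[str, Any] = field(default_factory=dict)


opts = DemoOptions(tracker_args=("page-0",), tracker_kwargs={"verbose": True})
# Assumed construction pattern: positional arguments first, then keyword arguments.
tracker = opts.tracker_class(*opts.tracker_args, **opts.tracker_kwargs)
print(tracker.label, tracker.verbose)
```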

visited_streams: Set[int]

Internal set used to prevent infinite recursion in malformed PDFs with cyclic XObject references.
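
The role of a visited set in breaking reference cycles can be sketched as follows. The FORMS mapping and the walk function are hypothetical stand-ins for XObject references and the recursive descent; they are not taken from pdfbeaver.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for XObject references: stream id -> ids of nested forms.
FORMS: Dict[int, List[int]] = {1: [2, 3], 2: [3], 3: [1]}  # 3 -> 1 closes a cycle


def walk(stream_id: int, visited: Set[int], order: List[int]) -> None:
    """Visit each stream once, skipping any id already recorded in `visited`."""
    if stream_id in visited:
        return  # already processed: stop instead of recursing forever
    visited.add(stream_id)
    order.append(stream_id)
    for child in FORMS.get(stream_id, []):
        walk(child, visited, order)


order: List[int] = []
walk(1, set(), order)
print(order)
```

Without the membership check, the 3 -> 1 reference would send the walk back to the start indefinitely.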

Registry & Handlers

class pdfbeaver.HandlerRegistry

A user-friendly registry for stream handlers. Implements the StreamHandler protocol (consumed by StreamEditor) but allows function-based registration with flexible signatures.

handle_operator(op: str, operands: List[bool | int | float | Decimal | Name | String | bytes | Array | Dictionary | PdfInlineImage | None], context: StreamContext, raw_bytes: bytes) → List[Tuple[List[bool | int | float | Decimal | Name | String | bytes | Array | Dictionary | PdfInlineImage | None], Operator] | bytes]

Standard entry point called by StreamEditor.

property modified_operators: Set[str]

Returns the set of operators registered for interception.

register(*ops: str)

Decorator to register a function for specific operators.

The decorated function can accept any combination of the following arguments (detected by name):

  • args or arguments or operands: List[NormalizedOperand]

  • context: StreamContext

  • raw_bytes: bytes

  • op or operator: str

  • pdf: pikepdf.Pdf

  • page: pikepdf.Page

Example

@registry.register("Tj", "TJ")
def my_handler(operands, context):
    ...

Parameters:

*ops – One or more operator strings (e.g., "Tj", "Do", "re") to intercept.
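
Detecting handler arguments by name can be done with inspect.signature. The following is a minimal sketch of that technique, not the registry's actual code: ALIASES and call_by_name are hypothetical names, and raising TypeError for an unknown parameter is an assumption.

```python
import inspect
from typing import Any, Callable, Dict

# Accepted parameter names mapped to the value each one receives (aliases share a key).
ALIASES = {
    "args": "operands", "arguments": "operands", "operands": "operands",
    "context": "context", "raw_bytes": "raw_bytes",
    "op": "op", "operator": "op", "pdf": "pdf", "page": "page",
}


def call_by_name(func: Callable[..., Any], available: Dict[str, Any]) -> Any:
    """Call `func` with only the arguments its signature names."""
    kwargs = {}
    for name in inspect.signature(func).parameters:
        if name not in ALIASES:
            raise TypeError(f"unsupported handler parameter: {name}")
        kwargs[name] = available[ALIASES[name]]
    return func(**kwargs)


def my_handler(operands, op):
    # This handler asks only for the operands and the operator name.
    return (op, len(operands))


values = {"operands": ["Hello"], "context": None, "raw_bytes": b"(Hello) Tj",
          "op": "Tj", "pdf": None, "page": None}
result = call_by_name(my_handler, values)
print(result)
```

A handler written this way can request as few or as many of the documented names as it needs, in any order.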

State Tracking

class pdfbeaver.StateTracker

Tracks the current transformation matrix (CTM) of the graphics state and the text matrices.

This tracker acts as a bridge between the underlying pdfminer state machine and the pdfbeaver context. It ingests snapshots of the state provided by the iterator and makes them accessible in a clean, pythonic format.

get_current_user_pos() → ndarray

Returns the (x, y) position of the cursor in User Space.

Calculated as: Origin(0,0) × Tm × CTM.

Returns:

A 3-element vector [x, y, 1] representing the cursor position.

Return type:

np.ndarray
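
A worked instance of Origin × Tm × CTM may help. The sketch below assumes the row-vector convention of the PDF specification, where a matrix [a b c d e f] is written as the 3x3 rows [a, b, 0], [c, d, 0], [e, f, 1]. It uses plain lists instead of NumPy so that it runs without dependencies, and matmul is a hypothetical helper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of `a` times columns of `b`)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Tm translates the text origin by (10, 20).
tm = [[1, 0, 0], [0, 1, 0], [10, 20, 1]]
# CTM scales by 2 and translates by (5, 5).
ctm = [[2, 0, 0], [0, 2, 0], [5, 5, 1]]

origin = [[0, 0, 1]]  # homogeneous row vector for the point (0, 0)
pos = matmul(matmul(origin, tm), ctm)[0]
print(pos)  # [25, 45, 1]
```

The text origin first moves to (10, 20) in text space, and the CTM then maps it to (25, 45) in user space.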

get_matrices() → Tuple[ndarray, ndarray]

Calculates the effective transformation matrices.

Returns:

A tuple containing:
  1. The CTM (3x3 numpy array).

  2. The text rendering matrix (CTM × Tm) (3x3 numpy array).

Return type:

Tuple[np.ndarray, np.ndarray]

get_snapshot() → Dict[str, Any]

Returns a snapshot of the current state.

set_state(state: Dict[str, Any])

Updates the internal state to match the snapshot provided by the iterator. This is the "passive tracking" model: the tracker does not interpret operators itself but trusts the state computed by the engine (pdfminer).

class pdfbeaver.StreamContext(pre_input: Dict[str, Any] | None, post_input: Dict[str, Any] | None, pdf: Pdf | None = None, page: Page | None = None, container: Object | None = None, tracker: Any = None)

Context passed to handlers during stream processing.

pre_input

State snapshot before the current operator ran.

Type:

Dict[str, Any]

post_input

State snapshot after the current operator ran.

Type:

Dict[str, Any]

tracker

Reference to the active state tracker instance.

Type:

StateTracker

editor

Reference to the parent editor instance.

Type:

StreamEditor

Utilities

pdfbeaver.extract_text_position(state: Dict[str, Any]) → ndarray

Calculates the absolute (x, y, 1) position from the graphics state. The state must contain the 'tstate' (text state) and 'ctm' (current transformation matrix) entries. Accepts both a list-based CTM (as produced by pdfminer) and a NumPy-based CTM.
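
Accepting two CTM representations amounts to normalizing them to one shape first. The sketch below shows one way to do that with plain sequences: to_3x3 is a hypothetical helper, the flat form is the six-number [a, b, c, d, e, f] matrix that pdfminer uses, and the 3x3 row layout is an assumption based on the PDF row-vector convention.

```python
from typing import List, Sequence


def to_3x3(ctm: Sequence) -> List[List[float]]:
    """Accept a flat [a, b, c, d, e, f] sequence or an already 3x3 nested sequence."""
    if len(ctm) == 6:
        a, b, c, d, e, f = ctm
        # Expand the six PDF matrix entries into homogeneous 3x3 form.
        return [[a, b, 0], [c, d, 0], [e, f, 1]]
    if len(ctm) == 3 and all(len(row) == 3 for row in ctm):
        return [list(row) for row in ctm]
    raise ValueError("expected 6 numbers or a 3x3 matrix")


print(to_3x3((1, 0, 0, 1, 5, 7)))
print(to_3x3([[1, 0, 0], [0, 1, 0], [5, 7, 1]]))
```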