API Reference

Core Functionality

High-level entry point to modify PDF content.

Parameters:

pdf – The pikepdf.Pdf object to process.
options – Configuration options.
registry – The HandlerRegistry to use. Defaults to the global default_registry.
pages – The pages to process. Can be a single integer (0-indexed), a single Page object, a list of integers/Pages, or None (processes all pages).
page – Alias for pages (kept for backward compatibility).

Raises:

TypeError – If pdf is not a pikepdf object or pages contains invalid types.
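
The accepted shapes of the pages argument can be illustrated with a small normalization routine. This is only a sketch: normalize_page_indices is a hypothetical helper, not part of pdfbeaver, and it covers the integer, list, and None cases while leaving Page objects out so that it runs without pikepdf.

```python
from typing import List, Optional, Sequence, Union

PageSpec = Union[int, Sequence[int], None]


def normalize_page_indices(pages: PageSpec, page_count: int) -> List[int]:
    """Hypothetical helper: expand a `pages` argument into 0-based indices."""
    if pages is None:
        # None means "process every page".
        return list(range(page_count))
    if isinstance(pages, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("pages must be an int, a list of ints, or None")
    if isinstance(pages, int):
        return [pages]
    if isinstance(pages, (list, tuple)):
        for item in pages:
            if isinstance(item, bool) or not isinstance(item, int):
                raise TypeError(f"invalid page entry: {item!r}")
        return list(pages)
    raise TypeError(f"unsupported pages value: {pages!r}")


all_pages = normalize_page_indices(None, 3)
single_page = normalize_page_indices(1, 3)
some_pages = normalize_page_indices([0, 2], 3)
```

Invalid inputs such as a string raise TypeError, mirroring the documented behavior.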

pdfbeaver.modify_page(pdf: Pdf, page: Page, handler: HandlerRegistry, options: ProcessingOptions | None = None) → None

Modifies a PDF page and (optionally) its Form XObjects in-place.

This function parses the page’s content stream, tracks the graphics and text state, and applies the user-defined logic from the handler registry.

Parameters:

pdf – The owning pikepdf.Pdf document. Required to create new stream objects when writing back modified content.
page – The pikepdf.Page to modify.
handler – A HandlerRegistry instance containing the registered operator callbacks.
options – Configuration options. If None, defaults are used.

Returns:

The page is modified in-place.

Return type:

None

class pdfbeaver.ProcessingOptions(optimize: bool = True, recurse_xobjects: bool = True, tracker_class: Type[StateTracker] = StateTracker, tracker_args: Tuple = <factory>, tracker_kwargs: Dict[str, Any] = <factory>, visited_streams: Set[int] = <factory>)

Configuration options for the stream modification process.

optimize: bool = True: If True, runs a peephole optimizer on the output stream to remove dead stores and consolidate arithmetic (e.g., combining absolute text matrices into relative moves). Defaults to True.

recurse_xobjects: bool = True: If True, recursively descends into and modifies Form XObjects found in the page resources. Defaults to True.

tracker_args: Tuple: Positional arguments passed to the tracker_class constructor.

tracker_class: alias of StateTracker

tracker_kwargs: Dict[str, Any]: Keyword arguments passed to the tracker_class constructor.

visited_streams: Set[int]: Internal set used to prevent infinite recursion in malformed PDFs with cyclic XObject references.
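
The role of a visited set in stopping cyclic recursion can be sketched as follows. This is not pdfbeaver's implementation: walk_streams and the integer stream identifiers are hypothetical stand-ins, and the dictionary plays the part of XObject references between streams.

```python
from typing import Dict, List, Set


def walk_streams(stream_id: int,
                 children: Dict[int, List[int]],
                 visited: Set[int]) -> List[int]:
    """Hypothetical traversal: visit each stream once, even if references cycle."""
    if stream_id in visited:
        # Already processed: stop here instead of recursing forever.
        return []
    visited.add(stream_id)
    order = [stream_id]
    for child in children.get(stream_id, []):
        order.extend(walk_streams(child, children, visited))
    return order


# Stream 3 refers back to stream 1, forming a cycle.
cyclic_refs = {1: [2], 2: [3], 3: [1]}
visit_order = walk_streams(1, cyclic_refs, set())
```

Without the membership check at the top, the traversal above would recurse until Python's recursion limit is hit.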

Registry & Handlers

class pdfbeaver.HandlerRegistry

A user-friendly registry for stream handlers. Implements the StreamHandler protocol (consumed by StreamEditor) but allows function-based registration with flexible signatures.

handle_operator(op: str, operands: List[bool | int | float | Decimal | Name | String | bytes | Array | Dictionary | PdfInlineImage | None], context: StreamContext, raw_bytes: bytes) → List[Tuple[List[bool | int | float | Decimal | Name | String | bytes | Array | Dictionary | PdfInlineImage | None], Operator] | bytes]: Standard entry point called by StreamEditor.

property modified_operators: Set[str]: Returns the set of operators registered for interception.

register(*ops: str)

Decorator to register a function for specific operators.

The decorated function can accept any combination of the following arguments (detected by name):

args or arguments or operands: List[NormalizedOperand]
context: StreamContext
raw_bytes: bytes
op or operator: str
pdf: pikepdf.Pdf
page: pikepdf.Page

Example

@registry.register("Tj", "TJ")
def my_handler(operands, context):
    ...

Parameters: *ops – One or more operator strings (e.g., "Tj", "Do", "re") to intercept.
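
Detection of handler arguments by name can be approximated with inspect.signature. The sketch below is an assumption about how such dispatch might work, not pdfbeaver's actual code; call_with_named_args and the contents of the available dictionary are hypothetical.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_named_args(func: Callable[..., Any],
                         available: Dict[str, Any]) -> Any:
    """Hypothetical dispatcher: pass only the arguments the function names."""
    params = inspect.signature(func).parameters
    kwargs = {name: available[name] for name in params if name in available}
    return func(**kwargs)


def my_handler(operands, op):
    # This handler asks for just two of the supported argument names.
    return f"{op}:{len(operands)}"


operands = ["Hello"]
available = {
    # Aliases point at the same underlying value.
    "args": operands, "arguments": operands, "operands": operands,
    "op": "Tj", "operator": "Tj",
    "context": None,
    "raw_bytes": b"(Hello) Tj",
}
result = call_with_named_args(my_handler, available)
```

A handler written as def other(raw_bytes) would receive only raw_bytes under the same scheme.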

State Tracking

class pdfbeaver.StateTracker

Tracks the current transformation matrix (CTM) of the graphics state and the text matrices.

This tracker acts as a bridge between the underlying pdfminer state machine and the pdfbeaver context. It ingests snapshots of the state provided by the iterator and makes them accessible in a clean, pythonic format.

get_current_user_pos() → ndarray

Returns the (x, y) position of the cursor in User Space.

Calculated as: Origin(0,0) x Tm x CTM.

Returns: A 3-element vector [x, y, 1] representing the cursor position.
Return type: np.ndarray
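
The product Origin(0,0) x Tm x CTM can be worked through by hand. The sketch below uses plain nested lists instead of numpy so that it has no dependencies, and it assumes the row-vector layout of the PDF specification, where the translation sits in the third row; whether pdfbeaver lays out its arrays this way is an assumption. The helper names and matrix values are illustrative only.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform_origin(tm: Matrix, ctm: Matrix) -> List[float]:
    """Compute [0, 0, 1] x Tm x CTM using row vectors."""
    combined = matmul(tm, ctm)
    origin = [0.0, 0.0, 1.0]
    return [sum(origin[k] * combined[k][j] for k in range(3)) for j in range(3)]


# Text matrix: translate by (72, 700).
tm = [[1, 0, 0], [0, 1, 0], [72, 700, 1]]
# CTM: scale by 2, then translate by (10, 20).
ctm = [[2, 0, 0], [0, 2, 0], [10, 20, 1]]

pos = transform_origin(tm, ctm)  # [154.0, 1420.0, 1.0]
```

The text origin (72, 700) is scaled to (144, 1400) and then shifted by (10, 20), giving (154, 1420) in user space.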

get_matrices() → Tuple[ndarray, ndarray]

Calculates the effective transformation matrices.

Returns:

A tuple containing:

The CTM (3x3 numpy array).
The Text Render Matrix (CTM x TM) (3x3 numpy array).

Return type:

Tuple[np.ndarray, np.ndarray]

get_snapshot() → Dict[str, Any]: Returns a snapshot of the current state.

set_state(state: Dict[str, Any]): Updates the internal state to match the snapshot provided by the iterator. This is the ‘Passive Tracking’ model: the tracker does not compute state itself but trusts the values reported by the engine (pdfminer).

StreamContext: Context passed to handlers during stream processing. It exposes the following attributes.

pre_input

State snapshot before the current operator ran.

Type: Dict[str, Any]

post_input

State snapshot after the current operator ran.

Type: Dict[str, Any]

tracker

Reference to the active state tracker instance.

Type: StateTracker

editor

Reference to the parent editor instance.

Type: StreamEditor
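
Because pre_input and post_input are both plain dictionaries, a handler can compare them to see what an operator changed. The following is a sketch only: changed_keys is a hypothetical helper, the "font_size" key is made up for illustration, and the comparison assumes plain Python values (numpy arrays would need an element-wise comparison instead of !=).

```python
from typing import Any, Dict, Tuple


def changed_keys(pre: Dict[str, Any],
                 post: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Hypothetical helper: map each changed key to its (before, after) pair."""
    keys = set(pre) | set(post)
    return {
        key: (pre.get(key), post.get(key))
        for key in sorted(keys)
        if pre.get(key) != post.get(key)
    }


# Illustrative snapshots; real snapshots may use different keys.
pre_input = {"ctm": (1, 0, 0, 1, 0, 0), "font_size": 12}
post_input = {"ctm": (1, 0, 0, 1, 0, 0), "font_size": 14}

diff = changed_keys(pre_input, post_input)
```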

Utilities

pdfbeaver.extract_text_position(state: Dict[str, Any]) → ndarray: Calculates the absolute (x, y, 1) position from the graphics state. Requires ‘tstate’ (Text State) and ‘ctm’ (Current Transformation Matrix). Handles both List-based CTM (pdfminer) and Numpy-based CTM.
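
Handling both CTM representations amounts to promoting the six-number form [a, b, c, d, e, f] to a 3x3 matrix before multiplying. The sketch below shows one way this could be done; ctm_to_3x3 is a hypothetical helper rather than a pdfbeaver function, and it uses the row-vector layout of the PDF specification.

```python
from typing import List, Sequence


def ctm_to_3x3(ctm: Sequence) -> List[List[float]]:
    """Hypothetical helper: accept a 6-element CTM or a 3x3 matrix."""
    if len(ctm) == 6 and not isinstance(ctm[0], (list, tuple)):
        # Flat form [a, b, c, d, e, f]; e and f are the translation.
        a, b, c, d, e, f = ctm
        return [[a, b, 0.0], [c, d, 0.0], [e, f, 1.0]]
    # Already matrix-shaped: copy the rows into plain lists.
    return [list(row) for row in ctm]


from_flat = ctm_to_3x3((1, 0, 0, 1, 5, 7))
from_matrix = ctm_to_3x3([[1, 0, 0], [0, 1, 0], [5, 7, 1]])
```

Both calls yield the same matrix, so later code can treat the CTM uniformly.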