API Reference
Core Functionality
- pdfbeaver.process(pdf: Pdf, options: ProcessingOptions | None = None, registry: HandlerRegistry | None = None, pages: None | int | Page | List[int | Page] = None, page: None | int | Page = None) -> None
High-level entry point to modify PDF content.
- Parameters:
pdf – The pikepdf.Pdf object to process.
options – Configuration options.
registry – The HandlerRegistry to use. Defaults to the global default_registry.
pages – The pages to process. Can be a single integer (0-indexed), a single Page object, a list of integers/Pages, or None (processes all pages).
page – Alias for pages (kept for backward compatibility).
- Raises:
TypeError – If pdf is not a pikepdf object or pages contains invalid types.
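The accepted forms of the pages argument can be illustrated with a minimal sketch. The helper normalize_page_indices below is hypothetical and not part of pdfbeaver; it handles only integer inputs (Page objects are left out to keep the example free of pikepdf), and it only shows one plausible way a single integer, a list, or None could be expanded into a list of 0-based indices.

```python
from typing import List, Sequence, Union

# Simplified stand-in for the documented `pages` type (Page objects omitted).
PageSpec = Union[None, int, Sequence[int]]


def normalize_page_indices(pages: PageSpec, page_count: int) -> List[int]:
    """Hypothetical helper: expand `pages` into a list of 0-based indices."""
    if pages is None:
        # None means "process every page".
        return list(range(page_count))
    if isinstance(pages, int):
        # A single 0-based index becomes a one-element list.
        return [pages]
    if isinstance(pages, (list, tuple)):
        for item in pages:
            if not isinstance(item, int):
                raise TypeError(f"invalid page entry: {item!r}")
        return list(pages)
    # Anything else mirrors the documented TypeError for invalid types.
    raise TypeError(f"invalid pages argument: {pages!r}")


all_pages = normalize_page_indices(None, 3)
one_page = normalize_page_indices(1, 3)
some_pages = normalize_page_indices([0, 2], 3)
```

The TypeError branch mirrors the documented behaviour for invalid types; how the library treats out-of-range indices is not stated here, so the sketch does not check them.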
- pdfbeaver.modify_page(pdf: Pdf, page: Page, handler: HandlerRegistry, options: ProcessingOptions | None = None) -> None
Modifies a PDF page and (optionally) its Form XObjects in-place.
This function parses the page’s content stream, tracks the graphics and text state, and applies the user-defined logic from the handler registry.
- Parameters:
pdf – The owning pikepdf.Pdf document. Required to create new stream objects when writing back modified content.
page – The pikepdf.Page to modify.
handler – A HandlerRegistry instance containing the registered operator callbacks.
options – Configuration options. If None, defaults are used.
- Returns:
The page is modified in-place.
- Return type:
None
- class pdfbeaver.ProcessingOptions(optimize: bool = True, recurse_xobjects: bool = True, tracker_class: Type[StateTracker] = StateTracker, tracker_args: Tuple = <factory>, tracker_kwargs: Dict[str, Any] = <factory>, visited_streams: Set[int] = <factory>)
Configuration options for the stream modification process.
- optimize: bool = True
If True, runs a peephole optimizer on the output stream to remove dead stores and consolidate arithmetic (e.g., combining absolute text matrices into relative moves). Defaults to True.
- recurse_xobjects: bool = True
If True, recursively descends into and modifies Form XObjects found in the page resources. Defaults to True.
- tracker_args: Tuple
Positional arguments passed to the tracker_class constructor.
- tracker_class
alias of StateTracker
- tracker_kwargs: Dict[str, Any]
Keyword arguments passed to the tracker_class constructor.
- visited_streams: Set[int]
Internal set used to prevent infinite recursion in malformed PDFs with cyclic XObject references.
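The role of visited_streams can be sketched with a small, self-contained traversal. Everything below is illustrative: walk_xobjects and the integer-keyed graph are hypothetical stand-ins, and the sketch assumes streams are identified by integers, as the Set[int] annotation suggests. It is not the library's traversal code.

```python
from typing import Dict, List, Set


def walk_xobjects(
    stream_id: int,
    children: Dict[int, List[int]],
    visited: Set[int],
) -> List[int]:
    """Hypothetical depth-first walk that visits each stream at most once."""
    if stream_id in visited:
        # Already processed: stop here so a cyclic reference cannot recurse forever.
        return []
    visited.add(stream_id)
    order = [stream_id]
    for child in children.get(stream_id, []):
        order.extend(walk_xobjects(child, children, visited))
    return order


# Stream 3 refers back to stream 1, which forms a cycle.
graph = {1: [2, 3], 2: [3], 3: [1]}
order = walk_xobjects(1, graph, set())
```

Sharing one set across the whole traversal also means a Form XObject referenced from several places is handled only once in this sketch.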
Registry & Handlers
- class pdfbeaver.HandlerRegistry
A user-friendly registry for stream handlers. Implements the StreamHandler protocol (consumed by StreamEditor) but allows function-based registration with flexible signatures.
- handle_operator(op: str, operands: List[bool | int | float | Decimal | Name | String | bytes | Array | Dictionary | PdfInlineImage | None], context: StreamContext, raw_bytes: bytes) -> List[Tuple[List[bool | int | float | Decimal | Name | String | bytes | Array | Dictionary | PdfInlineImage | None], Operator] | bytes]
Standard entry point called by StreamEditor.
- property modified_operators: Set[str]
Returns the set of operators registered for interception.
- register(*ops: str)
Decorator to register a function for specific operators.
The decorated function can accept any combination of the following arguments (detected by name):
args or arguments or operands: List[NormalizedOperand]
context: StreamContext
raw_bytes: bytes
op or operator: str
pdf: pikepdf.Pdf
page: pikepdf.Page
Example
@registry.register("Tj", "TJ")
def my_handler(operands, context): ...
- Parameters:
*ops – One or more operator strings (e.g., "Tj", "Do", "re") to intercept.
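Detection of handler arguments by name can be sketched with inspect.signature from the standard library. MiniRegistry, its dispatch method, and the ALIASES table below are hypothetical and written only for illustration; the accepted parameter names are taken from the list above, but how HandlerRegistry actually implements the lookup is an assumption.

```python
import inspect
from typing import Any, Callable, Dict, Set

# Maps each accepted parameter name to one canonical value name.
ALIASES = {
    "args": "operands", "arguments": "operands", "operands": "operands",
    "context": "context", "raw_bytes": "raw_bytes",
    "op": "op", "operator": "op",
    "pdf": "pdf", "page": "page",
}


class MiniRegistry:
    """Hypothetical stand-in, not pdfbeaver.HandlerRegistry."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, *ops: str):
        def decorator(func):
            params = list(inspect.signature(func).parameters)
            unknown = [p for p in params if p not in ALIASES]
            if unknown:
                raise TypeError(f"unsupported handler arguments: {unknown}")

            def adapter(**available):
                # Pass only the values the handler asked for, under its own names.
                return func(**{p: available[ALIASES[p]] for p in params})

            for op in ops:
                self._handlers[op] = adapter
            return func

        return decorator

    @property
    def modified_operators(self) -> Set[str]:
        return set(self._handlers)

    def dispatch(self, op, operands, context=None, raw_bytes=b"", pdf=None, page=None):
        return self._handlers[op](
            op=op, operands=operands, context=context,
            raw_bytes=raw_bytes, pdf=pdf, page=page,
        )


registry = MiniRegistry()


@registry.register("Tj", "TJ")
def my_handler(operands, op):
    return f"{op}:{len(operands)}"


result = registry.dispatch("Tj", ["Hello"])
```

Resolving the parameter names once at registration time, rather than on every operator, keeps the per-call cost in this sketch to a small dictionary build.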
State Tracking
- class pdfbeaver.StateTracker
State tracker. Tracks the CTM (Graphics) and Text Matrices.
This tracker acts as a bridge between the underlying pdfminer state machine and the pdfbeaver context. It ingests snapshots of the state provided by the iterator and makes them accessible in a clean, pythonic format.
- get_current_user_pos() -> ndarray
Returns the (x, y) position of the cursor in User Space.
Calculated as: Origin(0,0) x Tm x CTM.
- Returns:
A 3-element vector [x, y, 1] representing the cursor position.
- Return type:
np.ndarray
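The product Origin(0,0) x Tm x CTM can be worked through with plain Python lists. This sketch assumes the PDF row-vector convention, in which a matrix written as [a b c d e f] expands to the 3x3 rows [a b 0], [c d 0], [e f 1]. The helpers pdf_matrix, vec_mat, and user_pos are hypothetical, and lists are used in place of the np.ndarray that the library returns.

```python
from typing import List

Matrix = List[List[float]]


def pdf_matrix(a: float, b: float, c: float, d: float, e: float, f: float) -> Matrix:
    """Expand the six PDF matrix operands into a 3x3 row-vector matrix."""
    return [[a, b, 0.0], [c, d, 0.0], [e, f, 1.0]]


def vec_mat(v: List[float], m: Matrix) -> List[float]:
    """Multiply a row vector by a 3x3 matrix (v x m)."""
    return [sum(v[k] * m[k][j] for k in range(3)) for j in range(3)]


def user_pos(tm: Matrix, ctm: Matrix) -> List[float]:
    """Origin x Tm x CTM, returned as [x, y, 1]."""
    origin = [0.0, 0.0, 1.0]
    return vec_mat(vec_mat(origin, tm), ctm)


tm = pdf_matrix(1, 0, 0, 1, 10, 20)  # text matrix: translate by (10, 20)
ctm = pdf_matrix(2, 0, 0, 2, 5, 5)   # CTM: scale by 2, then translate by (5, 5)
position = user_pos(tm, ctm)         # [25.0, 45.0, 1.0]
```

The text origin first moves to (10, 20) in text space; the CTM then doubles it to (20, 40) and shifts it by (5, 5), giving (25, 45).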
- class pdfbeaver.StreamContext(pre_input: Dict[str, Any] | None, post_input: Dict[str, Any] | None, pdf: Pdf | None = None, page: Page | None = None, container: Object | None = None, tracker: Any = None)
Context passed to handlers during stream processing.
- pre_input
State snapshot before the current operator ran.
- Type:
Dict[str, Any]
- post_input
State snapshot after the current operator ran.
- Type:
Dict[str, Any]
- tracker
Reference to the active state tracker instance.
- Type:
- editor
Reference to the parent editor instance.
- Type:
StreamEditor
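One way a handler might use the two snapshots is to compare them and see what the current operator changed. The sketch below is an illustration under stated assumptions: MiniContext is a hypothetical stand-in that carries only pre_input and post_input, and the snapshot keys font_size and fill are invented for the example, since the actual keys are not listed here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Set

_MISSING = object()  # marks a key that is absent from one of the snapshots


@dataclass
class MiniContext:
    """Hypothetical stand-in exposing only the two documented snapshots."""
    pre_input: Optional[Dict[str, Any]]
    post_input: Optional[Dict[str, Any]]


def changed_keys(context: MiniContext) -> Set[str]:
    """Return the snapshot keys whose values differ before and after the operator."""
    before = context.pre_input or {}
    after = context.post_input or {}
    return {
        key
        for key in before.keys() | after.keys()
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    }


# The snapshot keys below are made up for the example.
ctx = MiniContext(
    pre_input={"font_size": 10, "fill": "black"},
    post_input={"font_size": 12, "fill": "black"},
)
changed = changed_keys(ctx)
```

Treating a None snapshot as an empty dictionary keeps the comparison usable even when one side of the state is unavailable.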