Perturbation operators

Each pair uses exactly one operator, applied to a copy of one file. The operator runs with a seed derived per permutation, so its output is deterministic across re-runs (the controlled-variable invariant). Running traxr operators prints this catalog from the installed package.
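As an illustration of the determinism property, here is a minimal sketch of one way a per-permutation seed could be derived and used. The derivation scheme, the function names derive_seed and pick_rows, and the example file name are assumptions made for this sketch; they are not taken from traxr itself.

```python
import hashlib
import random


def derive_seed(base_seed: int, file_name: str, operator: str) -> int:
    """Derive a stable seed for one permutation (hypothetical scheme, not traxr's)."""
    key = f"{base_seed}:{file_name}:{operator}".encode("utf-8")
    # A cryptographic hash is used instead of hash(), which is salted per process
    # for strings and would therefore change between runs.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


def pick_rows(n_rows: int, k: int, seed: int) -> list:
    """Choose k row indices using only the derived seed as the randomness source."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return sorted(rng.sample(range(n_rows), k))


seed_a = derive_seed(1234, "sales.csv", "row_duplicate")
seed_b = derive_seed(1234, "sales.csv", "unit_change")

# The same permutation always yields the same seed and the same choices.
assert seed_a == derive_seed(1234, "sales.csv", "row_duplicate")
assert pick_rows(100, 3, seed_a) == pick_rows(100, 3, seed_a)
```

Keeping the generator private to each operator call means that one perturbation cannot shift the random stream seen by another, which is what lets a single variable change between runs.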

Tabular (CSV / XLSX): any agent

Delivered by file round-trip: parse → perturb → re-serialize to disk.

operator             what it does
column_swap          swaps the values of two columns
label_corrupt        corrupts categorical labels
data_type_corrupt    breaks cell types (numbers → text, …)
row_duplicate        duplicates rows
irrelevant_columns   injects plausible distractor columns
unit_change          rescales numeric columns as if units changed
null_content         replaces the file with empty content
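The round-trip for a CSV file can be sketched as follows, using column_swap as the example operator. This is an assumption-laden illustration built on the standard library only: the function names column_swap and perturb_copy, and the choice to leave the header row in place, are hypothetical and may differ from what traxr does.

```python
import csv
import io
import random


def column_swap(csv_text: str, seed: int) -> str:
    """Parse CSV text, swap the values of two columns, and re-serialize it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or len(rows[0]) < 2:
        return csv_text  # fewer than two columns: nothing to swap
    header, body = rows[0], rows[1:]
    rng = random.Random(seed)
    i, j = rng.sample(range(len(header)), 2)  # two distinct column indices
    for row in body:
        if max(i, j) < len(row):  # leave ragged rows untouched
            row[i], row[j] = row[j], row[i]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows([header] + body)
    return out.getvalue()


def perturb_copy(src_path: str, dst_path: str, seed: int) -> None:
    """Read the source file, perturb it, and write the result to a separate copy."""
    with open(src_path, newline="", encoding="utf-8") as f:
        original = f.read()
    with open(dst_path, "w", newline="", encoding="utf-8") as f:
        f.write(column_swap(original, seed))
```

Because the perturbed data is written to a new path, the original file stays available as the unperturbed side of the comparison.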

Text (TXT / MD): any agent

Also delivered by file round-trip.

operator            what it does
ocr_noise           character-level OCR-style corruption
number_corruption   changes numeric values in running text
text_redaction      blacks out spans
paragraph_shuffle   reorders paragraphs
encoding_error      mojibake-style encoding damage
section_removal     deletes a section
null_content        empty file
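For a sense of what a text operator involves, the sketch below implements a simple paragraph_shuffle. It assumes paragraphs are separated by blank lines and that the result must differ from the input order; both are assumptions of this example rather than documented traxr behavior.

```python
import random


def paragraph_shuffle(text: str, seed: int) -> str:
    """Reorder blank-line-separated paragraphs deterministically for a given seed."""
    paragraphs = [p.strip("\n") for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return text  # a single paragraph cannot be reordered
    order = list(range(len(paragraphs)))
    shuffled = order[:]
    rng = random.Random(seed)
    # Reshuffle until the order actually changes, so the perturbation is never a no-op.
    while shuffled == order:
        rng.shuffle(shuffled)
    result = "\n\n".join(paragraphs[k] for k in shuffled)
    # Preserve a trailing newline if the input had one.
    return result + "\n" if text.endswith("\n") else result
```

The paragraphs themselves are left intact; only their order changes, so any difference in an agent's answer can be attributed to ordering alone.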

PDF: any agent

Delivered by surgical in-place edits (PyMuPDF): the perturbed PDF on disk differs only at the edit loci, preserving extraction fidelity everywhere else. Visual fidelity is not guaranteed at edit spans; agents doing layout/visual analysis may notice the reinserted font.

operator            what it does
number_corruption   edits numbers in place
text_redaction      redacts spans in place
section_removal     removes a contiguous block
page_removal        drops a page
page_shuffle        reorders pages
null_content        blank document
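As a rough sketch of a page-level PDF operator, the example below drops one page with PyMuPDF and saves the result as a copy. The function names pick_page_to_drop and page_removal are hypothetical, and the page selection rule is an assumption; only the PyMuPDF calls (fitz.open, page_count, delete_page, save) are real library API.

```python
import random


def pick_page_to_drop(page_count: int, seed: int) -> int:
    """Choose the 0-based index of the page to remove, deterministically per seed."""
    if page_count < 2:
        raise ValueError("need at least two pages to drop one")
    return random.Random(seed).randrange(page_count)


def page_removal(src_path: str, dst_path: str, seed: int) -> int:
    """Write a copy of the PDF with one page removed; return the dropped index."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    doc = fitz.open(src_path)
    try:
        dropped = pick_page_to_drop(doc.page_count, seed)
        doc.delete_page(dropped)  # all other pages are left as they were
        doc.save(dst_path)  # the source file on disk is not modified
    finally:
        doc.close()
    return dropped
```

Separating the index choice from the file edit keeps the random decision easy to log and replay independently of the PDF library.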

PDF: built-in agent only

Whole-text-flow operators (ocr_noise, paragraph_shuffle, encoding_error) cannot be applied surgically to a PDF. For the built-in reference agent they are delivered by content injection: the perturbed extracted text is handed to the agent's PDF tool instead of the file being modified.
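Content injection can be pictured with the sketch below: the PDF on disk is left alone, and the agent's PDF-reading tool returns pre-perturbed text for the targeted file. Everything here is hypothetical, including the names make_pdf_tool, fake_extract, and ocr_noise, the confusion table, and the shape of the tool interface; the built-in agent's real interface is not described in this section.

```python
import random
from typing import Callable, Dict

# A small illustrative table of visually confusable characters (an assumption).
CONFUSIONS = {"0": "O", "1": "l", "5": "S", "l": "1", "O": "0", "S": "5"}


def ocr_noise(text: str, seed: int, rate: float = 0.3) -> str:
    """Replace confusable characters with their look-alikes at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def make_pdf_tool(
    extract: Callable[[str], str], injected: Dict[str, str]
) -> Callable[[str], str]:
    """Wrap a text extractor so that selected paths return injected text instead."""

    def read_pdf(path: str) -> str:
        if path in injected:
            return injected[path]  # perturbed text; the file itself is unchanged
        return extract(path)  # every other file goes through normal extraction

    return read_pdf


def fake_extract(path: str) -> str:
    """Stand-in for real PDF text extraction, used only in this example."""
    return "Invoice 1050, net 515"


clean = fake_extract("report.pdf")
tool = make_pdf_tool(fake_extract, {"report.pdf": ocr_noise(clean, seed=7, rate=1.0)})
```

With rate=1.0 every confusable character is replaced, so tool("report.pdf") returns "Invoice lOSO, net SlS" while any other path still yields the clean extraction.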