Flat-to-Fillable PDF Conversion

Transform scanned documents and non-interactive PDFs into fully fillable AcroForm PDFs using a dual-detector geometry pipeline and LLM-based false-positive removal

Overview

Many widely used forms — including CMS-1500 health insurance claims, the USCIS I-485 adjustment of status application, Fannie Mae 1003 mortgage applications, and state court pleading templates — are distributed only as flat PDFs with no AcroForm field definitions. Instafill.ai converts these documents into interactive fillable PDFs without requiring access to the original design files.

The conversion pipeline uses two detection engines in sequence. The primary detector, detect_boxes_fitz(), runs on PyMuPDF and locates table borders, underlines, and red-bordered boxes directly from the PDF vector data. When that detector finds insufficient fields, a fallback detector, detect_blanks(), runs on pdfplumber and identifies blank areas by analyzing whitespace patterns in the text layer. Checkboxes are handled separately by _find_checkboxes(), which uses pdfplumber to locate small square regions near label text. Detected field geometry is then passed through an 8-step normalization pipeline, false positives are removed by an LLM reviewing page renderings, and the final output is a standard AcroForm PDF alongside a JSON field list and an annotated review PDF.
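The primary-then-fallback arrangement described above can be sketched as a small dispatcher. This is only an illustration: the real signatures of detect_boxes_fitz() and detect_blanks() are not documented here, so the detectors are passed in as plain callables, and `min_fields` is a hypothetical threshold standing in for "insufficient fields". The sketch also assumes the fallback result replaces the primary result rather than being merged with it.

```python
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def detect_with_fallback(
    primary: Callable[[], List[Rect]],
    fallback: Callable[[], List[Rect]],
    min_fields: int,
) -> Tuple[str, List[Rect]]:
    """Run the primary detector; call the fallback only if too few fields come back.

    min_fields is an assumed threshold, not a documented pipeline parameter.
    """
    found = primary()
    if len(found) >= min_fields:
        return "primary", found
    # Assumption: the fallback output replaces the primary output entirely.
    return "fallback", fallback()


# Stand-in detectors for illustration only.
vector_hits = [(72.0, 100.0, 300.0, 114.0), (72.0, 130.0, 300.0, 144.0)]
blank_hits = [(72.0, 100.0, 300.0, 114.0)]

source, rects = detect_with_fallback(lambda: vector_hits, lambda: blank_hits, min_fields=2)
empty_source, empty_rects = detect_with_fallback(lambda: [], lambda: blank_hits, min_fields=2)
```

Passing the detectors as callables keeps the fallback lazy, so the second engine does no work when the first one already found enough fields.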

Key Capabilities

  • Dual-engine field detection: PyMuPDF primary (detect_boxes_fitz()) with pdfplumber fallback (detect_blanks()), so the system degrades gracefully when a PDF lacks vector borders
  • Dedicated checkbox detector: _find_checkboxes() via pdfplumber identifies checkbox regions independently from text-field detection
  • 8-step geometry pipeline: Synthesizes, carves, resizes, nudges, truncates, offsets, and filters raw field rectangles before writing AcroForm widgets
  • LLM false-positive removal: Pages rendered as images are passed to get_nonsense_fields_from_page(), which removes fields placed on decorative borders, logos, or instruction text
  • AcroForm output: Widgets written by write_widgets.py produce standard AcroForm fields compatible with Adobe Acrobat Reader, browser PDF viewers, and PDF automation tools
  • Annotated review PDF: A second PDF is generated with colored overlays on each detected field for human review before the form enters production
  • Structured field manifest: JSON output records {page, x0, y0, x1, y1, field_id} for every field, enabling downstream programmatic filling

How It Works

  1. Upload flat PDF: The source document — a scanned CMS-1500, a flat I-485, a printed 1003, or any image-based form — is uploaded and queued for processing.

  2. Primary detection with detect_boxes_fitz(): PyMuPDF parses the PDF vector stream, extracting table cell borders, horizontal underlines, and colored (e.g., red) bounding boxes that visually delimit fill areas. This works on digitally created flat PDFs without requiring OCR.

  3. Fallback detection with detect_blanks(): If the primary pass returns fewer than the expected field count, pdfplumber analyzes the text layer to locate contiguous blank spans — areas of whitespace bounded by label text on one or both sides. This handles scanned or image-only PDFs where no vector geometry is present.

  4. Checkbox detection with _find_checkboxes(): A separate pdfplumber pass identifies small square regions (typically 8–14 pt) that appear adjacent to yes/no labels, option lists, or boolean prompts. These are classified as CheckBox or RadioButton widgets.

  5. 8-step geometry normalization:

    • synthesize_fields_for_colon_labels() — creates field rectangles after colon-terminated labels where no border exists
    • carve_inline_fields_distance() — splits long detected regions at label boundaries within the same line
    • adjust_height_to_line_and_gap() — resizes field height to match the surrounding line height and inter-line gap
    • synthesize_fields_from_table_cells() — generates individual fields for each cell detected inside table structures
    • nudge_up_until_blank() — shifts field tops upward until whitespace is found, avoiding overlap with printed text
    • truncate_overlaps_left_to_right() — resolves horizontal overlaps between adjacent fields, with the left field taking precedence
    • offset_fields_y() — applies a global vertical offset calibrated to each PDF's coordinate system
    • drop_degenerate_fields() — removes zero-area, negative-dimension, or out-of-bounds rectangles
  6. LLM false-positive removal: Each page is rendered as a raster image and sent to get_nonsense_fields_from_page(). The LLM reviews the visual layout and returns IDs of fields that land on decorative elements, section titles, or header art rather than actual fill areas. Those fields are removed before widget creation.

  7. AcroForm widget creation: write_widgets.py writes each surviving field as an AcroForm widget at its normalized coordinates, assigning field type (Text, CheckBox, RadioButton), a generated field name derived from nearby label text, and a field_id that ties back to the JSON manifest.

  8. Output delivery: Three artifacts are produced — the fillable AcroForm PDF, the {page, x0, y0, x1, y1, field_id} JSON manifest, and the annotated review PDF with field overlays for human verification.
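Two of the normalization steps in step 5 lend themselves to a short sketch: resolving horizontal overlaps with left precedence, and dropping degenerate rectangles. The functions below are illustrative re-implementations with hypothetical names (the `_sketch` suffix), not the pipeline's own code. They assume that fields on the same line share the same `y0` after height adjustment and that coordinates are in PDF points with the origin at a page corner.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Field:
    field_id: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def truncate_overlaps_sketch(fields: List[Field]) -> List[Field]:
    """Left field wins: a field that starts inside its left neighbour is
    moved to start where that neighbour ends."""
    out: List[Field] = []
    for f in sorted(fields, key=lambda f: (f.page, f.y0, f.x0)):
        prev = out[-1] if out else None
        # Assumption: "same line" means same page and identical y0.
        same_line = prev is not None and prev.page == f.page and prev.y0 == f.y0
        if same_line and f.x0 < prev.x1:
            f = replace(f, x0=prev.x1)
        out.append(f)
    return out


def drop_degenerate_sketch(fields: List[Field], page_w: float, page_h: float) -> List[Field]:
    """Keep only rectangles with positive area that lie inside the page."""
    return [
        f for f in fields
        if f.x1 > f.x0 and f.y1 > f.y0
        and 0 <= f.x0 and f.x1 <= page_w
        and 0 <= f.y0 and f.y1 <= page_h
    ]


raw = [
    Field("a", 0, 10.0, 100.0, 60.0, 112.0),
    Field("b", 0, 50.0, 100.0, 120.0, 112.0),   # overlaps "a" by 10 pt
    Field("c", 0, 200.0, 300.0, 200.0, 312.0),  # zero width
]
cleaned = drop_degenerate_sketch(truncate_overlaps_sketch(raw), page_w=612.0, page_h=792.0)
```

Running the degenerate filter after truncation also catches any field that was completely covered by its left neighbour, since truncation leaves it with a negative width.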

Use Cases

Flat-to-fillable conversion is applied whenever an organization must fill a form repeatedly but only has access to a static PDF. Immigration attorneys convert scanned USCIS forms (I-485, I-765, I-131) into fillable PDFs so client data can be mapped programmatically across the entire packet. Healthcare billing departments convert CMS-1500 claim forms so patient and service data from practice management systems can be written directly into AcroForm fields. Mortgage lenders convert Fannie Mae 1003 applications so loan officer data entry systems can populate borrower sections without manual typing. Court administrators convert state-specific pleading templates that courts distribute only as flat PDFs, enabling law firms to autofill case captions, party names, and filing metadata.

Benefits

  • No source files required: Conversion operates on the PDF you receive, not the original InDesign or Word source — useful for government and court-issued forms where source files are never distributed
  • One-time cost, unlimited reuse: A CMS-1500 or 1003 is converted once; the resulting AcroForm template is used for every subsequent fill session without reconversion
  • Field manifest for automation: The JSON {page, x0, y0, x1, y1, field_id} output integrates directly with API-based filling workflows, so converted forms can be filled programmatically without a UI
  • LLM review reduces cleanup: False-positive removal via get_nonsense_fields_from_page() cuts the number of incorrectly placed fields before human review, typically leaving fewer than 5 fields requiring manual adjustment on a standard government form
  • AcroForm compatibility: Output is standard AcroForm, not a proprietary format — the filled PDF works in Adobe Reader, browser viewers, and any PDF processing library

Security & Privacy

Conversion processing analyzes document structure and page geometry; it does not extract or store the semantic content of any pre-filled text fields on the source PDF. Rendered page images passed to the LLM false-positive checker are transmitted only for that inference call and are not retained. Data is scoped to workspaceId and protected via the shared JWT authentication middleware running in both the .NET and Python service layers. Original flat PDFs are preserved unmodified; all conversion artifacts (fillable PDF, JSON manifest, review PDF) are stored under the same workspace access controls as the source document. All conversion events are written to the audit log with user ID, workspace ID, form ID, and timestamp.

Common Questions

How accurate is automatic field detection?

Accuracy depends on the source PDF type. Digitally created flat PDFs with vector borders (the kind produced by printing a Word or InDesign document to PDF) produce 95–98% field detection accuracy from detect_boxes_fitz() alone because the border geometry is embedded in the PDF stream. Scanned forms at 300 DPI processed by detect_blanks() typically achieve 90–95% accuracy. Scans below 200 DPI or with significant skew or background noise drop to 80–85%.

The LLM false-positive pass (get_nonsense_fields_from_page()) then removes fields incorrectly placed on logos, decorative borders, or header text, which usually accounts for 2–8 fields on a complex multi-page form. The annotated review PDF shows every surviving field with its boundaries overlaid so reviewers can spot and correct remaining misplacements before the template is saved.
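Once the LLM has returned the IDs it considers false positives, applying them is a simple filter. The sketch below assumes the reply has already been parsed into a list of field_id strings; the actual return shape of get_nonsense_fields_from_page() is not specified here, and `remove_flagged` is a hypothetical helper. IDs that do not match any detected field are ignored, which guards against a model returning an ID that was never on the page.

```python
from typing import Dict, Iterable, List, Tuple


def remove_flagged(
    fields: List[Dict], flagged_ids: Iterable[str]
) -> Tuple[List[Dict], List[Dict]]:
    """Split fields into (kept, removed) based on the flagged IDs.

    Unknown IDs in flagged_ids simply match nothing.
    """
    flagged = set(flagged_ids)
    kept = [f for f in fields if f["field_id"] not in flagged]
    removed = [f for f in fields if f["field_id"] in flagged]
    return kept, removed


page_fields = [
    {"page": 0, "x0": 72.0, "y0": 100.0, "x1": 300.0, "y1": 114.0, "field_id": "name"},
    {"page": 0, "x0": 40.0, "y0": 20.0, "x1": 140.0, "y1": 60.0, "field_id": "logo_box"},
]
kept, removed = remove_flagged(page_fields, ["logo_box", "not_a_real_id"])
```

Returning the removed fields alongside the kept ones makes it easy to show reviewers what the LLM pass discarded.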

What if field detection makes mistakes?

The annotated review PDF is the primary correction surface. Reviewers can see each detected field boundary against the original form content. Fields can be moved, resized, deleted, or retyped (e.g., changing a Text field to a CheckBox) through the field management interface. Manually added or corrected fields are written back through the same write_widgets.py path, so the resulting AcroForm is consistent regardless of how many adjustments were made. Corrections are saved to the form template and apply to all future fill sessions without reconversion.

How long does conversion take?

The bottleneck is the LLM false-positive pass, which renders each page as an image before making an inference call. Typical totals are 15–30 seconds for a standard 2-page CMS-1500, 1–3 minutes for a 10-page I-485, and 4–8 minutes for a 30-page 1003 or SF-86. Conversion runs asynchronously; the workspace receives a notification when the fillable PDF and JSON manifest are ready.

Can I convert forms with tables?

Yes. synthesize_fields_from_table_cells() is a dedicated step in the geometry pipeline specifically for table structures. It identifies the row and column grid of a detected table and generates one field rectangle per cell. For example, a CMS-1500 service line table with columns [Date of Service, Place of Service, Procedure Code, Modifier, Diagnosis Pointer, Charges, Units] across 6 service rows produces 42 individual fields. Fields are named with row and column indices so they map predictably in programmatic filling workflows.
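The one-field-per-cell idea can be illustrated with a small grid expansion. This sketch assumes the table has already been reduced to sorted column edges and row edges. The `cells_from_grid` helper, the edge coordinates, and the `service_r1_c1` naming scheme are hypothetical, chosen only to show how row and column indices could appear in field names.

```python
from typing import Dict, List


def cells_from_grid(xs: List[float], ys: List[float], page: int, prefix: str) -> List[Dict]:
    """Return one field rectangle per cell, given column edges xs and row edges ys."""
    fields = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            fields.append({
                "page": page,
                "x0": xs[c], "y0": ys[r],
                "x1": xs[c + 1], "y1": ys[r + 1],
                # Hypothetical naming: 1-based row and column indices.
                "field_id": f"{prefix}_r{r + 1}_c{c + 1}",
            })
    return fields


# 8 column edges give 7 columns; 7 row edges give 6 rows.
xs = [36.0 + 77.0 * i for i in range(8)]
ys = [500.0 + 24.0 * i for i in range(7)]
service_fields = cells_from_grid(xs, ys, page=0, prefix="service")
print(len(service_fields))  # 42
```

With 7 columns and 6 rows the expansion yields 7 × 6 = 42 fields, matching the CMS-1500 service line example.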

Does conversion work with checkboxes and radio buttons?

Checkboxes and radio buttons are detected by _find_checkboxes(), which runs independently of the text-field detectors. It locates small square regions (typically 8–14 pt on a standard government form) and examines adjacent label text to determine grouping. Squares with a shared question stem and mutually exclusive options are classified as RadioButton groups; independent squares are classified as CheckBox widgets. Both types are written as standard AcroForm widgets by write_widgets.py. Accuracy on clear printed forms is 92–96%; misclassifications (e.g., a decorative square box classified as a checkbox) are correctable in the review interface.
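The size filter and grouping rule can be sketched as follows. This is a simplified illustration, not the logic of _find_checkboxes() itself. It assumes each candidate square already carries the question stem it was matched to (or None), and it treats any stem shared by two or more squares as a mutually exclusive group. The `classify_squares` name and the squareness tolerance are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (id, width_pt, height_pt, question stem or None)
Square = Tuple[str, float, float, Optional[str]]


def classify_squares(
    squares: List[Square],
    min_pt: float = 8.0,
    max_pt: float = 14.0,
    tolerance: float = 1.5,
) -> Dict[str, str]:
    """Map candidate IDs to 'CheckBox' or 'RadioButton'; rejected shapes are omitted."""
    by_stem: Dict[str, List[str]] = defaultdict(list)
    result: Dict[str, str] = {}
    for sq_id, w, h, stem in squares:
        roughly_square = abs(w - h) <= tolerance
        in_range = min_pt <= w <= max_pt and min_pt <= h <= max_pt
        if not (roughly_square and in_range):
            continue  # too large, too small, or not square
        if stem is None:
            result[sq_id] = "CheckBox"
        else:
            by_stem[stem].append(sq_id)
    for ids in by_stem.values():
        # Assumption: a shared stem implies mutually exclusive options.
        kind = "RadioButton" if len(ids) > 1 else "CheckBox"
        for sq_id in ids:
            result[sq_id] = kind
    return result


candidates = [
    ("citizen_yes", 10.0, 10.0, "Are you a U.S. citizen?"),
    ("citizen_no", 10.0, 10.0, "Are you a U.S. citizen?"),
    ("agree", 12.0, 12.0, None),
    ("logo", 40.0, 40.0, None),       # too large
    ("rule", 10.0, 30.0, None),       # not square
]
kinds = classify_squares(candidates)
```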

What formats can I convert from?

The pipeline accepts flat PDFs directly — whether digitally created or scanned. Image files (PNG, JPEG, TIFF, BMP) are first converted to PDF by the document conversion service using PIL/Pillow before entering the flat-to-fillable pipeline. Multi-page TIFFs are converted to multi-page PDFs. For best results with scanned documents, scan at 300 DPI with pages squared to the scanner bed.

Can I share converted fillable PDFs?

The output of conversion is a standard AcroForm PDF. It can be downloaded and used in any PDF viewer or filling tool, shared with workspace members as a reusable form template, or accessed programmatically via the API using the field_id values from the JSON manifest. The converted form is not locked to Instafill.ai — it is a portable PDF artifact.
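As a rough illustration of working with the manifest, the snippet below loads a small sample and indexes it by field_id and by page. It assumes the manifest is a JSON array of objects with exactly the {page, x0, y0, x1, y1, field_id} keys listed above. The sample values, the 0-based page index, and the `field_size` helper are made up for the example.

```python
import json
from collections import defaultdict

# Sample manifest; values are illustrative, not taken from a real conversion.
manifest_json = """
[
  {"page": 0, "x0": 72.0, "y0": 100.0, "x1": 300.0, "y1": 114.0, "field_id": "patient_name"},
  {"page": 0, "x0": 320.0, "y0": 100.0, "x1": 420.0, "y1": 114.0, "field_id": "birth_date"},
  {"page": 1, "x0": 72.0, "y0": 640.0, "x1": 250.0, "y1": 654.0, "field_id": "signature_date"}
]
"""

manifest = json.loads(manifest_json)

# Index by field_id for direct lookup when filling.
by_id = {f["field_id"]: f for f in manifest}

# Group field IDs by page, e.g. to process one page at a time.
ids_by_page = defaultdict(list)
for f in manifest:
    ids_by_page[f["page"]].append(f["field_id"])


def field_size(field: dict) -> tuple:
    """Return (width, height) of a manifest entry in PDF points."""
    return field["x1"] - field["x0"], field["y1"] - field["y0"]


name_size = field_size(by_id["patient_name"])
```

Knowing each field's width and height up front is useful for choosing a font size or truncating values before they are written into the form.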

Is there a limit to how many forms I can convert?

Conversion limits depend on your subscription plan:

  • Free Plan: 3 conversions per month
  • Starter Plan: 10 conversions per month
  • Professional Plan: 50 conversions per month
  • Enterprise Plan: Unlimited conversions

After converting a form once, filling it does not consume additional conversion credits. Conversion credits refresh monthly.
