Form Upload & Conversion

Upload any PDF and automatically extract form fields for intelligent filling

Overview

Uploading a PDF triggers create_form() in the Python processing service. The function calls extract_pdf_components(), which returns two data structures for each page: the list of detected form fields (with type, name, and bounding-box coordinates) and the extracted text content. Field lists from adjacent pages are then passed through merge_split_fields(), which collapses fields whose bounding boxes span a page boundary into a single logical field.
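As a rough illustration of the merging step, the sketch below joins fragments of one field that sit on adjacent pages. The actual heuristic inside merge_split_fields() is not described on this page, so everything here is an assumption made for illustration: the names merge_split_fields_sketch and LogicalField, a top-left coordinate origin, one shared page_height for every page, and the rule that two fragments belong together when they share a name and type, the first touches the bottom edge of its page, and the second touches the top edge of the next page.

```python
from dataclasses import dataclass, field as dc_field


@dataclass
class LogicalField:
    """Hypothetical merged record: one field, one or more page fragments."""
    name: str
    type: str
    parts: list = dc_field(default_factory=list)  # (page, x0, y0, x1, y1) tuples


def merge_split_fields_sketch(fields, page_height, tol=2.0):
    """Collapse fragments that continue across a page break (assumed heuristic).

    `fields` is a list of dicts with keys name, type, page, x0, y0, x1, y1.
    Assumes y grows downward from the top of the page.
    """
    merged = []
    latest = {}  # (name, type) -> most recent LogicalField with that key
    for f in sorted(fields, key=lambda item: (item["page"], item["y0"])):
        box = (f["page"], f["x0"], f["y0"], f["x1"], f["y1"])
        key = (f["name"], f["type"])
        prev = latest.get(key)
        if prev is not None:
            prev_page, _, _, _, prev_y1 = prev.parts[-1]
            touches_bottom = prev_y1 >= page_height - tol
            touches_top = f["y0"] <= tol
            if f["page"] == prev_page + 1 and touches_bottom and touches_top:
                prev.parts.append(box)  # continuation of the same field
                continue
        logical = LogicalField(f["name"], f["type"], [box])
        merged.append(logical)
        latest[key] = logical
    return merged
```

In this sketch a merged field keeps every fragment's bounding box in parts, so a downstream editor could still draw each piece on its own page while treating the field as one value.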

After extraction, try_to_clone_form() checks whether this PDF already exists in the system. The primary clone check uses a field hash computed from the AcroForm widget structure; if that hash matches an existing form record, the new upload reuses the already-processed field definitions without repeating the extraction pipeline. A secondary check uses flat_pdf_hash, a visual fingerprint of the rendered pages, as a fallback for flat PDFs that have no AcroForm fields. Because of this deduplication, commonly uploaded government forms (W-9, I-9, W-4, CMS-1500) are typically served from the existing record rather than re-extracted, which brings processing time for recognized forms to under a second.
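One way to picture the field-hash check is a digest over a canonical, order-independent listing of the widgets. The sketch below is a minimal illustration under stated assumptions, not the hashing used by try_to_clone_form(): the function names, the choice of SHA-256, the rounding of coordinates to one decimal place, and the in-memory known_forms dictionary are all hypothetical.

```python
import hashlib
import json


def field_hash_sketch(fields):
    """Digest of the sorted widget data (assumed canonical form).

    `fields` is a list of dicts with keys name, type, page, x0, y0, x1, y1.
    Sorting makes the result independent of extraction order.
    """
    canonical = sorted(
        (
            f["name"],
            f["type"],
            int(f["page"]),
            round(float(f["x0"]), 1),
            round(float(f["y0"]), 1),
            round(float(f["x1"]), 1),
            round(float(f["y1"]), 1),
        )
        for f in fields
    )
    payload = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_clone_sketch(fields, known_forms):
    """Return the id of a previously processed form with the same hash, or None.

    `known_forms` stands in for a database index: {field_hash: form_id}.
    """
    return known_forms.get(field_hash_sketch(fields))
```

Rounding the coordinates is a design choice in this sketch: it keeps tiny floating-point differences between two copies of the same form from producing different digests.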

For flat or scanned PDFs where neither hash matches, the system routes the file to flat-to-fillable conversion, where the AI pipeline detects field regions before processing continues.

Real-World Example: Kona Ice automated Michigan STFU permits for 800+ franchise locations. They uploaded the state permit form once, then used batch processing to fill permits for all locations from a single spreadsheet.

Key Capabilities

  • AcroForm Field Extraction: extract_pdf_components() reads all AcroForm widget annotations, returning field name, type, and {page, x0, y0, x1, y1} coordinates for each field
  • Text Extraction Per Page: In addition to fields, extract_pdf_components() returns the text content of each page, used later for AI context during field filling
  • Cross-Page Field Merging: merge_split_fields() detects fields split across page boundaries and merges them into a single field record
  • Hash-Based Clone Detection: A field-hash computed from AcroForm structure enables fast deduplication; flat_pdf_hash provides a visual fallback for flat PDFs
  • Deduplication for Government Forms: Recognized forms (W-9, I-9, W-4, 1003 mortgage application, CMS-1500, I-485) are reused across all workspaces, avoiding redundant extraction
  • Field Type Classification: Fields are classified as TEXT_FIELD_TYPES (Text, ComboBox, Time, Date, Number) or CHECKBOX_FIELD_TYPES (CheckBox, RadioButton) during extraction
  • Multi-Page Support: Forms with hundreds of pages are supported; extract_pdf_components() processes each page independently before merging
  • Form Catalog Access: The catalog of pre-converted popular forms is consulted through the field-hash lookup, so a recognized form reuses its stored field definitions instead of being processed again
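The field type grouping listed above can be sketched as two constant sets and a lookup. The group names TEXT_FIELD_TYPES and CHECKBOX_FIELD_TYPES and their members come from this page; representing them as frozensets of strings, the classify_field_type helper, and the "text" / "checkbox" return values are assumptions made for illustration.

```python
# Group names and members as listed on this page; the string spelling of each
# member and the frozenset representation are assumptions.
TEXT_FIELD_TYPES = frozenset({"Text", "ComboBox", "Time", "Date", "Number"})
CHECKBOX_FIELD_TYPES = frozenset({"CheckBox", "RadioButton"})


def classify_field_type(field_type: str) -> str:
    """Map an extracted field type to its group (hypothetical helper)."""
    if field_type in TEXT_FIELD_TYPES:
        return "text"
    if field_type in CHECKBOX_FIELD_TYPES:
        return "checkbox"
    # Unknown types are rejected rather than silently guessed.
    raise ValueError(f"unrecognized field type: {field_type!r}")
```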

How It Works

  1. Upload PDF: The user submits a PDF via drag-and-drop or file picker. The file is validated for format integrity and page count before processing begins.

  2. Component Extraction: extract_pdf_components() runs on the uploaded file. For each page it returns:

    • A list of field objects: {name, type, page, x0, y0, x1, y1, field_id}
    • The raw text content of that page

  3. Field Merging: merge_split_fields() compares bounding boxes across adjacent pages and merges any field whose coordinates indicate it crosses a page break.

  4. Clone Check — Field Hash: try_to_clone_form() computes a hash over the sorted AcroForm widget data. If the hash matches an existing form record, the system clones that record's field definitions and skips the remainder of the pipeline.

  5. Clone Check — Visual Hash: If no field-hash match is found (common for flat or scanned PDFs), try_to_clone_form() falls back to flat_pdf_hash, a perceptual hash of the rendered page images.

  6. Flat PDF Handling: If neither hash matches and the PDF has no AcroForm fields, the file is forwarded to the flat-to-fillable AI pipeline, which detects field regions and returns synthetic field records in the same coordinate format.

  7. Template Creation: The resolved field list is written to the form record in the database, scoped to the uploading workspace's workspaceId. The form is immediately available for filling sessions or batch processing.
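The decision order in steps 4 through 6 can be summarized in a few lines. This is a sketch of the routing only, under stated assumptions: resolve_fields_sketch, its parameters, the dictionary lookups standing in for database queries, and the returned source labels are hypothetical and do not reflect the internals of try_to_clone_form().

```python
def resolve_fields_sketch(
    native_fields,        # fields found by extraction (may be empty)
    field_hash,           # hash over the AcroForm widget data
    flat_pdf_hash,        # visual hash of the rendered pages
    known_by_field_hash,  # {field_hash: stored field list}, stand-in for a DB lookup
    known_by_flat_hash,   # {flat_pdf_hash: stored field list}
    detect_with_ai,       # callable that runs flat-to-fillable detection
):
    """Return (source, fields) following the order of steps 4 to 6."""
    # Step 4: primary clone check on the field hash.
    if field_hash in known_by_field_hash:
        return "cloned:field_hash", known_by_field_hash[field_hash]
    # Step 5: fall back to the visual hash.
    if flat_pdf_hash in known_by_flat_hash:
        return "cloned:flat_pdf_hash", known_by_flat_hash[flat_pdf_hash]
    # Step 6: no match and no native fields, so hand off to AI detection.
    if not native_fields:
        return "flat_to_fillable", detect_with_ai()
    # Otherwise the freshly extracted fields become the template.
    return "extracted", native_fields
```

Passing the AI detection step in as a callable keeps the expensive path lazy in this sketch: it runs only when both clone checks miss and no native fields exist.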

Benefits

  • Fast Processing for Known Forms: Hash deduplication means popular government forms (W-9, I-9, W-4, CMS-1500) are available in under a second — no re-extraction per workspace
  • One-Time Setup: Upload once, use in unlimited filling sessions or batch runs without re-uploading
  • Accurate Field Coordinates: Bounding-box coordinates are read directly from the AcroForm widget annotations, giving the visual editor exact field placement
  • Cross-Page Field Integrity: merge_split_fields() prevents split fields from appearing as two disconnected fields in long multi-section forms like the 1003 mortgage application
  • Flexible Input: Fillable PDFs, flat PDFs, and scanned documents all enter the same downstream field management workflow

Real-World Example: After the August 2025 update improved core algorithm performance by roughly 30%, a healthcare organization saw form conversion time for complex medical credentialing forms drop from 90 to 60 seconds.

Security & Privacy

Data is scoped to workspaceId and protected via the shared JWT authentication middleware running in both the .NET and Python service layers. The field-hash deduplication shares extracted field structure (field names, types, and coordinates) across workspaces for recognized public forms, but never shares filled data or workspace-specific metadata. Uploaded PDFs are stored in workspace-isolated storage; the flat_pdf_hash comparison uses a rendered image hash and does not expose page content to other workspaces.

Common Questions

What if my form doesn't have fillable fields?

If extract_pdf_components() returns an empty field list and no matching flat_pdf_hash is found in the catalog, the system routes the PDF to the flat-to-fillable AI pipeline. That pipeline renders each page as an image, runs field-region detection, and returns synthetic field records using the same {page, x0, y0, x1, y1} coordinate schema as native AcroForm fields. See Flat-to-Fillable Conversion for details on how field regions are detected.

How long does conversion take?

  • Forms that match an existing field-hash or flat_pdf_hash in the catalog: under 1 second (clone path, no re-extraction)
  • Simple fillable forms (1–5 pages): 5–15 seconds for extract_pdf_components() plus field merging
  • Standard forms (6–20 pages): 15–45 seconds
  • Complex or flat forms (more than 20 pages, requiring AI field detection): 45–120 seconds
