PDF Field Extraction & Detection

Parse AcroForm field definitions from fillable PDFs using PyMuPDF, normalize field objects, and merge split fields into a coherent form structure

Overview

PDF Field Extraction is the step that converts a PDF file's raw AcroForm widget data into the structured field objects Instafill.ai uses for filling, validation, and automation. When you upload a fillable PDF — an IRS 1040, a USCIS I-130, a CMS-1500, or any other AcroForm document — the extraction function extract_pdf_field_names() reads every widget annotation via PyMuPDF and returns a record for each field containing its name, type, label, page position, character limit, current value, and available choices.

Those raw records are then normalized into field objects by build_field_object() and typed into text or checkbox categories. They then pass through merge_split_fields(), which groups horizontally aligned single-line fields that belong to the same logical entry. This pattern is common in government forms, where a single label like "Applicant's Name" spans three adjacent narrow boxes for last name, first name, and middle initial. The output is the field list that drives every fill session, API call, and batch operation.

Key Capabilities

  • extract_pdf_field_names() via PyMuPDF: Returns {widget_name, field_type, field_label, page_num, text_maxlen, field_value, choice_values} for every AcroForm widget in the document
  • Text and checkbox type classification: TEXT_FIELD_TYPES = ["Text", "ComboBox", "Time", "Date", "Number"]; CHECKBOX_FIELD_TYPES = ["CheckBox", "RadioButton"]
  • Normalized field objects: build_field_object() produces {id, name, form_name, form_type, page_num, form_max_length, form_value, form_choices} for each widget
  • Split-field merging: merge_split_fields() groups fields by Y-position within a configurable tolerance, identifies table cells, and merges horizontally aligned single-line fields via a shared abstract_field_id
  • Pre-filled field extraction: get_static_fields() reads fields that already contain values, skipping barcode widgets and OFF-state checkboxes
  • Multi-page support: Page number is recorded per field; forms spanning dozens of pages (e.g., SF-86 security clearance) are fully extracted in a single pass
  • choice_values for dropdowns and radio groups: ComboBox and RadioButton widgets expose their allowed values, enabling validation and structured filling
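
The record shape above can be illustrated with a minimal sketch built on PyMuPDF's public widget attributes. This is not Instafill.ai's extract_pdf_field_names(); the helper names widget_to_record and extract_records are hypothetical, and the mapping from widget attributes to record keys is an assumption based on the key names listed above.

```python
def widget_to_record(widget, page_num):
    """Map one PyMuPDF widget to the record keys listed above (assumed mapping)."""
    return {
        "widget_name": widget.field_name,
        "field_type": widget.field_type_string,
        "field_label": widget.field_label,
        "page_num": page_num,                    # 0-indexed
        "text_maxlen": widget.text_maxlen or 0,  # 0 is treated as "no limit"
        "field_value": widget.field_value,
        "choice_values": list(widget.choice_values or []),
    }


def extract_records(pdf_path):
    """Walk every page of a PDF and collect one record per widget."""
    # PyMuPDF is imported lazily so widget_to_record can be used without it.
    import fitz

    records = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for widget in page.widgets():
                records.append(widget_to_record(widget, page.number))
    return records
```

Calling extract_records("form.pdf") on a fillable PDF would return one dictionary per widget; a PDF without an AcroForm layer would yield an empty list.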

How It Works

  1. PDF upload: A fillable PDF is uploaded to the workspace. PyMuPDF opens the document and iterates over all pages.

  2. Widget enumeration with extract_pdf_field_names(): For each page, PyMuPDF enumerates AcroForm widget annotations. Each widget yields a record:

    • widget_name — the internal PDF field name (e.g., "topmostSubform[0].Page1[0].f1_1[0]" on an IRS form)
    • field_type — the AcroForm type string: "Text", "CheckBox", "RadioButton", "ComboBox", "Time", "Date", or "Number"
    • field_label — the tooltip or alternate text, if set
    • page_num — 0-indexed page number
    • text_maxlen — maximum character length for text fields (0 = unlimited)
    • field_value — current field value, if pre-filled
    • choice_values: list of allowed options for ComboBox and RadioButton widgets

  3. Type bucketing: Fields are assigned to one of two categories. TEXT_FIELD_TYPES covers ["Text", "ComboBox", "Time", "Date", "Number"]; CHECKBOX_FIELD_TYPES covers ["CheckBox", "RadioButton"]. This determines how filling logic handles each field: text fields receive string values, and checkbox fields receive boolean or on/off values.

  4. Field object construction with build_field_object(): Each raw widget record is normalized into a field object: {id, name, form_name, form_type, page_num, form_max_length, form_value, form_choices}. form_type maps from the AcroForm type to Instafill.ai's internal type taxonomy. form_choices is populated from choice_values for ComboBox and RadioButton fields.

  5. Split-field merging with merge_split_fields(): Many government forms split a single logical field across multiple adjacent narrow widgets. For example, a date field on a USCIS form might be three separate AcroForm widgets for month, day, and year, all on the same baseline. merge_split_fields() groups fields by Y-position within a configurable tolerance, detects which groups represent table cells versus independent split fields, and assigns a shared abstract_field_id to horizontally aligned single-line fields in the same group. The fill layer can then treat them as a unit when mapping source data.

  6. Static field extraction with get_static_fields(): A separate pass using get_static_fields() reads fields that already have values in the PDF — for example, form version numbers, pre-printed agency codes, or read-only instruction text rendered as field values. Barcode widgets are skipped (their values are generated programmatically). CheckBox fields in the OFF state are also skipped, since they carry no data content.

  7. Field list output: The complete list of build_field_object() records, with abstract_field_id assignments from merge and static values from the pre-fill pass, is stored as the form's field definition. This is the object that drives fill sessions, API responses at /api/forms/{formId}/fields, and batch automation.
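
Steps 3 and 4 can be sketched in a few lines of plain Python. The two type lists are taken from the text above; everything else is an assumption for illustration, including the function names bucket_for and build_field_object_sketch, the id scheme, the choice of label as the display name, and the use of the bucket name as form_type (the internal type taxonomy is not described here).

```python
TEXT_FIELD_TYPES = ["Text", "ComboBox", "Time", "Date", "Number"]
CHECKBOX_FIELD_TYPES = ["CheckBox", "RadioButton"]
CHOICE_TYPES = ("ComboBox", "RadioButton")


def bucket_for(field_type):
    """Return 'text', 'checkbox', or None for types outside both lists."""
    if field_type in TEXT_FIELD_TYPES:
        return "text"
    if field_type in CHECKBOX_FIELD_TYPES:
        return "checkbox"
    return None


def build_field_object_sketch(record, index):
    """Normalize one raw widget record into the field-object shape."""
    has_choices = record["field_type"] in CHOICE_TYPES
    return {
        "id": f"field_{index}",                                   # hypothetical id scheme
        "name": record["field_label"] or record["widget_name"],   # assumed fallback
        "form_name": record["widget_name"],
        "form_type": bucket_for(record["field_type"]),            # assumed mapping
        "page_num": record["page_num"],
        "form_max_length": record["text_maxlen"],
        "form_value": record["field_value"],
        "form_choices": record["choice_values"] if has_choices else [],
    }
```

Keeping the bucketing in its own function means fill logic can branch on "text" versus "checkbox" without knowing the full list of AcroForm type strings.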

Use Cases

Field extraction is the prerequisite step for any form that will be filled more than once. IRS and government form libraries (1040, 1040-SR, W-4, I-9, SF-86) are extracted once to build a field catalog with type, position, and character limit metadata for each widget, enabling repeated programmatic fills without re-parsing the PDF. CMS-1500 and UB-04 healthcare claim forms are extracted so billing systems can map patient demographic and service line data directly to named field IDs. Legal teams extract USCIS benefit forms — I-130, I-485, I-765, I-864 — so immigration software can fill complete application packets from a single client profile. Mortgage originators extract Fannie Mae 1003 and HUD-1 settlement statement fields so loan origination systems can populate them from structured loan data.

Benefits

  • No manual field mapping: extract_pdf_field_names() reads widget names, types, positions, and character limits directly from the PDF — there is no schema to maintain manually
  • Accurate type classification: Separating TEXT_FIELD_TYPES from CHECKBOX_FIELD_TYPES ensures fill logic applies the correct value format for each widget, preventing type mismatches that corrupt AcroForm output
  • Split-field awareness: merge_split_fields() handles the split-date and split-name patterns ubiquitous in government forms, so autofill can populate "01 / 15 / 1985" across three adjacent widgets from a single date value
  • Character limit enforcement: text_maxlen from extract_pdf_field_names() is surfaced in the field object as form_max_length, so filling logic can truncate or warn before writing a value that exceeds the field's PDF-defined limit
  • Static value preservation: get_static_fields() captures pre-printed form metadata and read-only values so they are not overwritten during programmatic fills
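
To make the split-field and character-limit points concrete, here is a small sketch of how one date value might be spread across a merged group. It assumes the group is a list of field objects that share an abstract_field_id and that each carries a left-edge coordinate under a hypothetical x0 key; the function name distribute_parts is also invented for this example.

```python
def distribute_parts(parts, group):
    """Assign each part of a split value to one widget, left to right.

    parts: list of strings, e.g. ["01", "15", "1985"]
    group: field objects sharing one abstract_field_id (x0 key is assumed)
    """
    ordered = sorted(group, key=lambda f: f["x0"])
    if len(parts) != len(ordered):
        raise ValueError("number of parts does not match number of widgets")

    values = {}
    for part, field in zip(parts, ordered):
        limit = field["form_max_length"]  # 0 means unlimited
        if limit and len(part) > limit:
            raise ValueError(f"{part!r} exceeds limit of {limit} for {field['form_name']}")
        values[field["form_name"]] = part
    return values


group = [
    {"form_name": "dob_year", "x0": 130.0, "form_max_length": 4},
    {"form_name": "dob_month", "x0": 50.0, "form_max_length": 2},
    {"form_name": "dob_day", "x0": 90.0, "form_max_length": 2},
]
filled = distribute_parts("01/15/1985".split("/"), group)
```

Raising on an over-length part is one possible policy; truncating or logging a warning would fit the same structure.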

Security & Privacy

Field extraction reads AcroForm widget metadata: names, types, positions, character limits, and choice lists. It does not copy the substantive content of pre-filled sensitive fields (SSNs, dates of birth, financial amounts) into any external store; those values are read solely by get_static_fields() to identify read-only widgets and are not indexed or logged. Data is scoped to workspaceId and protected via the shared JWT authentication middleware running in both the .NET and Python service layers. Extracted field definitions are stored under workspace-level access controls and are not accessible to other workspaces. All extraction events are written to the audit log with user ID, workspace ID, form ID, and timestamp.

Common Questions

Does field extraction work with scanned PDFs?

extract_pdf_field_names() operates on AcroForm widget annotations in the PDF structure. A scanned PDF that has been run through the flat-to-fillable conversion pipeline will have AcroForm widgets written by write_widgets.py, and those widgets are extracted normally. A raw scanned PDF with no AcroForm layer has no widget annotations to extract; in that case the document must first go through flat-to-fillable conversion to generate the widget layer before field extraction can run.
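
A quick way to tell the two cases apart is to check for widgets before running extraction. The sketch below uses PyMuPDF's Document.is_form_pdf property and Page.widgets() generator; the function name has_acroform_widgets is hypothetical, and how Instafill.ai actually routes documents to conversion is not described here.

```python
def has_acroform_widgets(doc):
    """Return True if an opened PyMuPDF document exposes at least one widget."""
    # is_form_pdf is falsy when the document has no form fields at all.
    if not doc.is_form_pdf:
        return False
    # Confirm that at least one page actually yields a widget annotation.
    return any(next(page.widgets(), None) is not None for page in doc)


# Typical use (requires PyMuPDF and a file on disk):
#   import fitz
#   with fitz.open("scan.pdf") as doc:
#       if not has_acroform_widgets(doc):
#           ...  # send the document through flat-to-fillable conversion first
```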

How long does field extraction take?

PyMuPDF's widget enumeration is fast for standard AcroForm PDFs. A 2-page CMS-1500 with 33 fields extracts in under 2 seconds. A 10-page I-485 with approximately 100 fields extracts in 5–10 seconds. The merge_split_fields() grouping step adds negligible time. Forms with hundreds of fields (SF-86, IRS 1040 with all schedules) may take 20–40 seconds. Extraction runs once per form upload; results are cached so subsequent fill sessions do not re-parse the PDF.

What if field extraction misses a field or gets something wrong?

If extract_pdf_field_names() does not return a field, that field has no AcroForm widget annotation in the PDF, so there is nothing in the PDF structure to extract. In that case, the document needs flat-to-fillable conversion to synthesize those widgets. Fields that are extracted but have incorrect type assignments can be corrected in the field management interface; corrections are stored in the form template and applied to all subsequent fill sessions.

Does extraction support XFA forms?

XFA (XML Forms Architecture) forms used by some government agencies embed form definitions in an XML stream rather than AcroForm annotations. PyMuPDF can read the rendered widget positions from some XFA forms, but the dynamic field generation rules (fields that appear or disappear based on user input) are not executed. For XFA forms that require dynamic rendering, the recommended approach is to render the form to a static PDF first — which flattens the XFA layer — and then run flat-to-fillable conversion to synthesize AcroForm widgets.

Can extraction identify field types automatically?

For AcroForm PDFs, field type is read directly from the widget annotation — extract_pdf_field_names() returns the exact type string set by the form author ("Text", "CheckBox", "RadioButton", "ComboBox", "Date", "Number", "Time"). There is no inference required. The type bucketing into TEXT_FIELD_TYPES and CHECKBOX_FIELD_TYPES is a classification step that maps these raw type strings to Instafill.ai's fill logic categories.

For flat PDFs processed through conversion, the type assigned during write_widgets.py execution becomes the AcroForm type that extraction subsequently reads.

What happens with tables in forms?

merge_split_fields() includes table cell detection. When it groups fields by Y-position and finds a dense grid of fields (multiple rows at consistent Y-intervals with consistent X-positions), it identifies the structure as a table and assigns row and column metadata to each field object. For example, the CMS-1500 service line section (Item 24) is detected as a 6-row table with date, procedure code, modifier, diagnosis pointer, charge, and unit columns. This metadata is used by batch fill operations to map tabular source data (CSV rows, database query results) directly to table row fields without manual column mapping.
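
The grouping idea can be sketched as follows. This is a simplified illustration, not the actual merge_split_fields() logic: the x0 and y0 keys, the default tolerances, and the rule "adjacent rows with matching column positions form a table" are all assumptions made for the example.

```python
def group_rows(fields, y_tol=3.0):
    """Group fields whose top edges fall within y_tol of the row's first field."""
    rows = []
    for field in sorted(fields, key=lambda f: (f["y0"], f["x0"])):
        if rows and abs(field["y0"] - rows[-1][0]["y0"]) <= y_tol:
            rows[-1].append(field)
        else:
            rows.append([field])
    for row in rows:
        row.sort(key=lambda f: f["x0"])  # left to right within a row
    return rows


def same_columns(row_a, row_b, x_tol=3.0):
    """True if two rows have the same number of fields at matching X positions."""
    return len(row_a) == len(row_b) and all(
        abs(a["x0"] - b["x0"]) <= x_tol for a, b in zip(row_a, row_b)
    )


def label_rows(rows):
    """Label each row 'table', 'split', or 'single'."""
    labels = []
    for i, row in enumerate(rows):
        above = i > 0 and same_columns(row, rows[i - 1])
        below = i + 1 < len(rows) and same_columns(row, rows[i + 1])
        if len(row) > 1 and (above or below):
            labels.append("table")   # repeated column layout across rows
        elif len(row) > 1:
            labels.append("split")   # candidates for a shared abstract_field_id
        else:
            labels.append("single")
    return labels
```

Rows labeled "split" are the ones that would receive a shared abstract_field_id, while rows labeled "table" would instead get row and column metadata.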

Can I export extracted field data?

Yes. Extracted field definitions are accessible via the REST API at /api/forms/{formId}/fields and returned as JSON objects with the full {id, name, form_name, form_type, page_num, form_max_length, form_value, form_choices} structure from build_field_object(). This JSON output can be used to build custom filling integrations, generate field catalogs for internal tooling, or validate that a new version of a form has the same field structure as the previous version.
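
As one example of the version check mentioned above, the following sketch compares two exported field lists. It assumes each response body is a JSON array of field objects keyed as described; the function names and the choice of which keys count as "structure" are illustrative, not part of the API.

```python
import json

# Keys treated as structural for this comparison (an assumption).
COMPARED_KEYS = ("form_type", "page_num", "form_max_length", "form_choices")


def index_by_name(fields):
    """Index field objects by their PDF widget name."""
    return {f["form_name"]: f for f in fields}


def diff_field_structure(old_json, new_json):
    """Report widgets added, removed, or structurally changed between versions."""
    old = index_by_name(json.loads(old_json))
    new = index_by_name(json.loads(new_json))
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            name for name in shared
            if any(old[name].get(k) != new[name].get(k) for k in COMPARED_KEYS)
        ),
    }
```

An empty result in all three lists suggests that mappings built against the previous form version can be reused as-is.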
