Autofill from Multiple Sources
Extract data from any document format and automatically populate form fields using AI
Overview
Autofill from Multiple Sources is the data ingestion layer of every form filling session. You upload source documents — resumes, insurance cards, pay stubs, prior forms, text pastes, spreadsheets — and the system extracts, normalizes, and maps their contents to form fields without manual transcription. Each source is processed into vector embeddings on upload, and source text is mapped to specific form page numbers. When the autofill pipeline fills a field group on page 3 of a credentialing packet, only source text from page-relevant sections reaches the AI — not the full document — keeping extraction focused and accurate across long, multi-part forms.
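As a rough illustration of page-scoped retrieval, the sketch below filters source text down to the chunks mapped to one form page before any prompt is built. It is a minimal sketch under assumptions, not the production pipeline: the `SourceChunk` type, its `form_pages` field, and `chunks_for_page` are hypothetical names, and the vector-embedding step is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceChunk:
    """A piece of extracted source text tagged with the form pages it maps to."""
    source_name: str
    text: str
    form_pages: frozenset[int]  # hypothetical field: form pages this chunk is relevant to


def chunks_for_page(chunks: list[SourceChunk], page: int) -> list[SourceChunk]:
    """Return only the chunks mapped to the given form page, in original order."""
    return [chunk for chunk in chunks if page in chunk.form_pages]


chunks = [
    SourceChunk("license.pdf", "License No. A-1234, issued 2021", frozenset({3})),
    SourceChunk("cv.pdf", "Residency: General Hospital, 2015-2018", frozenset({2, 3})),
    SourceChunk("cv.pdf", "Home address: 12 Elm St", frozenset({1})),
]

# Only page-3 text is joined into the context for page-3 field groups.
page_3_context = "\n".join(chunk.text for chunk in chunks_for_page(chunks, 3))
print(page_3_context)
```

The home address chunk never reaches the page-3 context, which is the point of scoping: less irrelevant text for the model to sift through on long forms.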
Sources can be combined freely within a session. Filling a 1003 mortgage application might use a W-2 for income, a bank statement for assets, and a previous 1003 for applicant identity fields — the AI resolves conflicts between sources (flagging where the same field has different values across documents) and draws from each source for the fields it best covers. Profile files from the source library are async-copied into the session before autofill_db_fields() runs, so saved sources from prior sessions are immediately available without re-upload.
Text source content is encrypted via handle_text_encryption() (PyCryptodome 3.19.0) with workspace-scoped keys before storage. For organizations that cannot retain source documents at all after processing, Stateless Mode deletes source data immediately on session completion.
Key Capabilities
- Universal Format Support: PDFs, Word documents, Excel spreadsheets, images, emails, plain text, CSV files, and more
- Intelligent Field Mapping: AI automatically identifies which source data belongs in which form field
- Multi-Source Fusion: Combine information from multiple documents to complete a single form
- Context Understanding: Recognizes field types (dates, addresses, names, numbers) and formats data appropriately
- OCR Integration: Google Cloud Vision API extracts text from scanned documents and images, with PIL/Pillow and OpenCV 4.9.0 for image pre-processing
- Page-Scoped Retrieval: Vector embeddings per source; source text mapped to specific form page numbers for focused extraction
- Email Submission: Forward documents via email to automatically trigger form filling
- Webhook Support: Push data from external systems to automatically create and fill forms
- Profile-Based Autofill: Async-copy saved profile files into sessions as reusable source data
- Bulk Session Creation: Process multiple source documents simultaneously to create many filled forms
How It Works
Source Upload: Add one or more source documents to your form filling session. Drag-and-drop files, paste text, forward emails, or send data via webhook or API. Profile files from your source library are async-copied in automatically when selected — no re-upload needed.
Document Processing: Each source is processed in parallel:
- Text extracted from PDF (PyMuPDF), Word (python-docx), Excel, and plain text
- Images and scanned PDFs processed via Google Cloud Vision API (OCR), with PIL/Pillow and OpenCV 4.9.0 for pre-processing (rotation correction, contrast normalization)
- Vector embeddings created per source; source text mapped to form page numbers for page-scoped retrieval
- Text content encrypted via handle_text_encryption() with workspace-scoped keys before storage
- Processing status available at GET /api/sessions/{session_id}/process-sources-status
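A client can poll the processing status endpoint until sources are ready. The sketch below is an assumption-heavy illustration: the host `api.example.com`, the `status` field, and its `completed` / `failed` values are hypothetical, since the response shape is not documented here, and authentication headers are omitted.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://api.example.com"  # hypothetical host, not taken from the docs


def fetch_status(session_id: str) -> dict:
    """GET the source-processing status for a session and decode the JSON body."""
    url = f"{BASE_URL}/api/sessions/{session_id}/process-sources-status"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_sources(
    session_id: str,
    fetch: Callable[[str], dict] = fetch_status,
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll until the (assumed) "status" field reports a terminal value."""
    for _ in range(max_attempts):
        body = fetch(session_id)
        if body.get("status") in {"completed", "failed"}:  # assumed status values
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"sources for session {session_id} are still processing")
```

Passing `fetch` as a parameter keeps the polling loop testable without a network connection.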
Intelligent Mapping: The AI creates connections between source data and form fields:
- Semantic matching: "Applicant Name" field matches "Name" from resume
- Pattern matching: Phone number formats identified and normalized
- Context awareness: "Start Date" in an employment section uses job start date, not document date
- Confidence scoring: Each mapping receives a confidence score; low-confidence fields flagged for review
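The mapping and review-flag logic above can be sketched with a deliberately simple stand-in. Here token overlap (Jaccard similarity) between labels plays the role of the AI's semantic matching, which it is not; the `REVIEW_THRESHOLD` value and all function names are hypothetical.

```python
import re

REVIEW_THRESHOLD = 0.5  # hypothetical cutoff; mappings scoring below it are flagged


def tokens(label: str) -> set[str]:
    """Lowercase a label and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))


def label_confidence(field_label: str, source_label: str) -> float:
    """Jaccard similarity of the two labels' token sets, from 0.0 to 1.0."""
    a, b = tokens(field_label), tokens(source_label)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def map_fields(field_labels: list[str], source_data: dict[str, str]) -> dict[str, dict]:
    """Pick the best-scoring source label per field and flag weak matches."""
    mappings = {}
    for field in field_labels:
        best_label, best_score = None, 0.0
        for source_label in source_data:
            score = label_confidence(field, source_label)
            if score > best_score:
                best_label, best_score = source_label, score
        mappings[field] = {
            "value": source_data.get(best_label),  # None when nothing matched
            "confidence": best_score,
            "needs_review": best_score < REVIEW_THRESHOLD,
        }
    return mappings


resume = {"Name": "Jane Doe", "Phone": "555-0100"}
result = map_fields(["Applicant Name", "Employer Phone Number", "Date of Birth"], resume)
```

In this toy run "Applicant Name" maps to the resume's "Name" without a flag, while "Employer Phone Number" matches "Phone" only weakly and is marked for review.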
Data Extraction & Transformation: Source data is extracted and formatted appropriately:
- Date format conversion (MM/DD/YYYY to DD-MMM-YYYY)
- Address parsing and restructuring
- Name formatting (First Name, Last Name from "Jane Doe")
- Text overflow handled via font-metric estimation + REDUCE_FIELD_VALUE_AI_MODEL for fields where extracted text exceeds the maximum character length
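Two of the transformations listed above, date conversion and name splitting, can be illustrated in a few lines. This is a minimal sketch with hypothetical function names; it treats the last word as the last name, which will not hold for every name, and it does not cover address parsing or overflow handling.

```python
from datetime import datetime

# Fixed English abbreviations so the output does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def to_dd_mmm_yyyy(us_date: str) -> str:
    """Convert MM/DD/YYYY to DD-MMM-YYYY; raises ValueError on invalid dates."""
    parsed = datetime.strptime(us_date, "%m/%d/%Y")
    return f"{parsed.day:02d}-{MONTHS[parsed.month - 1]}-{parsed.year:04d}"


def split_name(full_name: str) -> tuple[str, str]:
    """Split a full name into (first, last), treating the final word as the last name."""
    parts = full_name.split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        return parts[0], ""
    return " ".join(parts[:-1]), parts[-1]


print(to_dd_mmm_yyyy("03/15/2024"))  # 15-Mar-2024
print(split_name("Jane Doe"))        # ('Jane', 'Doe')
```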
Field Population: All form fields are filled with extracted data via the autofill pipeline's concurrent group dispatch. Fields with low confidence scores are flagged for your review in the visual editor.
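Concurrent group dispatch can be pictured as one task per field group, all awaited together. The sketch below uses `asyncio.gather` for this; `fill_group` is a hypothetical placeholder for the real AI call, and the group names are invented for illustration.

```python
import asyncio


async def fill_group(group_name: str, fields: list[str]) -> dict[str, str]:
    """Stand-in for one AI call that fills a group of related fields."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return {field: f"<value for {field}>" for field in fields}


async def fill_form(groups: dict[str, list[str]]) -> dict[str, str]:
    """Dispatch every field group concurrently and merge the results."""
    results = await asyncio.gather(
        *(fill_group(name, fields) for name, fields in groups.items())
    )
    filled: dict[str, str] = {}
    for group_result in results:
        filled.update(group_result)
    return filled


groups = {
    "identity": ["first_name", "last_name"],
    "employment": ["employer", "start_date"],
}
filled = asyncio.run(fill_form(groups))
```

Because the groups run concurrently, total latency tracks the slowest group rather than the sum of all groups.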
Iterative Refinement: If fields are missing or incorrect, you can:
- Upload additional sources to provide missing information
- Correct fields in the visual editor — corrections are saved as examples that improve future fills on the same form
- Re-run autofill with updated sources
Use Cases
Multi-source autofill is used across regulated industries where forms require data pulled from many different documents. Dental offices combine insurance cards, photo IDs, and intake questionnaires to fill 47-field verification forms in under a minute. Immigration law firms upload a dozen client documents to complete complex forms like the I-485 at high accuracy. Mortgage brokers receive completed loan applications in seconds by uploading pay stubs, tax returns, and bank statements rather than scheduling lengthy phone interviews.
Real-World Examples: ABA therapy providers extract patient data from insurance cards, referrals, and medical histories to complete authorization forms at scale. Teleradiology practices process physician licenses and certificates from multiple hospitals to complete credentialing packets in 30 minutes instead of 2 hours.
Benefits
- Eliminate Manual Transcription: No more reading source documents and typing into form fields
- Reduce Errors: Automatic extraction cuts down on typos and transcription mistakes
- Save Time: Forms that took 30–60 minutes to fill manually now complete in under 2 minutes
- Handle Complexity: Process forms requiring data from 10+ different source documents
- Support Any Format: Accept whatever the client has — PDF, scan, spreadsheet, email
- Work at Scale: Process hundreds of documents and forms simultaneously with batch operations
- Maintain Compliance: Extraction creates audit trails showing data source for every filled field
- Improve Consistency: AI applies the same extraction and formatting logic across all forms
Security & Privacy
Source document handling includes multiple security layers:
- Encryption: Text source content encrypted via handle_text_encryption() (PyCryptodome 3.19.0) with workspace-scoped keys stored in Azure Key Vault. Files stored in Azure Blob Storage via utils/azure.py.
- Scope Restriction: Encryption includes scope metadata preventing decryption outside the originating workspace, even for internal service calls.
- Access Control: Source documents scoped to workspaceId, accessible only to users with session permissions. Protected via the shared JWT authentication middleware in both the .NET and Python service layers.
- Automatic Cleanup: Configure retention policies to automatically delete source documents after specified periods
- HIPAA Compliance: Healthcare organizations can enable HIPAA-compliant source handling with encrypted storage and audit logging
- Stateless Processing: Enable stateless mode to process sources without persistent storage (data deleted immediately after form filling)
- No AI Training: Source documents are never used to train AI models — processed only for your specific form filling task
Stateless Mode for Maximum Security
For organizations in regulated industries or those handling highly sensitive data, Instafill.ai offers Stateless Mode - a processing option that provides the highest level of data security by ensuring source documents are never stored persistently on our servers.
What is Stateless Mode?
In normal operation, Instafill.ai stores uploaded source documents for a configurable retention period (30-90 days) to enable:
- Historical session review and audit trails
- Reprocessing forms if corrections are needed
- Creating profiles from previously uploaded sources
- Compliance documentation linking filled forms to source data
Stateless Mode changes this behavior fundamentally:
- Source documents are processed only during the active filling session
- Once the form is filled and downloaded, source documents are immediately and permanently deleted
- No copies, backups, or cached versions are retained
- Form templates and field mappings remain saved, but source content is gone
Critical Distinction: Stateless mode deletes SOURCE DOCUMENTS (the data you upload), not FORM TEMPLATES (the blank PDF forms). Your form library and mapping configurations persist for future use - only the sensitive data is removed.
How Stateless Mode Works
Workflow:
- Session Creation: You enable "Stateless Mode" when creating a new form filling session
- Upload Sources: Upload sensitive documents (medical records, financial statements, personal information)
- AI Processing: Sources are processed in real time to extract data and fill the form
- Data in Memory: During processing, source data exists only in volatile memory, never written to persistent storage
- Form Completion: Filled form is generated and ready for download
- Immediate Deletion: Upon session completion (or 1 hour of inactivity), all source documents are permanently deleted from memory and temporary storage
- Verification: You receive deletion confirmation with cryptographic proof of data destruction
Timeline:
- Normal mode: Sources retained for 30-90 days
- Stateless mode: Sources deleted within seconds of session completion
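The lifecycle above (hold sources in memory, purge on completion or after an hour of inactivity, return a deletion receipt) can be sketched as follows. Everything here is an assumption for illustration: the `StatelessSession` class is hypothetical, and the receipt is simply a SHA-256 hash over the deleted source IDs, since the actual proof format is not described. A hash like this records what was removed but does not by itself prove destruction, and clearing a Python dict is not a secure memory wipe.

```python
import hashlib
import time
from typing import Optional

INACTIVITY_LIMIT_S = 3600  # 1 hour of inactivity, per the workflow above


class StatelessSession:
    """Hypothetical in-memory session; sources are never written to disk here."""

    def __init__(self, now: Optional[float] = None) -> None:
        self._sources: dict[str, bytes] = {}
        self._last_activity = time.time() if now is None else now

    def add_source(self, source_id: str, content: bytes, now: Optional[float] = None) -> None:
        """Store a source in memory and record the activity time."""
        self._sources[source_id] = content
        self._last_activity = time.time() if now is None else now

    def is_idle(self, now: float) -> bool:
        """True once the inactivity limit has elapsed since the last activity."""
        return now - self._last_activity >= INACTIVITY_LIMIT_S

    def source_count(self) -> int:
        return len(self._sources)

    def purge(self) -> dict:
        """Drop all sources and return a receipt describing what was removed."""
        deleted_ids = sorted(self._sources)
        self._sources.clear()
        receipt = hashlib.sha256("\n".join(deleted_ids).encode("utf-8")).hexdigest()
        return {"deleted_ids": deleted_ids, "receipt_sha256": receipt}


session = StatelessSession(now=0.0)
session.add_source("source_456", b"pay stub bytes", now=10.0)
if session.is_idle(now=3700.0):
    receipt = session.purge()
```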
When to Use Stateless Mode
Strongly Recommended For:
- Healthcare: HIPAA-covered entities processing patient health information (PHI) for medical forms, insurance claims, patient registration
- Financial Services: Banks and financial institutions handling tax returns, income verification, account statements, credit reports
- Legal: Law firms processing privileged attorney-client communications, sensitive case documents, confidential settlements
- Government: Agencies handling classified information, security clearance documents, sensitive personal data
- Compliance-Heavy Industries: Any organization subject to data retention restrictions or "right to be forgotten" regulations
Examples:
- Medical Billing Office: Filling insurance claim forms from patient medical records - PHI must not be retained longer than necessary
- Law Firm: Immigration applications using client passports, birth certificates, financial records - privileged and confidential
- Mortgage Broker: Loan applications from borrowers' tax returns, pay stubs, bank statements - highly sensitive financial data
- Government Contractor: Security clearance forms using classified personnel files - retention is prohibited
- HR Department: Employee onboarding forms from I-9 documents, background checks - minimize liability exposure
Stateless Mode vs Standard Mode
| Aspect | Standard Mode | Stateless Mode |
|---|---|---|
| Source Retention | 30-90 days (configurable) | 0 days (immediate deletion) |
| Session Review | Can review sources anytime | Sources unavailable after completion |
| Reprocessing | Reprocess forms from same sources | Must re-upload sources to reprocess |
| Audit Trail | Full trail linking forms to sources | Form metadata only (no source content) |
| Compliance | Standard data protection | Maximum data minimization |
| Cost | Included in all plans | Same price (no additional cost) |
| Profile Creation | Save sources as profiles | Must extract profile data before session ends |
| Use Case | General forms | Highly sensitive documents |
Enabling Stateless Mode
Per-Session Activation:
- Create new form filling session
- Before uploading sources, toggle "Enable Stateless Mode"
- Confirmation prompt: "Sources will be permanently deleted after form completion. Continue?"
- Upload sources and fill form as normal
- Upon completion, receive deletion confirmation
Workspace Default (Enterprise only):
- Admins can enable stateless mode by default for all sessions
- Users cannot disable stateless mode (enforced security policy)
- Useful for organizations with blanket data retention restrictions
API Parameter:
{
  "form_id": "form_123",
  "sources": ["source_456", "source_789"],
  "stateless": true
}
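For reference, a request carrying this payload could be built with the Python standard library as sketched below. The endpoint URL is a hypothetical placeholder because the path is not given here, and the Bearer header is an assumption based on the JWT authentication mentioned elsewhere in this page. The request is constructed but not sent.

```python
import json
import urllib.request


def build_session_request(
    url: str,
    token: str,
    form_id: str,
    source_ids: list[str],
    stateless: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) a POST request with the stateless flag set."""
    payload = {"form_id": form_id, "sources": list(source_ids), "stateless": stateless}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


req = build_session_request(
    "https://api.example.com/sessions",  # hypothetical endpoint
    "YOUR_TOKEN",
    "form_123",
    ["source_456", "source_789"],
)
# Sending would be: urllib.request.urlopen(req, timeout=10)
```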
Compliance Benefits
Stateless mode helps meet strict data retention requirements:
HIPAA (Healthcare):
- Minimum Necessary Standard: Only retain data as long as needed for immediate use
- Retention Limits: Automatic deletion after processing meets data minimization requirements
- Security Rule Compliance: Reduces breach risk by eliminating persistent storage of PHI
GDPR (EU Data Protection):
- Data Minimization (Article 5): Store only what's necessary for a specific purpose
- Right to Erasure (Article 17): Immediate deletion supports "right to be forgotten"
- Storage Limitation: Stateless processing inherently limits storage duration
SOX (Financial Compliance):
- Reduced Liability: Minimizing retention of financial documents reduces regulatory exposure
- Audit Requirements: Form metadata and cryptographic deletion proof provide an audit trail without retaining source content
CCPA (California Privacy):
- Data Collection Transparency: Users know their data is immediately deleted
- Deletion Requests: Automatic deletion exceeds consumer deletion request requirements
Limitations and Trade-Offs
What You Give Up with Stateless Mode:
- No Historical Review: Cannot review source documents after session completion
- No Reprocessing: Must re-upload sources if you need to regenerate the form
- Limited Audit Trail: Form shows which fields were filled but not the source content
- No Profile Creation After the Fact: Must save profile during session; can't create later from historical sources
Mitigation Strategies:
- Save Profiles First: If you'll reuse source data, create profile before ending session
- Export Source Data: Download extracted field data as JSON for your records (field data only, not full source documents)
- Download Filled Forms Immediately: Don't rely on platform storage - save filled forms to your local systems
- Keep Local Copies: Maintain source documents in your secure local storage if future review is needed
Technical Implementation
Security Measures:
- In-Memory Processing: Sources processed in RAM only, never touching persistent disk
- Encrypted Memory: Memory pages containing source data are encrypted even during processing
- Secure Deletion: Multi-pass overwrite and cryptographic deletion upon completion
- No Backups: Stateless sessions excluded from automated backup systems
- Deletion Proof: Cryptographic hash of deletion operation provided as verification
Performance:
- Stateless mode adds <200ms overhead for deletion verification
- Processing speed identical to standard mode
- No impact on accuracy or functionality
Common Questions
What file formats are supported as sources?
Instafill.ai supports virtually all common document formats:
Documents:
- PDF (including scanned/image-based PDFs with OCR via Google Cloud Vision)
- Word (.docx, .doc, .docm)
- Text (.txt, .rtf)
Spreadsheets:
- Excel (.xlsx, .xls)
- CSV (comma-separated values)
- TSV (tab-separated values)
Images:
- JPEG/JPG
- PNG
- TIFF
- BMP
- HEIC (iPhone photos)
Other:
- Email (EML format or forward to submission address)
- Plain text (paste directly into chat)
Files up to 100MB per source are supported. For larger files, contact support for enterprise options.
How does the AI know which source data goes in which field?
The AI uses multiple techniques to create intelligent mappings:
Semantic Matching: Understanding the meaning of field labels and source content. "Applicant's Full Name" matches the "Name:" section in a resume.
Pattern Recognition: Identifying data types. Phone numbers, dates, addresses, emails, SSNs are recognized by format.
Contextual Understanding: Using surrounding fields for context. "Start Date" in an employment section uses job start date, not document creation date.
Page-Scoped Retrieval: Source text is mapped to form page numbers — when filling fields on a specific page, only source text relevant to that page section reaches the AI, keeping extraction precise.
Fine-Tuned Examples: For forms you use frequently, corrections you make in the visual editor are saved as examples that improve accuracy for the same field in future sessions.
Confidence Scoring: Each mapping receives a confidence score. Low-confidence mappings are flagged for your review.
You can always override AI decisions and provide corrections, which improve accuracy for similar forms in the future.
Can I use multiple sources for a single form?
Yes! This is one of Instafill.ai's most powerful features. You can upload any number of sources to a single session, and the AI will intelligently combine information from all of them.
For example, filling a 1003 mortgage application might use:
- W-2 (PDF) → income and employer information
- Bank statement (PDF) → asset and account details
- Driver's license (image) → current address and ID verification
- Previous 1003 (PDF) → applicant identity fields from a prior application
The AI understands that different information comes from different sources, resolves conflicts by flagging fields where sources disagree, and creates a unified, complete form from all of them.
What happens if source data conflicts?
When the AI finds conflicting information across sources, it:
- Prioritizes Recent Data: Newer documents generally override older ones
- Flags Conflicts: Alerts you to fields with conflicting source data in the visual editor (yellow border)
- Provides Transparency: Shows which source was used for each field
- Enables Manual Resolution: Allows you to select which source to trust
For example, if your driver's license shows one address but your resume shows a different address, the AI flags this conflict and lets you choose which is correct before finalizing the PDF.
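The conflict handling described above can be sketched as: prefer the newest document, flag the field when values disagree, and keep every option visible for manual resolution. The `Candidate` type, its `document_date` field, and `resolve_field` are hypothetical names; how the real system determines document recency is not specified here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    """One source's proposed value for a form field."""
    source_name: str
    value: str
    document_date: date  # hypothetical: when the source document was issued


def resolve_field(candidates: list[Candidate]) -> dict:
    """Prefer the newest source, and flag the field when sources disagree."""
    if not candidates:
        raise ValueError("at least one candidate value is required")
    newest = max(candidates, key=lambda c: c.document_date)
    # Compare values case-insensitively so trivial differences are not flagged.
    distinct_values = {c.value.strip().lower() for c in candidates}
    return {
        "value": newest.value,
        "source": newest.source_name,
        "conflict": len(distinct_values) > 1,
        "options": {c.source_name: c.value for c in candidates},
    }


address = resolve_field([
    Candidate("drivers_license.jpg", "12 Elm St", date(2019, 6, 1)),
    Candidate("resume.pdf", "98 Oak Ave", date(2024, 2, 10)),
])
```

Here the resume's address is pre-selected because it is newer, but `conflict` is set so the field would still be surfaced for the user to confirm.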
Can I automatically fill forms from incoming emails?
Yes! Each workspace has a unique email submission address. When you forward an email to this address:
- Email content and attachments become sources
- You specify which form template to use (or the AI suggests appropriate forms)
- A session is automatically created
- The form is filled using email content and attachments
- You receive a notification when the filled form is ready for review
This is particularly useful for workflows where clients regularly send information via email that needs to be transcribed into standard forms — intake packets, insurance applications, vendor contracts.
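To show how an email's content and attachments can become sources, the sketch below parses a raw EML message with Python's standard `email` package. The `sources_from_email` function and the dictionary shape it returns are hypothetical; the actual ingestion service may work differently.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def sources_from_email(raw: bytes) -> list[dict]:
    """Turn an EML message into a list of source records (body first, then attachments)."""
    msg = message_from_bytes(raw, policy=policy.default)
    sources = []
    body = msg.get_body(preferencelist=("plain",))
    if body is not None:
        sources.append({
            "kind": "text",
            "name": str(msg["subject"] or "email body"),
            "content": body.get_content(),
        })
    for part in msg.iter_attachments():
        sources.append({
            "kind": "file",
            "name": part.get_filename() or "attachment",
            "content": part.get_payload(decode=True),  # raw attachment bytes
        })
    return sources


# Build a small message in memory to exercise the parser.
demo = EmailMessage()
demo["Subject"] = "Intake packet"
demo.set_content("Patient: Jane Doe")
demo.add_attachment(b"%PDF-1.4 demo", maintype="application",
                    subtype="pdf", filename="card.pdf")

sources = sources_from_email(demo.as_bytes())
```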
How accurate is data extraction from scanned documents or images?
Accuracy depends on image quality:
- Clear scans/photos (300+ DPI, good lighting): 98-99% accurate OCR
- Standard scans (200 DPI, normal lighting): 95-97% accurate
- Low-quality images (blurry, poor lighting, <150 DPI): 85-92% accurate
The system is robust to common issues like:
- Rotated or skewed images (auto-correction via OpenCV)
- Handwritten text (when legible)
- Multi-column layouts (preserves reading order)
- Mixed fonts and sizes
For critical documents, scan at 300 DPI or higher. The system will alert you if image quality is too low for reliable extraction.
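If you want to check a scan before uploading, effective DPI can be estimated from pixel dimensions and the physical page size. The sketch below assumes a US Letter page (8.5 x 11 inches) and uses the 300 and 150 DPI thresholds from the tiers above; the function names and advice strings are hypothetical and are not part of the product.

```python
def estimate_dpi(
    width_px: int,
    height_px: int,
    page_width_in: float = 8.5,
    page_height_in: float = 11.0,
) -> float:
    """Estimate scan resolution, taking the lower of the two axis estimates."""
    return min(width_px / page_width_in, height_px / page_height_in)


def scan_advice(dpi: float) -> str:
    """Map an estimated DPI onto the quality tiers described above."""
    if dpi >= 300:
        return "ok"
    if dpi < 150:
        return "rescan recommended"
    return "usable, review extracted fields"


dpi = estimate_dpi(2550, 3300)  # a full Letter page scanned at 300 DPI
print(dpi, scan_advice(dpi))
```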
Can I save extracted data for reuse in future forms?
Yes! After the AI extracts data from sources, you can:
- Create a Profile: Save extracted data as a reusable profile for future sessions
- Export Field Data: Download extracted data as JSON or CSV for external use
- Source Library: Add frequently-used source documents to the source library for one-click inclusion in future sessions
This is especially useful for frequently-used information like company details, personal information, or standard form responses that don't change between submissions. For processing multiple forms at once, see batch processing.
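Exported field data is easy to post-process in either format. As a minimal sketch, the helpers below serialize a flat mapping of field names to values as JSON and CSV; the flat structure and the `field` / `value` column names are assumptions, since the actual export schema is not described here.

```python
import csv
import io
import json


def export_json(fields: dict[str, str]) -> str:
    """Serialize field data as pretty-printed JSON with stable key order."""
    return json.dumps(fields, indent=2, sort_keys=True)


def export_csv(fields: dict[str, str]) -> str:
    """Serialize field data as two-column CSV, one row per field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["field", "value"])
    writer.writerows(sorted(fields.items()))
    return buffer.getvalue()


extracted = {"last_name": "Doe", "first_name": "Jane"}
print(export_json(extracted))
print(export_csv(extracted))
```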