File Management System

Upload, organize, and access forms, source documents, and filled PDFs — stored in Azure Blob Storage with full workspace isolation

Overview

File storage in Instafill.ai is backed by Azure Blob Storage. Every file reference carries workspace and user context — this context is included in all storage operations and enforced by authentication middleware across all service layers, so files cannot be accessed across workspace boundaries.
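
As an illustration only, the workspace check described above might look like the following sketch. The `FileRef` class, the `check_access` helper, and the workspace-prefixed blob path are hypothetical names and layout chosen for this example; they are not taken from the Instafill.ai codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRef:
    """Hypothetical file reference carrying workspace and user context."""
    workspace_id: str
    user_id: str
    blob_name: str

    def blob_path(self) -> str:
        # Assumed layout: every blob name is prefixed with its workspace ID.
        return f"{self.workspace_id}/{self.blob_name}"


def check_access(ref: FileRef, caller_workspace_id: str) -> None:
    """Raise PermissionError when the caller is outside the file's workspace."""
    if ref.workspace_id != caller_workspace_id:
        raise PermissionError("file belongs to a different workspace")
```

In this sketch the check runs before any storage call, so a request from another workspace fails before a blob path is ever resolved.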

Files enter the pipeline in three forms: form templates uploaded and indexed; source documents uploaded to sessions or profiles (profile batch uploads support up to 10 files per request); and filled PDF outputs generated after composing the final overlay. Non-PDF source files are converted before processing — Word documents via Google Drive API or Adobe PDF Services; images via vision AI for OCR.

Deleted files enter a 30-day soft-delete trash period before permanent removal. Files processed through the email integration are deleted from storage 24 hours after processing. Stateless mode sessions delete source files immediately on session completion.

Key Capabilities

  • Azure Blob Storage Backend: All binary files stored in Azure Blob Storage with workspace-scoped access control
  • Workspace-Scoped Access: Every file reference includes workspace and user context; authentication middleware enforces workspace boundaries
  • Form Template Storage: Templates are indexed and deduplicated by field hash — common government forms are processed once and shared across uploads
  • Source Document Uploads: Session and profile sources stored in blob; profile batch accepts up to 10 files per request
  • Filled PDF Output: Composed PDFs are written to blob storage and made available for download
  • Format Conversion: Word → PDF via Google Drive API or Adobe PDF Services; images → text via vision AI OCR
  • OCR Pipeline: Scanned documents processed via vision AI after image pre-processing
  • 30-Day Trash: Deleted files enter soft-delete trash; permanent deletion after 30 days
  • Stateless Mode: Source files deleted immediately on session completion (no retention)
  • Email Attachment Retention: Attachments processed via email integration deleted from blob after 24 hours

How It Works

  1. Form Template Upload:

    • PDF uploaded and processed to extract fields and text per page
    • Field hash checked for deduplication — if a matching form already exists in the catalog, the new upload clones it rather than reprocessing
    • Form blob and metadata stored with workspace context
  2. Source Document Upload:

    • Files uploaded to a session or profile source list
    • Profile files are merged into session sources before autofill runs
    • Processing status available at GET /api/sessions/{session_id}/process-sources-status
    • Batch profile uploads: max 10 files per request
    • Text extracted from source documents is encrypted before database storage; the binary file stays in blob storage
  3. Non-PDF Conversion:

    • Word (.docx/.doc): converted via Google Drive API or Adobe PDF Services
    • Images (JPEG, PNG, TIFF): pre-processed, then passed to Google Cloud Vision API for OCR; output assembled into PDF
    • All non-PDF formats converted to PDF before entering the fill pipeline
  4. Filled PDF Output:

    • After autofill populates the field values, the composed PDF is written to Azure Blob Storage
    • Output blob URL returned in the session response and available for download
  5. File Deletion:

    • User-deleted files enter 30-day soft-delete trash; blob not removed until expiry
    • Stateless mode: blobs deleted immediately after session completion
    • Email attachments: blob deleted after 24-hour retention window
    • GDPR erasure: deletion propagates across all file types within the workspace
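
A client that uploads more than 10 profile files has to split them across requests. The sketch below shows one way to do that, along with a helper that builds the status path quoted in step 2. The function names are hypothetical, and the upload call itself is omitted because its endpoint and payload are not described here.

```python
from typing import Iterator, List, Sequence

MAX_PROFILE_BATCH = 10  # per-request limit for profile batch uploads


def batches(files: Sequence[str], size: int = MAX_PROFILE_BATCH) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` files."""
    for start in range(0, len(files), size):
        yield list(files[start:start + size])


def status_path(session_id: str) -> str:
    """Build the processing-status path for a session."""
    return f"/api/sessions/{session_id}/process-sources-status"
```

Each group yielded by `batches` would go out as its own upload request; `status_path` gives the path to poll afterwards.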

Use Cases

Law firms managing immigration matters store client document sets as profile sources — each client's I-485 supporting documents, passport scans, and affidavits uploaded once and reused across all related filings. HR departments maintain a library of onboarding form templates (W-4, I-9, direct deposit) so new-hire packets are assembled from already-indexed templates without re-uploading. Healthcare providers use stateless mode for PHI source documents — files are processed and immediately deleted, leaving no residual storage footprint.

Benefits

  • Consistent Access Control: Workspace context in every file reference means access control is enforced at the storage operation level, not just in the UI
  • Deduplication: Field-hash deduplication means common government forms (W-4, I-9, CMS-1500) are processed once and shared across uploads, reducing storage and re-indexing overhead
  • Retention Flexibility: Standard 30-day trash, immediate deletion in stateless mode, 24-hour email attachment cleanup — each workflow gets the retention behavior that fits its data sensitivity
  • Full Format Coverage: PDF, Word, and image file types are all supported — any common document type reaches the fill pipeline

Security & Privacy

All data is workspace-scoped and protected by JWT authentication middleware across all service layers.

Storage Security:

  • Azure Blob Storage with Azure Key Vault-managed encryption keys
  • Every file reference includes workspace context; cross-workspace access is blocked at the authentication middleware level
  • Text extracted from source documents is encrypted with workspace-scoped keys before database storage
  • Binary files (PDFs) are stored in blob storage with Azure-managed encryption

Retention Controls:

  • Standard: 30-day soft-delete trash; permanent deletion after expiry
  • Stateless mode: Immediate deletion on session completion (for PHI and sensitive source data)
  • Email attachments: 24-hour retention then deleted
  • GDPR erasure: Propagates across sessions, sources, forms, and profiles within the workspace
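
The three retention windows above can be expressed as a small lookup. This is a sketch with hypothetical names (`RetentionPolicy`, `purge_at`); it only models the durations listed here and assumes the clock starts at the user's delete action, the end of email processing, or session completion, respectively.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class RetentionPolicy(Enum):
    STANDARD_TRASH = timedelta(days=30)     # soft-delete trash window
    EMAIL_ATTACHMENT = timedelta(hours=24)  # window after email processing
    STATELESS = timedelta(0)                # removed on session completion


def purge_at(policy: RetentionPolicy, start: datetime) -> datetime:
    """Return the time at which a file becomes eligible for permanent removal."""
    return start + policy.value


# Example: a file moved to trash now becomes purgeable 30 days later.
deleted_at = datetime.now(timezone.utc)
purge_time = purge_at(RetentionPolicy.STANDARD_TRASH, deleted_at)
```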

Access Logging:

  • All file access operations include workspace and user context
  • JWT claims validated per request — revoked workspace membership takes effect on the next request

Common Questions

Where are files actually stored?

All binary files are stored in Azure Blob Storage. Both service layers use the same Azure Blob Storage backend for uploads and downloads — source documents, extracted images, filled PDF outputs, form template uploads, and batch file imports all go through the same storage infrastructure.

Every file reference is created with workspace and user context embedded. Key material for storage-layer encryption is isolated in Azure Key Vault — not in application code.

Text content extracted from source documents is separately encrypted at the application layer before being written to the database; that ciphertext is distinct from the blob storage of the original binary file.

What happens to uploaded files after a session completes?

It depends on the session mode:

Standard sessions:

  • Source files remain in storage as part of the session record
  • Available for review and re-fill until manually deleted
  • Deleted files enter 30-day soft-delete trash before permanent removal

Stateless mode sessions:

  • Source files (blobs and encrypted text) deleted immediately when the session completes
  • No retention period — designed for workflows where source data must not persist after the fill
  • Filled PDF output is still returned in the API response before deletion

Email-triggered sessions:

  • Email attachments stored for 24 hours after processing
  • Deleted automatically after the retention window regardless of session state

Profile sources (not session-bound):

  • Retained as part of the profile record until explicitly deleted
  • Profile files are merged into session sources before each fill run; the profile copy is not affected by session deletion

How does deduplication work for form templates?

When a form template is uploaded, the system checks for deduplication before indexing:

  1. Field hash check: The extracted field set is hashed. If an existing form in the catalog has the same field hash, the new upload clones the existing form's field data — no re-extraction required.

  2. Visual fallback: If the field hash doesn't match, a hash of the rendered page image provides a secondary deduplication check. This covers flat PDFs that were converted to fillable forms.

  3. Cloning benefit: Common government forms (W-4, I-9, CMS-1500, 1003) uploaded by multiple workspaces resolve to the same canonical field set, reducing indexing overhead and ensuring consistent field definitions.

The clone is workspace-scoped: only the field schema is reused, and the cloned field data is not accessible across workspaces.
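
A minimal sketch of the field-hash step, under stated assumptions: the hash algorithm (SHA-256), the field representation (name and type), and the `FormCatalog` class are all hypothetical choices for illustration, and the visual fallback is left out.

```python
import hashlib
import json
from typing import Dict, List

Field = Dict[str, str]  # assumed shape: {"name": ..., "type": ...}


def field_hash(fields: List[Field]) -> str:
    """Hash the field set independently of the order fields were extracted in."""
    canonical = json.dumps(sorted((f["name"], f["type"]) for f in fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class FormCatalog:
    """In-memory stand-in for the form catalog."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[Field]] = {}

    def upload(self, workspace_id: str, fields: List[Field]) -> dict:
        key = field_hash(fields)
        cloned = key in self._schemas
        if not cloned:
            # First time this field set is seen: index it.
            self._schemas[key] = [dict(f) for f in fields]
        # Each workspace receives its own copy of the schema.
        return {
            "workspace_id": workspace_id,
            "field_hash": key,
            "fields": [dict(f) for f in self._schemas[key]],
            "cloned": cloned,
        }
```

Because every upload gets a copy of the schema rather than a shared object, editing one workspace's form in this sketch does not touch another's.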

What file formats can be used as source documents?

Most source documents are converted to PDF before entering the fill pipeline; plain text and email bodies are passed through as text:

PDFs: Passed directly to field extraction or OCR (vision AI for scanned PDFs).

Word Documents (.docx, .doc): Converted to PDF via Google Drive API or Adobe PDF Services.

Images (JPEG, PNG, TIFF, BMP): Pre-processed for deskewing and contrast normalization, then passed to Google Cloud Vision API for OCR before assembly into a PDF.

Plain Text: Passed directly to the fill pipeline as text content; encrypted before storage.

Email Body: Parsed from inbound email, treated as plain text source.

File size limits apply per plan tier; the email integration enforces a 24-hour retention policy for email attachments regardless of plan.
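
The routing described in this answer could be sketched as a lookup on file extension. The route labels, the `route_source` function, and the exact extension list (for example `.jpg`, `.tif`, and `.txt`) are assumptions made for this example; the actual service may detect formats differently.

```python
from pathlib import Path

# Hypothetical route labels keyed by lowercase file extension.
ROUTES = {
    ".pdf": "pdf",
    ".docx": "word_to_pdf",
    ".doc": "word_to_pdf",
    ".jpg": "image_ocr",
    ".jpeg": "image_ocr",
    ".png": "image_ocr",
    ".tif": "image_ocr",
    ".tiff": "image_ocr",
    ".bmp": "image_ocr",
    ".txt": "plain_text",
}


def route_source(filename: str) -> str:
    """Pick a processing route for a source file, or raise ValueError."""
    ext = Path(filename).suffix.lower()
    if ext not in ROUTES:
        raise ValueError(f"unsupported source format: {filename}")
    return ROUTES[ext]
```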
