Document Format Conversion
Convert non-PDF inputs to PDF using Adobe PDFServices SDK 4.3.1, PIL/Pillow, and Google Cloud Vision OCR before entering the form-filling pipeline
Overview
Instafill.ai's form-filling pipeline operates on PDF files with AcroForm fields. Document conversion is the ingestion layer that normalizes every non-PDF input into a PDF before any downstream processing begins. A Word document template for a legal contract, a stack of scanned TIFF images from a paper form batch, or a PNG photograph of a hand-delivered application — all are converted to PDF first, then passed to field extraction or flat-to-fillable conversion depending on whether the resulting PDF has AcroForm fields.
Word-to-PDF conversion uses Adobe PDFServices SDK 4.3.1 as the primary path, invoked through AdobePdfService.cs in the .NET service layer, with a python-docx plus ReportLab fallback for documents that cannot be processed by the SDK. Image-to-PDF conversion uses PIL/Pillow for straightforward raster-to-PDF wrapping, pdf2image and OpenCV 4.9.0 for preprocessing (deskew, contrast normalization, noise reduction), and Google Cloud Vision API for OCR when the image contains text that needs to be preserved as a searchable layer in the output PDF.
Key Capabilities
- Word-to-PDF via Adobe PDFServices SDK 4.3.1: Primary conversion path invoked by AdobePdfService.cs; preserves layout, embedded fonts, and table structure from .docx files
- python-docx + ReportLab fallback: Used when the SDK path is unavailable or the input is a programmatically generated .docx without complex layout dependencies
- Image-to-PDF via PIL/Pillow: Wraps PNG, JPEG, TIFF (multi-page), and BMP files into PDF containers, one image per page
- OpenCV 4.9.0 preprocessing: Deskew, binarization, and noise reduction applied to scanned images before PDF wrapping to improve downstream field detection
- Google Cloud Vision API OCR: Extracts text from image-based inputs and embeds a searchable text layer in the output PDF; used for printed-form scans where the text layer is required by flat-to-fillable conversion's detect_blanks() fallback
- All inputs converted before pipeline entry: No non-PDF format enters field extraction or flat-to-fillable conversion; conversion is a mandatory gate in the ingestion flow
- Batch conversion: Multiple files of the same or mixed types can be converted in a single operation
How It Works
Input classification: The uploaded file's MIME type and extension are inspected. PDF files pass through without conversion. Word (.docx, .doc), image (PNG, JPEG, TIFF, BMP), and HTML files are routed to the appropriate converter.
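A minimal sketch of this routing gate is shown below. The classify_input() helper and the converter labels it returns are hypothetical illustrations, not the service's actual API; the point is that PDFs pass through untouched while everything else is dispatched to a converter.

```python
import mimetypes
from pathlib import Path

# Hypothetical routing table; the real service dispatches inside its ingestion layer.
WORD_EXTENSIONS = {".docx", ".doc"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

def classify_input(path: str) -> str:
    """Return which conversion path a file takes: 'pdf', 'word', 'image', or 'html'."""
    ext = Path(path).suffix.lower()
    mime, _ = mimetypes.guess_type(path)

    if ext == ".pdf" or mime == "application/pdf":
        return "pdf"      # passes through without conversion
    if ext in WORD_EXTENSIONS:
        return "word"     # Adobe PDFServices SDK, python-docx + ReportLab fallback
    if ext in IMAGE_EXTENSIONS or (mime or "").startswith("image/"):
        return "image"    # OpenCV preprocessing + PIL/Pillow wrapping, optional OCR
    if ext in {".html", ".htm"}:
        return "html"
    raise ValueError(f"Unsupported input format: {ext or mime}")
```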
Word-to-PDF conversion: .docx and .doc files are sent to Adobe PDFServices SDK 4.3.1 via AdobePdfService.cs. The SDK handles complex Word layouts — multi-column sections, embedded images, tables with merged cells, headers and footers — and produces a PDF that preserves the visual fidelity of the source document. If the SDK call fails (network error, unsupported feature, license limit), python-docx parses the document structure and ReportLab renders it to PDF as a fallback. The fallback produces a structurally correct but less visually precise PDF, suitable for forms with simple layouts.
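A minimal sketch of that fallback path is below. It uses the libraries named above (python-docx and ReportLab), but the function name is hypothetical and the rendering is deliberately simplified: only paragraph flow is carried over, with tables, images, and custom styling omitted.

```python
# Fallback sketch: python-docx reads the document structure, ReportLab re-renders it.
# Hypothetical function name; complex layouts are what the Adobe SDK path is for.
from xml.sax.saxutils import escape

from docx import Document
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

def docx_to_pdf_fallback(docx_path: str, pdf_path: str) -> None:
    source = Document(docx_path)
    styles = getSampleStyleSheet()
    story = []
    for para in source.paragraphs:
        text = para.text.strip()
        if text:
            # Each source paragraph becomes a ReportLab flowable; escape() guards
            # against characters that Paragraph's mini-markup would misread.
            story.append(Paragraph(escape(text), styles["Normal"]))
            story.append(Spacer(1, 6))
    SimpleDocTemplate(pdf_path, pagesize=letter).build(story)
```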
Image preprocessing with OpenCV 4.9.0: Before wrapping images into PDF, OpenCV applies deskew (correcting page tilt introduced by scanner misalignment), binarization (converting grayscale scans to clean black-and-white for better OCR), and noise reduction (removing speckles and background artifacts from low-quality scans). This preprocessing step directly improves the accuracy of detect_boxes_fitz() and detect_blanks() in the downstream flat-to-fillable pipeline.
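The three operations can be sketched with standard OpenCV calls, as below. The thresholds and the deskew heuristic (minimum-area rectangle over the ink pixels) are illustrative defaults, not the service's tuned parameters.

```python
import cv2
import numpy as np

def preprocess_scan(image_path: str) -> np.ndarray:
    """Denoise, binarize, and deskew a scanned page before PDF wrapping."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Noise reduction: remove speckle from low-quality scans.
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Binarization: Otsu threshold to clean black-and-white.
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskew: estimate page tilt from the minimum-area rectangle around ink pixels.
    coords = cv2.findNonZero(cv2.bitwise_not(binary))
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:          # minAreaRect reports angles in (0, 90]; fold back to a small tilt
        angle -= 90
    h, w = binary.shape
    rotation = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(binary, rotation, (w, h),
                          flags=cv2.INTER_CUBIC, borderValue=255)
```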
OCR with Google Cloud Vision API: For image inputs that will need a searchable text layer — scanned forms where detect_blanks() requires the pdfplumber text layer to find blank regions — the preprocessed image is sent to Google Cloud Vision API. The API returns word-level bounding boxes and text strings, which are embedded as invisible text on the corresponding PDF page. This produces a text-searchable PDF where the raster image is the visible layer and the OCR output is the parseable layer.
Image-to-PDF wrapping with PIL/Pillow: PIL/Pillow writes the preprocessed image (or images, for multi-page TIFFs) into a PDF container. Each image becomes one PDF page. Page dimensions match the image dimensions at the source DPI so that coordinate-based field detection in the downstream pipeline operates at the correct scale.
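A sketch of the OCR branch is below, under a few assumptions: the page image is already preprocessed, ReportLab's canvas is used to place the raster and the invisible words (Pillow alone can write an image-only PDF via Image.save(..., format="PDF", save_all=True) but cannot add text objects), and the coordinate and font-size handling is simplified rather than matching the service's DPI-aware logic.

```python
# Sketch: one scanned page -> one PDF page with an invisible OCR text layer.
# Assumes Google Cloud Vision credentials are configured in the environment.
from google.cloud import vision
from PIL import Image
from reportlab.lib.utils import ImageReader
from reportlab.pdfgen import canvas

def image_to_searchable_pdf(image_path: str, pdf_path: str) -> None:
    img = Image.open(image_path)
    width, height = img.size  # 1 px = 1 pt here; real code would honor the source DPI

    # Word-level OCR: entries after [0] in text_annotations are individual words.
    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as f:
        response = client.text_detection(image=vision.Image(content=f.read()))
    words = response.text_annotations[1:]

    page = canvas.Canvas(pdf_path, pagesize=(width, height))
    page.drawImage(ImageReader(img), 0, 0, width=width, height=height)

    for word in words:
        vertices = word.bounding_poly.vertices
        x = vertices[0].x
        y = height - vertices[2].y            # Vision uses a top-left origin, PDF bottom-left
        box_height = max(vertices[2].y - vertices[0].y, 1)
        text = page.beginText(x, y)
        text.setTextRenderMode(3)             # 3 = invisible: searchable but not drawn
        text.setFont("Helvetica", box_height)
        text.textLine(word.description)
        page.drawText(text)

    page.showPage()
    page.save()
```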
PDF handoff to pipeline: The converted PDF is passed to the form-filling pipeline. If the PDF contains AcroForm widgets, it goes to extract_pdf_field_names() for field extraction. If it is a flat PDF (no AcroForm layer), it goes to the flat-to-fillable conversion pipeline starting with detect_boxes_fitz().
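The handoff test itself reduces to checking for AcroForm widgets. A sketch using PyMuPDF (the fitz library implied by detect_boxes_fitz()) is below; has_acroform_fields() and route_converted_pdf() are hypothetical stand-ins, not the pipeline's actual functions.

```python
import fitz  # PyMuPDF

def has_acroform_fields(pdf_path: str) -> bool:
    """True if any page carries at least one AcroForm widget."""
    with fitz.open(pdf_path) as doc:
        return any(next(page.widgets(), None) is not None for page in doc)

def route_converted_pdf(pdf_path: str) -> str:
    # Hypothetical dispatcher: the real pipeline calls extract_pdf_field_names()
    # for fillable PDFs, or starts flat-to-fillable conversion with detect_boxes_fitz().
    return "field-extraction" if has_acroform_fields(pdf_path) else "flat-to-fillable"
```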
Use Cases
Document conversion is used wherever forms or supporting documents exist in non-PDF formats. Legal teams convert Word contract templates (.docx) to PDF using Adobe PDFServices SDK so the resulting PDF can be processed by flat-to-fillable conversion and then filled programmatically from client data. Healthcare providers send batches of scanned paper CMS-1500 forms (TIFF or JPEG) through OpenCV preprocessing and Google Cloud Vision OCR, producing searchable PDFs that the flat-to-fillable pipeline can then convert into fillable AcroForms. Immigration attorneys receive supporting documents (birth certificates, tax transcripts) as JPEG scans and convert them to PDF before merging them into USCIS application packets via the PDF manipulation layer. Businesses that maintain form templates in Word convert them to PDF once, then use the converted PDF as the base for repeated fill sessions without rerunning conversion.
Benefits
- Single-format pipeline: Converting all inputs to PDF before processing means field extraction, flat-to-fillable conversion, and PDF manipulation operate on one format with well-defined behavior — no format-specific edge cases propagate into the core pipeline
- OpenCV preprocessing improves downstream accuracy: Deskew and binarization before PDF wrapping directly improve detect_boxes_fitz() hit rates on scanned forms; a 5° page tilt that would cause border detection to miss horizontal lines is corrected before the detector runs
- SDK fidelity for Word forms: Adobe PDFServices SDK 4.3.1 preserves Word table structures that are common in legal and HR form templates — merged cells, nested tables, header rows — which would be flattened or broken by a simple rendering-based conversion
- OCR layer enables text-based detection: Google Cloud Vision's word-level bounding boxes embedded as PDF text allow detect_blanks() to find fill areas on scanned forms by analyzing text layout, rather than relying solely on visual border detection
Security & Privacy
Document conversion processes input files in memory within the conversion service; source files and intermediate raster images are not written to permanent storage during conversion. Google Cloud Vision API calls transmit image data to Google's OCR infrastructure under the terms of the service agreement; no OCR results are retained beyond the current conversion request. Adobe PDFServices SDK calls transmit document data to Adobe's cloud processing endpoints; documents are not retained after conversion completes per Adobe's data handling policy. Data is scoped to workspaceId and protected via the shared JWT authentication middleware running in both the .NET and Python service layers. All conversion events are written to the audit log with user ID, workspace ID, source file name, target format, and timestamp.
Common Questions
Does Word-to-PDF conversion preserve formatting perfectly?
Adobe PDFServices SDK 4.3.1 produces high-fidelity output for Word documents that use standard formatting. Documents with standard paragraph styles, embedded images, simple and complex tables, headers and footers, and standard fonts typically convert with 90–95% visual fidelity. The python-docx plus ReportLab fallback is less precise — it correctly handles paragraph flow and basic tables but may lose custom spacing, complex table formatting, or non-standard fonts. For form templates where exact field position matters, validate the converted PDF visually before using it as a flat-to-fillable source; minor layout shifts can affect field detection accuracy.
Can OCR extract text from handwritten documents?
Google Cloud Vision API is the OCR engine. It achieves 98–99% accuracy on machine-printed text and 70–85% on clearly hand-printed block letters at 300 DPI. Cursive handwriting accuracy drops to 40–60% and is highly dependent on writing clarity. For handwritten form responses, OCR produces a best-effort text layer that may contain errors; the raster image layer in the output PDF is always accurate and serves as the authoritative visual record. Downstream flat-to-fillable conversion uses the text layer only for field boundary detection, not for reading handwritten values, so OCR errors in the text layer do not corrupt field positions derived from detect_boxes_fitz().
What happens to form fields when converting a Word document to PDF?
Word documents converted via Adobe PDFServices SDK 4.3.1 do not produce AcroForm fields in the output PDF — Word form controls (content controls, legacy form fields) are rendered as their visual appearance in the PDF, not as AcroForm widgets. The resulting PDF is a flat PDF with no fillable field layer. To make it fillable, it must be processed by flat-to-fillable conversion, which will detect the visual field boundaries and synthesize AcroForm widgets. This two-step path (Word → PDF via SDK → flat-to-fillable conversion) is the standard workflow for converting Word form templates into fillable AcroForm PDFs.
Can I convert multiple files at once?
Yes. Batch conversion accepts multiple files of the same or mixed types in a single operation. All files are processed by parallel workers, and each file follows its type-appropriate conversion path (Word files to Adobe PDFServices SDK, images to OpenCV + PIL/Pillow + optional Google Cloud Vision). Output PDFs are available as individual files or as a ZIP archive. A batch of 50 scanned TIFF forms processed through OpenCV and Google Cloud Vision typically completes in 8–15 minutes, depending on image resolution and page count per file.
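A batch run is essentially the single-file routing applied under a worker pool. The sketch below uses concurrent.futures; convert_one() is a hypothetical per-file entry point standing in for the service's real converter, and the worker count and queueing behavior are illustrative.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def convert_one(path: str) -> str:
    """Hypothetical per-file entry point: dispatches to the Word, image, or
    pass-through path and returns the output PDF path."""
    output = str(Path(path).with_suffix(".pdf"))
    ...  # route on extension/MIME type as sketched under "Input classification"
    return output

def convert_batch(paths: list[str], max_workers: int = 4) -> dict[str, str]:
    """Convert a mixed batch in parallel; returns {source path: output PDF path}."""
    results: dict[str, str] = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, p): p for p in paths}
        for future in as_completed(futures):
            source = futures[future]
            try:
                results[source] = future.result()
            except Exception as exc:   # one bad scan should not sink the whole batch
                results[source] = f"failed: {exc}"
    return results
```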
Are there file size or page limits for conversion?
Limits vary by plan:
Maximum document size and page count:
- Free: 10 MB, 20 pages
- Starter: 50 MB, 100 pages
- Professional: 200 MB, 500 pages
- Enterprise: 1 GB, 5,000 pages
Adobe PDFServices SDK enforces its own per-request size limits; documents approaching the plan maximum may need to be split before conversion and merged afterward using the PDF manipulation layer. Google Cloud Vision API has a 20 MB limit per image request; images above this threshold are downsampled before OCR and then upsampled back before PDF wrapping, which may reduce OCR accuracy slightly.
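Staying under the Vision request limit can be checked up front. The sketch below is a hedged illustration of that guard: the 20 MB figure comes from the text above, and the halving loop is one simple strategy rather than the service's actual resampling logic.

```python
import io
from PIL import Image

VISION_LIMIT_BYTES = 20 * 1024 * 1024  # per-image request limit cited above

def fit_under_vision_limit(image_path: str) -> bytes:
    """Return PNG bytes for OCR, downscaling the image until it fits the request limit."""
    img = Image.open(image_path)
    while True:
        buffer = io.BytesIO()
        img.save(buffer, format="PNG")
        data = buffer.getvalue()
        if len(data) <= VISION_LIMIT_BYTES:
            return data
        # Halve each dimension and retry; as noted above, OCR accuracy may drop slightly.
        img = img.resize((img.width // 2, img.height // 2), Image.LANCZOS)
```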