Multilingual Support
Process documents and data in multiple languages with full Unicode support, preserving special characters and handling complex scripts
Overview
Multilingual Support enables Instafill.ai to process forms and source documents in languages beyond English across the entire fill pipeline, from document parsing and AI field extraction to final PDF output. The system handles Latin-based languages (Spanish, French, German, Portuguese), Cyrillic scripts (Bulgarian, Serbian, Ukrainian, Russian), Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and languages with diacritical marks or special characters (Polish, Vietnamese, Turkish). Character integrity is preserved throughout: UTF-8 encoding is used in the database, API layer, and PDF generation, and any font glyphs required to render special characters are embedded directly into the final PDF rather than referenced as system fonts.
This matters in two concrete scenarios. The first is cross-language field matching, where a Spanish-language form ("Nombre," "Dirección," "Ciudad") is filled from English source data, or vice versa; this is handled via semantic understanding of field label equivalence across languages. The second is script-specific rendering, where Arabic and Hebrew text requires right-to-left layout, bidirectional mixed-script handling (the Unicode Bidirectional Algorithm), and RTL-aware PDF field direction settings.
The system also handles common encoding pitfalls: Unicode normalization ensures that "José" stored as precomposed (U+00E9) and "José" stored as decomposed (e + combining acute U+0301) are treated as the same string, preventing duplicate or non-matching entries when source documents use different encoding conventions.
Key Capabilities
- 100+ Languages Supported: Process forms and data in virtually any written language
- Full Unicode Support: UTF-8 encoding throughout the entire system, which can represent every Unicode code point
- Special Character Preservation: Diacritics, umlauts, tildes, accents rendered correctly
- Complex Scripts: Arabic, Hebrew, Thai, Hindi, Chinese, Japanese, Korean
- Bidirectional Text: Right-to-left (RTL) scripts like Arabic/Hebrew handled via Unicode BiDi Algorithm
- Mixed-Language Documents: Forms with multiple languages (e.g., English + Spanish sections)
- Automatic Language Detection: AI detects language of source data automatically
- Cross-Language Matching: Match English field labels to data in other languages (and vice versa)
- Character Normalization: Handles different Unicode representations of the same character (é vs e + ´)
- Font Embedding: Embeds required font glyphs into final PDF — no dependency on system fonts
- OCR Language Support: Google Cloud Vision API for non-English scanned documents
- Transliteration (optional): Convert non-Latin names to Latin equivalents for US forms
How It Works
Language Detection & Processing
Automatic Detection:
When you submit source data:
Language Identification: AI analyzes text to detect language(s)
- "Nombre: Juan García" → Detected: Spanish
- "名前: 田中太郎" → Detected: Japanese
- Mixed text: "Name: José Müller" → Detected: Spanish + German characters
Character Set Analysis: Identifies special characters requiring preservation
- "François" → Requires: ç cedilla
- "Müller" → Requires: ü umlaut
- "Żółć" → Requires: Ż, ó, ł, ć Polish characters
Script Detection: Identifies writing system
- Latin (A-Z), Cyrillic (А-Я), Arabic (ا-ي), CJK (Chinese/Japanese/Korean)
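The script detection step can be approximated with the standard library alone. The sketch below is an illustration, not Instafill.ai's actual detector: it assumes that the first word of a character's Unicode name is a good enough proxy for its script, and the `_SCRIPT_PREFIXES` table and `detect_scripts` function are hypothetical names introduced here.

```python
import unicodedata

# Hypothetical mapping from Unicode character-name prefixes to script labels.
# Grouping kana and Hangul under "CJK" follows the grouping used in this document.
_SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "HEBREW": "Hebrew",
    "CJK": "CJK",
    "HIRAGANA": "CJK",
    "KATAKANA": "CJK",
    "HANGUL": "CJK",
}


def detect_scripts(text: str) -> set:
    """Return the set of script labels found among the letters of `text`."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace carry no script signal
        # unicodedata.name() returns e.g. "LATIN SMALL LETTER E WITH ACUTE"
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        label = _SCRIPT_PREFIXES.get(prefix)
        if label:
            scripts.add(label)
    return scripts


print(detect_scripts("Nombre: Juan García"))    # {'Latin'}
print(detect_scripts("名前: 田中太郎"))           # {'CJK'}
```

Script detection only narrows the candidates: Spanish, French, and German all report "Latin", so identifying the language itself needs a separate model or library.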
Form Language Analysis:
AI analyzes form to determine:
- Form Language: What language are field labels in? (English, Spanish, French, etc.)
- Expected Data Language: What language should data be in? (may differ from form language)
- Mixed Sections: Does form have sections in different languages?
Cross-Language Field Matching
Scenario: English data → Spanish form (or vice versa)
Example:
- Source Data (English): "Name: John Smith, Address: 123 Main St, New York, NY"
- Form Fields (Spanish): "Nombre," "Dirección," "Ciudad," "Estado"
AI Matching:
- Semantic Understanding: AI knows "Nombre" = "Name" in English
- Field Mapping: Maps "John Smith" → "Nombre" field despite language difference
- Location Intelligence: Understands "Ciudad" (city) corresponds to "New York"
- Consistent Filling: Fills form accurately despite language mismatch
Supported Cross-Language Scenarios:
- English data → Spanish/French/German/Portuguese forms
- Spanish/French/German data → English forms
- Any language data → Forms in major world languages
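To make the mapping step concrete, here is a minimal sketch that swaps the AI's semantic matching for a plain lookup table. That substitution is deliberate: a table is easy to read but only knows the labels listed in it, whereas the product is described as understanding equivalence semantically. `LABEL_EQUIVALENTS`, `canonical_key`, and `fill_form` are hypothetical names, not part of any Instafill.ai API.

```python
import unicodedata
from typing import Optional


def _norm(label: str) -> str:
    """NFC-normalize, trim punctuation and whitespace, and case-fold a label."""
    return unicodedata.normalize("NFC", label).strip(" :,_").casefold()


# Hypothetical equivalence table: canonical key -> labels seen on forms.
LABEL_EQUIVALENTS = {
    "name": ["Name", "Nombre", "Nom"],
    "address": ["Address", "Dirección", "Adresse"],
    "city": ["City", "Ciudad", "Ville"],
    "state": ["State", "Estado"],
}

# Invert the table once so each lookup is a single dictionary access.
_LABEL_TO_KEY = {
    _norm(variant): key
    for key, variants in LABEL_EQUIVALENTS.items()
    for variant in variants
}


def canonical_key(label: str) -> Optional[str]:
    """Return the language-neutral key for a form label, or None if unknown."""
    return _LABEL_TO_KEY.get(_norm(label))


def fill_form(form_labels, source):
    """Copy source values (keyed by canonical key) into the matching form labels."""
    filled = {}
    for label in form_labels:
        key = canonical_key(label)
        if key is not None and key in source:
            filled[label] = source[key]
    return filled


source = {"name": "John Smith", "address": "123 Main St", "city": "New York", "state": "NY"}
print(fill_form(["Nombre", "Dirección", "Ciudad", "Estado"], source))
```

Note that the values are copied unchanged: "John Smith" lands in "Nombre" as-is, which matches the behavior described above (label equivalence, not content translation).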
Unicode and Character Encoding
Technical Implementation:
- UTF-8 Throughout: Entire system uses UTF-8 encoding (supports all Unicode characters)
- Database: Unicode-aware database storage (no mojibake or character corruption)
- API: All API endpoints accept/return UTF-8
- PDF Generation: Embeds necessary font glyphs to render special characters correctly in final PDF
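The mojibake mentioned above typically appears when UTF-8 bytes are decoded with a legacy single-byte encoding somewhere along the pipeline. The following sketch reproduces that failure and the lossless round trip that UTF-8 end to end avoids it with; the function names are illustrative only.

```python
def roundtrip_utf8(text: str) -> str:
    """Encode to UTF-8 bytes and decode again; the text must come back unchanged."""
    return text.encode("utf-8").decode("utf-8")


def simulate_mojibake(text: str) -> str:
    """Decode UTF-8 bytes as Latin-1, the classic cause of corrupted accents."""
    return text.encode("utf-8").decode("latin-1")


name = "Jos\u00e9"  # "José" with a precomposed é
print(roundtrip_utf8(name) == name)  # True
print(simulate_mojibake(name))       # JosÃ©
```

The corrupted form arises because é is two bytes in UTF-8 (0xC3 0xA9), and Latin-1 reads each byte as its own character.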
Character Normalization:
Different ways to represent the same character in Unicode:
- Precomposed: é (single character U+00E9)
- Decomposed: e + ´ (two characters: e U+0065 + combining acute accent U+0301)
AI Handling:
- Normalizes to consistent form (typically precomposed, i.e., Unicode Normalization Form C, or NFC)
- Ensures "José" matches "José" even if encoded differently
- Prevents duplicate entries from encoding differences
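Python's standard `unicodedata.normalize` shows the effect directly. This is a sketch of the general technique, assuming NFC (precomposed) as the target form; `normalize_text` is a hypothetical helper name.

```python
import unicodedata


def normalize_text(value: str) -> str:
    """Normalize to NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", value)


precomposed = "Jos\u00e9"   # é as the single code point U+00E9
decomposed = "Jose\u0301"   # e (U+0065) followed by combining acute (U+0301)

print(precomposed == decomposed)                                   # False
print(normalize_text(precomposed) == normalize_text(decomposed))   # True
print(len(decomposed), len(normalize_text(decomposed)))            # 5 4
```

Normalizing at ingestion, before values are stored or compared, is what keeps the two spellings from becoming separate entries.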
Bidirectional Text (Arabic, Hebrew)
Challenge: Arabic and Hebrew read right-to-left (RTL), but numbers and English words within them read left-to-right (LTR).
Example (Arabic):
English: "John lives at 123 Main Street in Cairo"
Arabic: "جون يعيش في 123 شارع مين في القاهرة"
AI Handling:
- Direction Detection: Recognizes RTL language
- Bidirectional Algorithm: Applies Unicode Bidirectional Algorithm (UBA)
- Correct Ordering: Ensures characters display in proper reading order
- PDF Preservation: Sets PDF field direction to RTL; embeds Arabic font in output PDF
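Direction detection can be sketched with the bidirectional classes exposed by Python's `unicodedata` module. The example covers only the first step of the Unicode Bidirectional Algorithm (choosing the paragraph direction from the first strong character); it does not resolve embedding levels or reorder runs, and `base_direction` is a hypothetical name.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew letters are R, Arabic letters are AL
            return "rtl"
        if bidi == "L":           # Latin, Cyrillic, CJK, and other LTR letters
            return "ltr"
        # Digits (EN/AN), spaces, and punctuation are weak or neutral: keep scanning.
    return "ltr"  # no strong character found; default to left-to-right


print(base_direction("جون يعيش في 123 شارع مين في القاهرة"))  # rtl
print(base_direction("John lives in القاهرة"))                 # ltr
```

Because digits are not strong characters, a field that starts with a number followed by Arabic text is still classified as RTL.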
Language-Specific Features
Date Formats:
- US: MM/DD/YYYY (02/16/2024)
- Europe: DD/MM/YYYY (16/02/2024)
- ISO: YYYY-MM-DD (2024-02-16)
- AI detects format from source data, converts to form's expected format
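A small sketch of the conversion, using only `datetime`. The `FORMATS` table and both function names are assumptions made for illustration. The second function shows why detection from a single value is not always possible: some dates parse validly under more than one convention.

```python
from datetime import datetime

# The three conventions listed above, expressed as strptime/strftime patterns.
FORMATS = {"US": "%m/%d/%Y", "EU": "%d/%m/%Y", "ISO": "%Y-%m-%d"}


def convert_date(value: str, source: str, target: str) -> str:
    """Parse `value` in the source convention and re-emit it in the target one."""
    parsed = datetime.strptime(value, FORMATS[source])
    return parsed.strftime(FORMATS[target])


def candidate_formats(value: str) -> list:
    """Return every convention under which `value` is a valid date."""
    matches = []
    for name, pattern in FORMATS.items():
        try:
            datetime.strptime(value, pattern)
            matches.append(name)
        except ValueError:
            pass  # not a valid date in this convention
    return matches


print(convert_date("02/16/2024", "US", "EU"))  # 16/02/2024
print(candidate_formats("02/16/2024"))         # ['US']  (there is no month 16)
print(candidate_formats("03/04/2024"))         # ['US', 'EU']  (ambiguous)
```

For ambiguous values such as 03/04/2024, other context (the source document's locale, or other dates in the same document) is needed to pick the right reading.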
Name Ordering:
- Western: First Name + Last Name (John Smith)
- Chinese/Korean: Last Name + First Name (李明)
- Spanish: Name + Paternal Surname + Maternal Surname (Juan García López)
- AI understands cultural conventions, fills form fields correctly
Address Formats:
- US: Street, City, State, ZIP
- UK: Street, City, Postcode
- Japan: Postal Code, Prefecture, City, Street
- AI adapts to form's address structure
Use Cases
Multilingual support is used by organizations that operate across language boundaries. Multinationals running employee onboarding across 20+ countries fill forms in each local language from a single data profile, government agencies serving non-English-speaking populations auto-fill forms in the applicant's language, and international law firms complete cross-border agreements where client documents arrive in one language but court filings must be submitted in another.
Benefits
- Global Reach: Operate in any country without language barriers
- Accuracy: Eliminate errors from manual translation or transcription
- Compliance: Helps meet language accessibility requirements (for example, Title VI of the Civil Rights Act in the US, and EU language regulations)
- Cost Savings: No need for translation services for routine forms
- Speed: Process multilingual forms at the same speed as English (no translation overhead)
- Cultural Sensitivity: Preserve native names and special characters (respectful, accurate)
- Scalability: Add new languages as needed without system changes
- Legal Validity: Preserve exact names for legal documents (avoid transliteration errors)
- Data Integrity: No character corruption or encoding errors
Security & Privacy
Data Handling:
- No Translation Storage: Source data not sent to external translation APIs — processed internally
- Encrypted Transit: All data encrypted in transit (TLS 1.3) regardless of language
- Encrypted Storage: Text source content encrypted via handle_text_encryption() (PyCryptodome 3.19.0) with workspace-scoped keys
Character Encoding Security:
- Homoglyph Detection: Detects visually similar characters from different scripts (e.g., Cyrillic 'а' vs Latin 'a' used for phishing)
- Validation: Ensures submitted data matches expected character set for field
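One common way to catch homoglyph substitution is to flag tokens that mix letters from more than one script. The sketch below assumes the first word of each character's Unicode name identifies its script; it is an illustration of the idea rather than the product's detector, and `scripts_in` and `is_mixed_script` are hypothetical names.

```python
import unicodedata


def scripts_in(token: str) -> set:
    """Collect the script prefix (e.g. 'LATIN', 'CYRILLIC') of each letter."""
    found = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split(" ", 1)[0])
    return found


def is_mixed_script(token: str) -> bool:
    """True when a single token contains letters from more than one script."""
    return len(scripts_in(token)) > 1


spoofed = "p\u0430ypal"  # second letter is Cyrillic 'а' (U+0430), not Latin 'a'
print(is_mixed_script(spoofed))    # True
print(is_mixed_script("paypal"))   # False
```

A production check would need an allowlist for legitimate combinations: Japanese text, for example, routinely mixes Kanji, Hiragana, and Katakana within one word and would be flagged by this simple rule.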
Access Control:
- All multilingual data scoped to workspaceId and protected via the shared JWT authentication middleware running in both the .NET and Python service layers
Compliance:
- GDPR: Right to erasure applies to data in any language
- HIPAA: PHI protection applies regardless of language
Common Questions
Which languages are supported?
All major world languages, including:
Latin-Based Languages:
- English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Vietnamese, Turkish, Swedish, Norwegian, Danish, Finnish, Hungarian, etc.
Cyrillic Scripts:
- Ukrainian, Bulgarian, Serbian, Macedonian, Belarusian, etc.
Asian Languages:
- Chinese: Simplified (简体) and Traditional (繁體)
- Japanese: Hiragana (ひらがな), Katakana (カタカナ), Kanji (漢字)
- Korean: Hangul (한글)
- Thai: ไทย
- Hindi/Indian Languages: Hindi (हिन्दी), Bengali (বাংলা), Tamil (தமிழ்), Telugu (తెలుగు), etc.
Middle Eastern Languages:
- Arabic (العربية), Hebrew (עברית), Persian (فارسی), Urdu (اردو)
Other Scripts:
- Greek (Ελληνικά), Armenian (Հայերեն), Georgian (ქართული), Amharic (አማርኛ)
Technical Limit: Any language whose script is encoded in Unicode, which covers nearly all writing systems in modern use.
Not Supported: Ancient scripts not in Unicode (e.g., undeciphered scripts), custom proprietary symbols.
Can the AI translate forms from one language to another?
Limited translation capability (field label understanding only, not full translation):
What AI Does:
- Understands Equivalence: Knows "Name" (English) = "Nombre" (Spanish) = "Nom" (French)
- Cross-Language Matching: Can match English data to Spanish form fields
- Field Label Translation: Translates field labels internally for matching
What AI Does NOT Do:
- Full Document Translation: Does not translate an entire form from Spanish to English
- Content Translation: Does not translate data content (if data says "Ingeniero," output says "Ingeniero," not "Engineer")
Example Scenario:
Supported:
- Source data (English): "Name: John Smith"
- Form (Spanish): "Nombre: ______"
- AI Action: Fills "John Smith" into "Nombre" field (understands label equivalence)
NOT Supported (requires a separate translation service):
- Form in Spanish: "Nombre," "Dirección," "Profesión"
- User wants output form in English: "Name," "Address," "Profession"
- AI cannot translate the entire form structure to a different language
Best Practice: Use forms in the language your data is already in. Avoid translation whenever possible.
How are special characters ensured to display correctly in the final PDF?
Font Embedding:
Challenge: PDFs require specific fonts to render special characters. If the font is missing, characters display as □ (tofu) or incorrectly.
Instafill.ai Solution:
Character Detection: AI identifies special characters in your data
- "José" → Requires: é (e with acute accent)
- "Müller" → Requires: ü (u with umlaut)
- "李明" → Requires: Chinese font
Font Selection: Selects appropriate font supporting required characters
- Latin extended: Arial Unicode MS, DejaVu Sans
- Chinese: Noto Sans CJK, SimSun
- Arabic: Noto Sans Arabic, Arial
- Emoji: Noto Emoji
Font Embedding: Embeds necessary font glyphs in final PDF
- PDF contains the character shapes directly, not references to system fonts
- Ensures PDF displays correctly on any device (even if that device lacks the font)
Fallback Handling: If character unsupported by primary font:
- AI tries alternate fonts
- Warns if character truly cannot be rendered (extremely rare)
Result: Final PDF opens correctly on any device, displaying all special characters perfectly.
Technical Note: Font embedding increases PDF file size slightly (typically 50–200 KB per embedded font), but ensures universal compatibility.
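The selection and fallback steps can be pictured as a walk through an ordered font list. In the sketch below the font names come from the list above, but the code point ranges in `FONT_COVERAGE` are simplified assumptions for illustration; a real implementation would read each font's actual character map. `pick_font` and `plan_fonts` are hypothetical names.

```python
from typing import List, Optional, Tuple

# Ordered fallback chain. Ranges are illustrative assumptions, not real font data.
FONT_COVERAGE = [
    ("DejaVu Sans", [(0x0020, 0x024F)]),       # Basic Latin through Latin Extended-B
    ("Noto Sans Arabic", [(0x0600, 0x06FF)]),  # Arabic block
    ("Noto Sans CJK", [(0x4E00, 0x9FFF)]),     # CJK Unified Ideographs
]


def pick_font(ch: str) -> Optional[str]:
    """Return the first font in the chain that covers `ch`, or None."""
    cp = ord(ch)
    for font, ranges in FONT_COVERAGE:
        if any(lo <= cp <= hi for lo, hi in ranges):
            return font
    return None


def plan_fonts(text: str) -> Tuple[List[str], List[str]]:
    """Return (fonts to embed, characters no font can render) for `text`."""
    fonts, missing = [], []
    for ch in text:
        font = pick_font(ch)
        if font is None:
            if ch not in missing:
                missing.append(ch)  # would trigger the "cannot be rendered" warning
        elif font not in fonts:
            fonts.append(font)
    return fonts, missing


print(plan_fonts("Jos\u00e9 \u674e\u660e"))  # (['DejaVu Sans', 'Noto Sans CJK'], [])
```

Only the fonts returned in the first list need to be embedded, which is why output size grows with the number of scripts in the data rather than with the number of fonts available.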
What about right-to-left languages like Arabic and Hebrew?
Full Bidirectional (BiDi) Support:
Challenges with RTL Languages:
- Text reads right-to-left: "Hello" in Arabic is "مرحبا" (read right → left)
- Numbers read left-to-right even within RTL text: "123" reads "123" (not "321")
- Mixed text: "John lives in القاهرة" (English LTR + Arabic RTL in same sentence)
AI Handling:
- Direction Detection: Automatically detects RTL language
- Unicode BiDi Algorithm: Applies standard Unicode Bidirectional Algorithm
- Correct Ordering: Ensures characters display in proper reading order
- PDF RTL Setting: Sets PDF field text direction to RTL; embeds Arabic/Hebrew font in output
Supported RTL Languages:
- Arabic (العربية), Hebrew (עברית), Persian (فارسی), Urdu (اردو), Yiddish (ייִדיש)
Not an Issue: Forms with mixed LTR/RTL (e.g., English form with Arabic data in some fields) handled correctly — each field has independent directionality.
Can I use multilingual data in Profiles?
Yes, fully supported.
Use Case: Company with global workforce creates employee profiles in local languages.
Example Profile (Mexican Employee):
Nombre: María García López
Dirección: Av. Insurgentes Sur 123, Ciudad de México, CDMX 03100
Teléfono: +52 55 1234 5678
Correo: [email protected]
RFC: GALM850101ABC (Mexican tax ID)
Profile Behavior:
- Storage: All data stored with full Unicode support (special characters preserved)
- Reuse: Profile can fill any form (Spanish, English, or other language)
- Cross-Language: Spanish profile data can fill English form fields
- Profile "Dirección" → English form "Address"
- Character Preservation: "García" stays "García" (not corrupted to "Garcia")
Best Practice:
- Single Language per Profile: Keep profile data in one primary language for consistency
- Transliteration Field (optional): Add English version of name for US forms
- Native: "María García López"
- English: "Maria Garcia Lopez" (ASCII-safe)
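For Latin-script names, an ASCII-safe variant like the one above can be derived by decomposing the text and dropping the combining marks. This is a sketch of that one technique, not a general transliterator: it does nothing for letters without a decomposition (such as Polish ł) or for non-Latin scripts. `ascii_safe` is a hypothetical helper name.

```python
import unicodedata


def ascii_safe(name: str) -> str:
    """Strip diacritics by decomposing (NFD) and removing combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(ascii_safe("María García López"))  # Maria Garcia Lopez
print(ascii_safe("Müller"))              # Muller
```

Keeping the native spelling as the primary value and storing the stripped form in a separate field avoids losing the original, which matters for legal documents.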
Multi-Language Profiles (Enterprise):
- Store data in multiple languages within same profile
- System selects appropriate language version for the form being filled