Multilingual Support

Process documents and data in multiple languages with full Unicode support, preserving special characters and handling complex scripts

Overview

Multilingual Support enables Instafill.ai to process forms and source documents in languages beyond English across the entire fill pipeline, from document parsing and AI field extraction to final PDF output. The system handles Latin-based languages (Spanish, French, German, Portuguese), Cyrillic scripts (Russian, Ukrainian, Bulgarian, Serbian), Asian scripts (Chinese, Japanese, Korean), Arabic, Hebrew, and languages with diacritical marks or special characters (Polish, Vietnamese, Turkish). Character integrity is preserved throughout: UTF-8 encoding is used in the database, API layer, and PDF generation, and any font glyphs required to render special characters are embedded directly into the final PDF rather than referenced as system fonts.

This matters in two concrete scenarios. The first is cross-language field matching, where a Spanish-language form ("Nombre," "Dirección," "Ciudad") is filled from English source data, or vice versa; this is handled via semantic understanding of field label equivalence across languages. The second is script-specific rendering, where Arabic and Hebrew text requires right-to-left layout, bidirectional mixed-script handling (the Unicode Bidirectional Algorithm), and RTL-aware PDF field direction settings.

The system also handles common encoding pitfalls: Unicode normalization ensures that "José" stored as precomposed (U+00E9) and "José" stored as decomposed (e + combining acute U+0301) are treated as the same string, preventing duplicate or non-matching entries when source documents use different encoding conventions.

Key Capabilities

100+ Languages Supported: Process forms and data in virtually any written language
Full Unicode Support: UTF-8 encoding throughout the entire system (all 143,000+ Unicode characters)
Special Character Preservation: Diacritics, umlauts, tildes, accents rendered correctly
Complex Scripts: Arabic, Hebrew, Thai, Hindi, Chinese, Japanese, Korean
Bidirectional Text: Right-to-left (RTL) scripts like Arabic/Hebrew handled via Unicode BiDi Algorithm
Mixed-Language Documents: Forms with multiple languages (e.g., English + Spanish sections)
Automatic Language Detection: AI detects language of source data automatically
Cross-Language Matching: Match English field labels to data in other languages (and vice versa)
Character Normalization: Handles different Unicode representations of the same character (é vs e + ´)
Font Embedding: Embeds required font glyphs into final PDF — no dependency on system fonts
OCR Language Support: Google Cloud Vision API for non-English scanned documents
Transliteration (optional): Convert non-Latin names to Latin equivalents for US forms

How It Works

Language Detection & Processing

Automatic Detection:

When you submit source data:

Language Identification: AI analyzes text to detect language(s)
- "Nombre: Juan García" → Detected: Spanish
- "名前: 田中太郎" → Detected: Japanese
- Mixed text: "Name: José Müller" → Detected: Spanish + German characters
Character Set Analysis: Identifies special characters requiring preservation
- "François" → Requires: ç cedilla
- "Müller" → Requires: ü umlaut
- "Żółć" → Requires: Ż, ó, ł, ć Polish characters
Script Detection: Identifies writing system
- Latin (A-Z), Cyrillic (А-Я), Arabic (ا-ي), CJK (Chinese/Japanese/Korean)
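The script detection step can be approximated with Python's standard unicodedata module. This is a minimal sketch, not Instafill.ai's actual detector: detect_scripts is a hypothetical helper, and it assumes the first word of a character's Unicode name (LATIN, CYRILLIC, ARABIC, CJK, and so on) is a good enough proxy for its script. It identifies writing systems only; telling Spanish from French would need a language model on top.

```python
import unicodedata
from collections import Counter


def detect_scripts(text: str) -> Counter:
    """Count letters per script (hypothetical helper, illustration only).

    Assumption: the first word of the Unicode character name
    approximates the script, e.g. "LATIN SMALL LETTER E WITH ACUTE".
    """
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


if __name__ == "__main__":
    print(detect_scripts("Nombre: Juan García"))
    print(detect_scripts("名前: 田中太郎"))
```

A mixed document simply yields more than one key, which is one way the "Mixed text" case above could be flagged.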

Form Language Analysis:

AI analyzes form to determine:

Form Language: What language are field labels in? (English, Spanish, French, etc.)
Expected Data Language: What language should data be in? (may differ from form language)
Mixed Sections: Does form have sections in different languages?

Cross-Language Field Matching

Scenario: English data → Spanish form (or vice versa)

Example:

Source Data (English): "Name: John Smith, Address: 123 Main St, New York, NY"
Form Fields (Spanish): "Nombre," "Dirección," "Ciudad," "Estado"

AI Matching:

Semantic Understanding: AI knows "Nombre" = "Name" in English
Field Mapping: Maps "John Smith" → "Nombre" field despite language difference
Location Intelligence: Understands "Ciudad" (city) corresponds to "New York"
Consistent Filling: Fills form accurately despite language mismatch
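To make the mapping concrete, here is a deliberately simple sketch that replaces the AI's semantic understanding with a hand-written lookup table. Everything in it (LABEL_CONCEPTS, normalize_label, map_fields) is hypothetical and exists only to illustrate the idea of matching labels through a shared concept; the production system is described as using semantic matching, not a dictionary.

```python
import unicodedata

# Hypothetical equivalence table: label (lowercase, no accents) -> concept.
LABEL_CONCEPTS = {
    "name": "name", "nombre": "name", "nom": "name",
    "address": "address", "direccion": "address", "adresse": "address",
    "city": "city", "ciudad": "city", "ville": "city",
    "state": "state", "estado": "state",
}


def normalize_label(label: str) -> str:
    """Lowercase a label, strip accents and trailing punctuation."""
    decomposed = unicodedata.normalize("NFKD", label.casefold())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.strip(" :,.")


def map_fields(source: dict, form_labels: list) -> dict:
    """Fill form labels from source data by matching on shared concepts."""
    by_concept = {}
    for key, value in source.items():
        concept = LABEL_CONCEPTS.get(normalize_label(key))
        if concept is not None:
            by_concept[concept] = value
    filled = {}
    for label in form_labels:
        concept = LABEL_CONCEPTS.get(normalize_label(label))
        # Unknown labels stay empty (None) instead of being guessed.
        filled[label] = by_concept.get(concept) if concept else None
    return filled


if __name__ == "__main__":
    data = {"Name": "John Smith", "Address": "123 Main St",
            "City": "New York", "State": "NY"}
    print(map_fields(data, ["Nombre", "Dirección", "Ciudad", "Estado"]))
```

A lookup table breaks down quickly on synonyms and unseen languages, which is the gap semantic matching is meant to close.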

Supported Cross-Language Scenarios:

English data → Spanish/French/German/Portuguese forms
Spanish/French/German data → English forms
Any language data → Forms in major world languages

Unicode and Character Encoding

Technical Implementation:

UTF-8 Throughout: Entire system uses UTF-8 encoding (supports all Unicode characters)
Database: Unicode-aware database storage (no mojibake or character corruption)
API: All API endpoints accept/return UTF-8
PDF Generation: Embeds necessary font glyphs to render special characters correctly in final PDF
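The value of using UTF-8 at every layer is easiest to see by simulating what happens when one layer disagrees. The sketch below uses only Python built-ins; round_trips and simulate_mojibake are hypothetical helper names, and the Latin-1 mismatch is just one common example of a misconfigured layer.

```python
def round_trips(text: str) -> bool:
    """True if the text survives UTF-8 encoding and decoding unchanged."""
    return text.encode("utf-8").decode("utf-8") == text


def simulate_mojibake(text: str) -> str:
    """Show the corruption produced when UTF-8 bytes are read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    print(round_trips("José Müller Żółć 李明"))
    print(simulate_mojibake("José"))
```

Because Latin-1 maps every byte to a character, this particular corruption is reversible by re-encoding as Latin-1 and decoding as UTF-8; many other mismatches are lossy, which is why a single encoding end to end is the safer design.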

Character Normalization:

Different ways to represent the same character in Unicode:

Precomposed: é (single character U+00E9)
Decomposed: e + ´ (two characters: e U+0065 + combining acute accent U+0301)

AI Handling:

Normalizes to consistent form (typically precomposed)
Ensures "José" matches "José" even if encoded differently
Prevents duplicate entries from encoding differences
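The precomposed and decomposed forms can be compared directly with Python's standard unicodedata module. This is a small sketch under the assumption that NFC (the precomposed normalization form) is the target; the source only says "typically precomposed", and the canonical helper is a hypothetical name.

```python
import unicodedata

PRECOMPOSED = "Jos\u00e9"   # é as the single code point U+00E9
DECOMPOSED = "Jose\u0301"   # e followed by combining acute accent U+0301


def canonical(text: str) -> str:
    """Normalize to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(PRECOMPOSED == DECOMPOSED)                        # raw comparison
    print(canonical(PRECOMPOSED) == canonical(DECOMPOSED))  # after NFC
    print(len(DECOMPOSED), len(canonical(DECOMPOSED)))      # code point counts
```

Normalizing once at ingestion, before values are stored or compared, avoids having to remember to normalize at every comparison site.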

Bidirectional Text (Arabic, Hebrew)

Challenge: Arabic and Hebrew read right-to-left (RTL), but numbers and English words within them read left-to-right (LTR).

Example (Arabic):

English: "John lives at 123 Main Street in Cairo"
Arabic: "جون يعيش في 123 شارع مين في القاهرة"

AI Handling:

Direction Detection: Recognizes RTL language
Bidirectional Algorithm: Applies Unicode Bidirectional Algorithm (UBA)
Correct Ordering: Ensures characters display in proper reading order
PDF Preservation: Sets PDF field direction to RTL; embeds Arabic font in output PDF
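Direction detection can be sketched with the bidirectional classes exposed by Python's unicodedata module. The function below is a simplified version of the "first strong character" rule the Unicode Bidirectional Algorithm uses to pick a paragraph's base direction; it does not perform the reordering itself, and base_direction is a hypothetical helper name rather than part of Instafill.ai's API.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return "rtl" or "ltr" based on the first strongly directional character.

    Simplification: explicit directional formatting characters are ignored,
    and text with no strong characters (e.g. only digits) defaults to "ltr".
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi_class == "L":           # Latin, Cyrillic, CJK, and so on
            return "ltr"
    return "ltr"


if __name__ == "__main__":
    print(base_direction("جون يعيش في 123 شارع مين في القاهرة"))
    print(base_direction("John lives at 123 Main Street in Cairo"))
```

The result of a check like this is the kind of signal that could drive the per-field RTL setting described above.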

Language-Specific Features

Date Formats:

US: MM/DD/YYYY (02/16/2024)
Europe: DD/MM/YYYY (16/02/2024)
ISO: YYYY-MM-DD (2024-02-16)
AI detects format from source data, converts to form's expected format
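Once the source format is known, the conversion itself is mechanical. The sketch below uses Python's datetime module; the FORMATS table and convert_date are hypothetical names, and the source format is passed in explicitly because a value such as 03/04/2024 cannot be classified as US or European from the string alone.

```python
from datetime import datetime

# Hypothetical mapping from the conventions listed above to strptime patterns.
FORMATS = {
    "US": "%m/%d/%Y",
    "EU": "%d/%m/%Y",
    "ISO": "%Y-%m-%d",
}


def convert_date(value: str, source: str, target: str) -> str:
    """Re-render a date string from one convention in another."""
    parsed = datetime.strptime(value, FORMATS[source])
    return parsed.strftime(FORMATS[target])


if __name__ == "__main__":
    print(convert_date("02/16/2024", "US", "EU"))
    print(convert_date("02/16/2024", "US", "ISO"))
```

A day value above 12, as in 16/02/2024, is one of the few cues that lets the format be inferred from the data itself.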

Name Ordering:

Western: First Name + Last Name (John Smith)
Chinese/Korean: Last Name + First Name (李明)
Spanish: Name + Paternal Surname + Maternal Surname (Juan García López)
AI understands cultural conventions, fills form fields correctly
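The three conventions can be expressed as a small formatting function over structured name parts. This is an illustrative sketch only: format_full_name and its convention labels are hypothetical, and it assumes the name has already been split into given and family parts, which is the harder problem in practice.

```python
def format_full_name(given: str, family: str, convention: str,
                     second_family: str = "") -> str:
    """Join name parts in the order a given convention expects."""
    if convention == "western":
        return f"{given} {family}"
    if convention == "eastern":
        # Family name first, written without a space in Chinese or Korean script.
        return f"{family}{given}"
    if convention == "spanish":
        # Given name, paternal surname, then maternal surname if present.
        parts = (given, family, second_family)
        return " ".join(part for part in parts if part)
    raise ValueError(f"unknown convention: {convention}")


if __name__ == "__main__":
    print(format_full_name("John", "Smith", "western"))
    print(format_full_name("明", "李", "eastern"))
    print(format_full_name("Juan", "García", "spanish", "López"))
```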

Address Formats:

US: Street, City, State, ZIP
UK: Street, City, Postcode
Japan: Postal Code, Prefecture, City, Street
AI adapts to form's address structure

Use Cases

Multilingual support is used by organizations that operate across language boundaries. Multinationals running employee onboarding across 20+ countries fill forms in each local language from a single data profile. Government agencies serving non-English-speaking populations auto-fill forms in the applicant's language. International law firms complete cross-border agreements where client documents arrive in one language but court filings must be submitted in another.

Benefits

Global Reach: Operate in any country without language barriers
Accuracy: Eliminate errors from manual translation or transcription
Compliance: Meet language accessibility requirements (US: Title VI, EU: Language Regulations)
Cost Savings: No need for translation services for routine forms
Speed: Process multilingual forms at the same speed as English (no translation overhead)
Cultural Sensitivity: Preserve native names and special characters (respectful, accurate)
Scalability: Add new languages as needed without system changes
Legal Validity: Preserve exact names for legal documents (avoid transliteration errors)
Data Integrity: No character corruption or encoding errors

Security & Privacy

Data Handling:

No Translation Storage: Source data not sent to external translation APIs — processed internally
Encrypted Transit: All data encrypted in transit (TLS 1.3) regardless of language
Encrypted Storage: Text source content encrypted via handle_text_encryption() (PyCryptodome 3.19.0) with workspace-scoped keys

Character Encoding Security:

Homoglyph Detection: Detects visually similar characters from different scripts (e.g., Cyrillic 'а' vs Latin 'a' used for phishing)
Validation: Ensures submitted data matches expected character set for field
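One common way to flag homoglyph substitutions is to look for words that mix scripts. The sketch below is an assumption about how such a check could work, not a description of Instafill.ai's detector; scripts_in_word and is_mixed_script are hypothetical helpers that approximate a character's script by the first word of its Unicode name.

```python
import unicodedata


def scripts_in_word(word: str) -> set:
    """Return the set of scripts used by the letters of a word."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in word
        if ch.isalpha()
    }


def is_mixed_script(word: str) -> bool:
    """True if a single word draws letters from more than one script."""
    return len(scripts_in_word(word)) > 1


if __name__ == "__main__":
    spoofed = "p\u0430ypal"   # second letter is Cyrillic а (U+0430)
    print(is_mixed_script(spoofed))
    print(is_mixed_script("paypal"))
```

Mixed scripts are legitimate across a whole field (an English sentence containing an Arabic name), so a check like this is best applied per word rather than per value.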

Access Control:

All multilingual data scoped to workspaceId and protected via the shared JWT authentication middleware running in both the .NET and Python service layers

Compliance:

GDPR: Right to erasure applies to data in any language
HIPAA: PHI protection applies regardless of language

Common Questions

Which languages are supported?

All major world languages, including:

Latin-Based Languages:

English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Czech, Romanian, Vietnamese, Turkish, Swedish, Norwegian, Danish, Finnish, Hungarian, etc.

Cyrillic Scripts:

Russian, Ukrainian, Bulgarian, Serbian, Macedonian, Belarusian, etc.

Asian Languages:

Chinese: Simplified (简体) and Traditional (繁體)
Japanese: Hiragana (ひらがな), Katakana (カタカナ), Kanji (漢字)
Korean: Hangul (한글)
Thai: ไทย
Hindi/Indian Languages: Hindi (हिन्दी), Bengali (বাংলা), Tamil (தமிழ்), Telugu (తెలుగు), etc.

Middle Eastern Languages:

Arabic (العربية), Hebrew (עברית), Persian (فارسی), Urdu (اردو)

Other Scripts:

Greek (Ελληνικά), Armenian (Հայերեն), Georgian (ქართული), Amharic (አማርኛ)

Technical Limit: Any language whose script is encoded in Unicode, which covers virtually all writing systems in modern use.

Not Supported: Ancient scripts not in Unicode (e.g., undeciphered scripts), custom proprietary symbols.

Can the AI translate forms from one language to another?

Limited translation capability (field label understanding only, not full translation):

What AI Does:

Understands Equivalence: Knows "Name" (English) = "Nombre" (Spanish) = "Nom" (French)
Cross-Language Matching: Can match English data to Spanish form fields
Field Label Translation: Translates field labels internally for matching

What AI Does NOT Do:

Full Document Translation: Does not translate an entire form from Spanish to English
Content Translation: Does not translate data content (if data says "Ingeniero," output says "Ingeniero," not "Engineer")

Example Scenario:

Supported:

Source data (English): "Name: John Smith"
Form (Spanish): "Nombre: ______"
AI Action: Fills "John Smith" into "Nombre" field (understands label equivalence)

NOT Supported (requires a separate translation service):

Form in Spanish: "Nombre," "Dirección," "Profesión"
User wants output form in English: "Name," "Address," "Profession"
AI cannot translate the entire form structure to a different language

Best Practice: Use forms in the language your data is already in. Avoid translation whenever possible.

How are special characters ensured to display correctly in the final PDF?

Font Embedding:

Challenge: PDFs require specific fonts to render special characters. If the font is missing, characters display as □ (tofu) or incorrectly.

Instafill.ai Solution:

Character Detection: AI identifies special characters in your data
- "José" → Requires: é (e with acute accent)
- "Müller" → Requires: ü (u with umlaut)
- "李明" → Requires: Chinese font
Font Selection: Selects appropriate font supporting required characters
- Latin extended: Arial Unicode MS, DejaVu Sans
- Chinese: Noto Sans CJK, SimSun
- Arabic: Noto Sans Arabic, Arial
- Emoji: Noto Emoji
Font Embedding: Embeds necessary font glyphs in final PDF
- PDF contains the character shapes directly, not references to system fonts
- Ensures PDF displays correctly on any device (even if that device lacks the font)
Fallback Handling: If character unsupported by primary font:
- AI tries alternate fonts
- Warns if character truly cannot be rendered (extremely rare)
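The selection and fallback steps can be pictured as walking an ordered font chain until some font covers each character. The sketch below is heavily simplified: FONT_CHAIN, its code point ranges, pick_font, and plan_fonts are all hypothetical, and the ranges are rough illustrations rather than the real coverage of these fonts. A real implementation would read coverage from each font file's character map.

```python
from typing import List, Optional, Tuple

# Hypothetical coverage table: (font name, inclusive code point ranges).
FONT_CHAIN = [
    ("DejaVu Sans", [(0x0020, 0x024F)]),        # Basic Latin to Latin Extended-B
    ("Noto Sans Arabic", [(0x0600, 0x06FF)]),   # Arabic block
    ("Noto Sans CJK", [(0x3040, 0x30FF),        # Hiragana and Katakana
                       (0x4E00, 0x9FFF),        # CJK Unified Ideographs
                       (0xAC00, 0xD7A3)]),      # Hangul syllables
]


def pick_font(ch: str) -> Optional[str]:
    """Return the first font in the chain that covers the character."""
    code_point = ord(ch)
    for font, ranges in FONT_CHAIN:
        if any(low <= code_point <= high for low, high in ranges):
            return font
    return None


def plan_fonts(text: str) -> Tuple[List[str], List[str]]:
    """List the fonts to embed and any characters no font can render."""
    fonts: List[str] = []
    missing: List[str] = []
    for ch in text:
        font = pick_font(ch)
        if font is None:
            if ch not in missing:
                missing.append(ch)   # candidates for a rendering warning
        elif font not in fonts:
            fonts.append(font)
    return fonts, missing


if __name__ == "__main__":
    print(plan_fonts("José 李明"))
```

The list of missing characters corresponds to the warning case above: anything left there has no font in the chain that can draw it.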

Result: Final PDF opens correctly on any device, displaying all special characters perfectly.

Technical Note: Font embedding increases PDF file size slightly (typically 50–200 KB per embedded font), but ensures universal compatibility.

What about right-to-left languages like Arabic and Hebrew?

Full Bidirectional (BiDi) Support:

Challenges with RTL Languages:

Text reads right-to-left: "Hello" in Arabic is "مرحبا" (read right → left)
Numbers read left-to-right even within RTL text: "123" reads "123" (not "321")
Mixed text: "John lives in القاهرة" (English LTR + Arabic RTL in same sentence)

AI Handling:

Direction Detection: Automatically detects RTL language
Unicode BiDi Algorithm: Applies standard Unicode Bidirectional Algorithm
Correct Ordering: Ensures characters display in proper reading order
PDF RTL Setting: Sets PDF field text direction to RTL; embeds Arabic/Hebrew font in output

Supported RTL Languages:

Arabic (العربية), Hebrew (עברית), Persian (فارسی), Urdu (اردو), Yiddish (ייִדיש)

Not an Issue: Forms with mixed LTR/RTL (e.g., English form with Arabic data in some fields) handled correctly — each field has independent directionality.

Can I use multilingual data in Profiles?

Yes, fully supported.

Use Case: Company with global workforce creates employee profiles in local languages.

Example Profile (Mexican Employee):

Nombre: María García López
Dirección: Av. Insurgentes Sur 123, Ciudad de México, CDMX 03100
Teléfono: +52 55 1234 5678
Correo: [email protected]
RFC: GALM850101ABC (Mexican tax ID)

Profile Behavior:

Storage: All data stored with full Unicode support (special characters preserved)
Reuse: Profile can fill any form (Spanish, English, or other language)
Cross-Language: Spanish profile data can fill English form fields
- Profile "Dirección" → English form "Address"
Character Preservation: "García" stays "García" (not corrupted to "Garcia")

Best Practice:

Single Language per Profile: Keep profile data in one primary language for consistency
Transliteration Field (optional): Add English version of name for US forms
- Native: "María García López"
- English: "Maria Garcia Lopez" (ASCII-safe)
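For Latin-script names, an accent-free variant like the one above can be derived by decomposing each character and dropping the combining marks. This is a minimal sketch using Python's unicodedata module; strip_diacritics is a hypothetical helper, and it is not full transliteration: letters without a canonical decomposition (such as Polish ł) and non-Latin scripts pass through unchanged.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining accents while keeping the base letters."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_diacritics("María García López"))
    print(strip_diacritics("Müller"))
```

Because the result can differ from the spelling on official documents, it is safer to store it alongside the native name rather than in place of it.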

Multi-Language Profiles (Enterprise):

Store data in multiple languages within same profile
System selects appropriate language version for the form being filled

Related Features

Profile Management

Autofill from Multiple Sources

AI-Powered Field Filling
