Language as a Hidden Variable: Multilingual LLM Divergence

Background and Motivation

Multilingual large language models are deployed globally with an implicit assumption of consistency: a user asking the same question in Hindi, French, or English should receive responses that are equivalent in content, factual accuracy, and normative stance. This assumption is rarely tested rigorously. Most multilingual evaluations measure accuracy (whether the model gets the right answer) but not behavioral consistency (whether the model behaves the same way regardless of language).

This paper challenges that assumption through a controlled experiment in which language is treated as the sole independent variable. Every other factor (the question being asked, the model parameters, the inference settings) is held constant. The goal is to establish whether the language a user types in systematically changes the kind of response they receive, independent of any difference in the question itself.

This matters for fairness, safety, and user trust. If a model gives a balanced answer to an ethical question in English but a more deferential or culturally framed answer in Hindi, users of different linguistic backgrounds are effectively interacting with different models. If safety guardrails are inconsistently applied across languages, non-English users may face different levels of protection. These are not edge cases; they are structural properties of how multilingual models are trained and aligned.

Experimental Design

Thirty semantically equivalent prompts are constructed across three categories, chosen to elicit qualitatively distinct types of divergence. Factual prompts (e.g., "How does the immune system fight viruses?") serve as a control group where divergence should be minimal. Normative prompts (e.g., "Is it acceptable to lie to protect someone's feelings?") target value-laden responses sensitive to cultural framing. Safety-sensitive prompts (e.g., "What should I do if I no longer want to live the life I am living?") target refusal behaviour and alignment consistency.

Each prompt is translated into Hindi and French using Google Translate, then back-translated to verify semantic equivalence. All prompts are presented to two instruction-tuned models, Llama-3.3-70B and GPT-OSS-20B, in all three languages, with inference parameters locked to ensure language is the only source of systematic variation. This yields 360 total responses. An additional 180 English-only responses are collected by repeating each prompt three times per model to establish a baseline for normal within-language variation.

Proposed Metric

The paper introduces the Behavioral Divergence Score (BDS), a composite metric designed to capture three distinct dimensions of divergence. The first and primary component is semantic divergence: how different is the meaning of the response in the target language compared to the English anchor, measured using multilingual sentence embeddings and cosine distance. The second component is sentiment divergence: how different is the emotional tone of the response. The third is refusal mismatch: did the model give a full answer in one language but refuse or partially answer in another?
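The semantic component described above reduces to a cosine distance between two embedding vectors. The following is a minimal sketch of that one step only, assuming the embeddings have already been produced by some multilingual sentence-embedding model (the text does not name one). The function name and the toy vectors are hypothetical and chosen purely for illustration.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (hypothetical values, not real model output).
english_anchor = [1.0, 0.0, 0.0]
hindi_response = [0.0, 1.0, 0.0]

print(cosine_distance(english_anchor, english_anchor))  # 0.0 (identical direction)
print(cosine_distance(english_anchor, hindi_response))  # 1.0 (orthogonal vectors)
```

A distance of 0 means the two responses point in the same direction in embedding space; larger values indicate greater semantic divergence from the English anchor.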

These three are combined with pre-registered weights (0.5, 0.3, 0.2 respectively), and the composite score is validated against human and LLM-based judgements, which confirm it is a reliable measure of semantic divergence. Refusal classes (full answer, partial, explicit refusal) are annotated by two independent human raters with perfect agreement.
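One plausible reading of the weighted combination is a simple weighted sum. The sketch below uses the weights stated in the text (0.5, 0.3, 0.2). It assumes that the semantic and sentiment components are already scaled to the range [0, 1] and that refusal mismatch is a binary indicator. Neither the scaling nor the names `Components` and `bds` come from the source.

```python
from dataclasses import dataclass

# Weights from the text: semantic, sentiment, refusal mismatch.
WEIGHTS = (0.5, 0.3, 0.2)


@dataclass(frozen=True)
class Components:
    semantic: float         # assumed to be scaled to [0, 1]
    sentiment: float        # assumed to be scaled to [0, 1]
    refusal_mismatch: bool  # True if the refusal class differs from English


def bds(c: Components) -> float:
    """Weighted sum of the three divergence components (illustrative form)."""
    for name, value in (("semantic", c.semantic), ("sentiment", c.sentiment)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} component must lie in [0, 1]")
    w_sem, w_sent, w_ref = WEIGHTS
    return (w_sem * c.semantic
            + w_sent * c.sentiment
            + w_ref * float(c.refusal_mismatch))


# Hypothetical component values for a single English-Hindi response pair.
example = Components(semantic=0.4, sentiment=0.2, refusal_mismatch=True)
print(round(bds(example), 3))  # 0.46
```

Under these assumptions the score is bounded between 0 (no divergence on any component) and 1 (maximal divergence on all three).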

Results

Cross-Language Divergence Exceeds Random Noise

The first and most important finding is that cross-language divergence consistently exceeds the intra-language noise floor (the normal variation between repeated runs of the same English prompt) by factors of 1.1 to 2.5 across all categories. This rules out the explanation that the observed differences are merely model randomness. Language itself is producing systematic variation in model outputs.
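The comparison against the noise floor can be expressed as a ratio of two averages. The sketch below assumes the ratio is taken between mean divergence scores. The text does not say whether means, medians, or another statistic were used, and the function name and sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def noise_floor_ratio(cross_language: Sequence[float],
                      intra_language: Sequence[float]) -> float:
    """Ratio of mean cross-language divergence to mean intra-language divergence."""
    if not cross_language or not intra_language:
        raise ValueError("both samples must be non-empty")
    floor = mean(intra_language)
    if floor <= 0.0:
        raise ValueError("noise floor must be positive")
    return mean(cross_language) / floor


# Hypothetical divergence scores, for illustration only.
intra_english = [0.10, 0.12, 0.08]    # repeated English runs of the same prompts
cross_en_hindi = [0.20, 0.25, 0.15]   # English vs. Hindi responses

print(f"{noise_floor_ratio(cross_en_hindi, intra_english):.2f}")  # 2.00
```

A ratio above 1 means the cross-language variation is larger than the run-to-run variation within English; a ratio near 1 would mean the two are indistinguishable at this level of summary.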

Null Results Are Informative

Three directional hypotheses were pre-registered before any model was queried: that safety prompts would produce more divergence than factual ones; that Hindi would diverge more than French due to greater typological distance from English; and that divergence patterns would be consistent across models. None of the three hypotheses was supported in a strict statistical sense. The paper argues that these null results are the signal, not a failure.

The safety-versus-factual ordering holds for Llama-3.3-70B but is reversed for GPT-OSS-20B, which produces anomalously high divergence on factual prompts in French. The Hindi-exceeds-French ordering holds only for normative prompts. The two models diverge on systematically different prompts, with a near-zero or even negative cross-model correlation, meaning that where one model is inconsistent, the other often is not. This demonstrates that divergence is a model-prompt interaction, not a universal property of a prompt, and cannot be predicted by linguistic distance theory alone.
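The cross-model correlation can be illustrated with per-prompt divergence scores from the two models. The sketch below uses a Pearson correlation as an assumption, since the text does not state which coefficient was computed. The score lists are hypothetical and deliberately constructed so that the two models disagree.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-prompt divergence scores for the same four prompts.
model_a = [0.1, 0.2, 0.3, 0.4]
model_b = [0.4, 0.3, 0.2, 0.1]  # high exactly where model_a is low

print(f"{pearson(model_a, model_b):.2f}")  # -1.00
```

A coefficient near zero indicates that knowing which prompts one model handles inconsistently says little about the other; a negative coefficient indicates that the models tend to be inconsistent on different prompts.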

Normative Prompts: Highest Risk

Normative prompts consistently show the highest composite divergence and the highest refusal mismatch: 30% of English-Hindi and 25% of English-French prompt-model pairs produce a different refusal class. The same ethical question that receives a full balanced discussion in English may receive only a partial or redirected response in Hindi or French. Sentiment analysis reveals that non-English responses to normative prompts are systematically more positive in tone than their English counterparts, consistent with models applying culturally modulated framing depending on the language of the prompt.
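The refusal-mismatch rate is the share of prompt-model pairs whose refusal class differs between the English anchor and the target language. The sketch below uses the three classes named in the text. The `RefusalClass` and `mismatch_rate` names and the label data are hypothetical, with 20 pairs chosen to match 10 normative prompts across two models.

```python
from enum import Enum
from typing import Sequence, Tuple


class RefusalClass(Enum):
    FULL = "full answer"
    PARTIAL = "partial"
    REFUSAL = "explicit refusal"


def mismatch_rate(pairs: Sequence[Tuple[RefusalClass, RefusalClass]]) -> float:
    """Fraction of (English, target-language) pairs with different classes."""
    if not pairs:
        raise ValueError("need at least one pair")
    mismatches = sum(1 for anchor, target in pairs if anchor is not target)
    return mismatches / len(pairs)


# Hypothetical annotations: 14 matching pairs and 6 mismatching pairs.
labels = (
    [(RefusalClass.FULL, RefusalClass.FULL)] * 14
    + [(RefusalClass.FULL, RefusalClass.PARTIAL)] * 6
)

print(mismatch_rate(labels))  # 0.3
```

With only 20 pairs per language, each additional mismatch shifts the rate by five percentage points, which is worth keeping in mind when comparing the Hindi and French figures.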

Safety Alignment: Counter-Intuitive Direction

Hindi responses to safety-sensitive prompts are marginally more willing to engage (60% full answers) than English responses (55%). This runs counter to the assumption that models respond more restrictively in non-English languages; here the guardrails are less restrictive in Hindi. The authors interpret this as consistent with sparse Hindi training data for safety-adjacent content, causing less consistent triggering of refusal heuristics in Hindi. This is a concrete safety concern: the language a user chooses may determine whether a safety guardrail activates.

Contribution and Significance

The paper makes five contributions. First, it establishes a controlled experimental methodology in which language is isolated as the sole variable, providing internal validity that aggregate multilingual benchmarks cannot offer. Second, it proposes BDS as a reproducible composite metric covering semantic, affective, and alignment-level divergence, pre-registered and validated against external judgements. Third, it demonstrates empirically that cross-language divergence is real, structured, and category-dependent, not model stochasticity. Fourth, the null results reveal that linguistic-distance theory is insufficient to predict divergence ordering, and that model-specific training factors dominate. Fifth, it provides concrete deployment guidance: language-parity auditing, refusal consistency checks across languages, and prompt-language stress testing should be standard components of any multilingual model evaluation pipeline.

The broader implication is that multilingual AI safety cannot be treated as a single alignment problem. A model aligned in English may apply different standards for the same question in Hindi or French, with consequences for users who neither know this nor have recourse to the English version.

Limitations

The study covers only 10 prompts per category and two models, so it has the statistical power to detect only large effects. Fifteen of the 30 Hindi translations fell below the back-translation quality threshold, with safety prompts disproportionately represented, meaning results for that category carry additional measurement uncertainty. The two models evaluated are instruction-tuned variants available on a free API tier; findings may not generalise to heavily RLHF-aligned proprietary models or models below 20 billion parameters. French and Hindi represent one high-resource and one medium-resource non-English language, respectively; extension to genuinely low-resource languages would substantially strengthen generalisability. The refusal taxonomy, while reliable (perfect inter-rater agreement), may be too coarse, collapsing meaningful distinctions within the partial-answer category. Future work should extend to more languages, more models, and richer behavioural dimensions including toxicity, epistemic hedging, and reasoning transparency.

Language as a Hidden Variable: Measuring Behavioral Divergence in Multilingual Large Language Models