Decoding Cross-Lingual Responses: Why Your AI Assistant Switches from Chinese to Korean and How to Fix It

Overview

Have you ever typed a prompt in Chinese to your coding assistant, only to get a reply in Korean? This puzzling behavior is more than a glitch—it’s a window into how large language models (LLMs) handle multilingual input, especially when code vocabulary reshapes the embedding space. In this tutorial, we’ll explore the mechanics behind such cross-lingual responses, then build a practical solution to detect and prevent unwanted language switches. By the end, you’ll understand embedding spaces, token overlap, and how to fine-tune your assistant for consistent language output.

Source: towardsdatascience.com

Prerequisites

To follow along, you'll need:

- Python 3.8 or later
- Basic familiarity with Python and transformer-based language models
- An LLM coding assistant you can call programmatically (any wrapper with a generate-style method works)

Install dependencies:

pip install transformers torch sentence-transformers

Step-by-Step Instructions

Step 1: Understand Embeddings and Language Overlap

LLMs like GPT or CodeLlama represent every token as a vector in a high-dimensional embedding space. When you mix languages—especially in coding contexts—tokens from different languages can occupy similar regions due to overlapping semantics (e.g., common programming keywords like print()). This similarity can cause the model to produce tokens from a different language than expected.

For example, the embeddings for 打印变量 (Chinese for "print variable"), 변수 출력 (Korean for "variable output"), and the English phrase "print variable" all sit close together, pulled into the same neighborhood by the shared programming concept.

If your prompt contains code, the model may anchor to a region where the Chinese and Korean embeddings intersect, and decoding can drift into a Korean response.
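
To make "similar regions" concrete, here is a minimal, standard-library sketch of cosine similarity, the usual closeness measure for embeddings. The 3-dimensional vectors are made up purely for illustration; real embedding vectors have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-d "embeddings": two nearby vectors and one distant one.
zh_print = [0.9, 0.1, 0.2]   # stand-in for a Chinese "print" phrase
ko_print = [0.8, 0.2, 0.2]   # stand-in for the Korean equivalent
unrelated = [0.0, 1.0, -0.5]

print(cosine(zh_print, ko_print))   # close to 1.0
print(cosine(zh_print, unrelated))  # much lower
```

When two phrases score near 1.0, the model treats them as nearly interchangeable at the representation level, which is exactly what sets up a language switch.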

Step 2: Inspect the Embedding Space

We’ll use sentence-transformers to visualize token similarities. Run this Python script:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "打印变量",           # Chinese
    "변수 출력",          # Korean
    "print variable",     # English
    "int main()"          # Code snippet
]

embeddings = model.encode(sentences, normalize_embeddings=True)
similarities = embeddings @ embeddings.T  # cosine similarity (vectors are unit-normalized)
print(similarities)

You'll likely see relatively high similarity between the Chinese and Korean programming phrases, driven by their shared semantics. This overlap is the root cause of language switching.

Step 3: Detect Language Switch in Real Time

Build a detection function that monitors the assistant’s output language. We’ll use langdetect (or a simple character-range check). First, install it:

pip install langdetect

Then implement a wrapper for your assistant:

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def check_language(text):
    try:
        lang = detect(text)
    except LangDetectException:
        # langdetect raises on empty or feature-less input (e.g. pure code)
        lang = 'unknown'
    return lang

# Example: when you send a Chinese prompt, check if response language changes
prompt = "如何在Python中打印变量?"   # Chinese
response = assistant.generate(prompt)  # your model call
lang_resp = check_language(response[:50])  # check first 50 chars
if lang_resp == 'ko':
    print("ALERT: Language switch to Korean detected!")
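
The "simple character-range check" mentioned above needs no dependencies at all: Hangul syllables occupy Unicode block U+AC00–U+D7A3, while CJK Unified Ideographs sit at U+4E00–U+9FFF. A sketch (note that CJK ideographs also appear in Japanese and in Korean hanja, so treat "zh" here as "CJK script", not a definitive language ID):

```python
def script_of(text: str) -> str:
    """Classify a string by counting characters in known Unicode blocks."""
    counts = {"ko": 0, "zh": 0, "other": 0}
    for ch in text:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3:      # Hangul syllables
            counts["ko"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs
            counts["zh"] += 1
        elif ch.isalpha():
            counts["other"] += 1
    return max(counts, key=counts.get)

print(script_of("변수 출력"))    # ko
print(script_of("打印变量"))    # zh
```

This is faster and more predictable than a statistical detector on short strings, which is exactly where langdetect is weakest.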

Step 4: Fix with Context Reinforcement

Prevent language switching by adding explicit language instructions in your system prompt. For example:

system_prompt = "You are a helpful coding assistant. Always respond in the same language as the user's last message. If the user writes in Chinese, reply in Chinese."
response = assistant.generate(system_prompt + "\n" + user_input)
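
The same idea can be packaged as a small helper. This is a sketch: the language name is passed in explicitly (for example, derived from the detector in Step 3), and `assistant.generate` stands for whatever single-prompt call your wrapper exposes.

```python
def reinforced_prompt(user_input: str, lang_name: str) -> str:
    """Build a prompt that pins the response language to lang_name."""
    system = (
        "You are a helpful coding assistant. "
        f"Always respond in {lang_name}, the same language as the user's last message."
    )
    return system + "\n" + user_input

# Usage (assuming `assistant` is your model wrapper from earlier):
# response = assistant.generate(reinforced_prompt("如何在Python中打印变量?", "Chinese"))
```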

Alternatively, use logit bias to suppress tokens from undesired languages. Here’s a snippet using Hugging Face transformers:

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    LogitsProcessor,
    LogitsProcessorList,
)
import torch

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

# Tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# Identify the token IDs to suppress. The range below is only a placeholder:
# inspect your tokenizer's vocabulary to find which IDs decode to Hangul.
korean_ids = list(range(50000, 52000))  # placeholder

class SuppressTokens(LogitsProcessor):
    """Subtract a large constant from the logits of the given token IDs."""
    def __init__(self, token_ids, penalty=100.0):
        self.token_ids = token_ids
        self.penalty = penalty

    def __call__(self, input_ids, scores):
        scores[:, self.token_ids] -= self.penalty
        return scores

outputs = model.generate(
    **inputs,
    logits_processor=LogitsProcessorList([SuppressTokens(korean_ids)]),
    max_new_tokens=200,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
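
Detection (Step 3) and context reinforcement can also be combined into a retry loop: if the response comes back in the wrong language, regenerate with a stronger instruction. A minimal sketch, written against two injected callables (`generate` and `detect_lang` are assumptions standing in for your assistant call and your detector, not real library APIs):

```python
def generate_with_guard(generate, detect_lang, prompt, expected_lang, max_retries=2):
    """Regenerate with an explicit instruction when the detected language differs."""
    response = generate(prompt)
    for _ in range(max_retries):
        if detect_lang(response) == expected_lang:
            return response
        # Escalate: prepend a strict language instruction and try again.
        prompt = f"Respond strictly in {expected_lang}.\n{prompt}"
        response = generate(prompt)
    return response

# Usage (assuming the Step 3 pieces):
# response = generate_with_guard(assistant.generate, check_language, prompt, "zh-cn")
```

Capping retries keeps latency bounded when the model is stubbornly anchored to the wrong language.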

Step 5: Train Explicit Language Embedding

For a permanent solution, fine-tune the model with language-annotated data. Collect paired examples where the language tag is prepended. Example training data:

"[LANG_ZH] 打印变量" → "打印变量"      (Chinese: "print variable")
"[LANG_KO] 변수 출력" → "변수 출력"     (Korean: "variable output")
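
Building the tagged strings is mechanical; here is a small hypothetical helper following the [LANG_XX] convention above:

```python
def tag_example(lang: str, text: str) -> str:
    """Prepend a [LANG_XX] tag so the model can condition on the target language."""
    return f"[LANG_{lang.upper()}] {text}"

pairs = [("zh", "打印变量"), ("ko", "변수 출력"), ("en", "print variable")]
tagged = [tag_example(lang, text) for lang, text in pairs]
print(tagged[0])  # [LANG_ZH] 打印变量
```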

Fine-tune using the standard causal LM loss so the model learns to associate each tag with its output language. If you use bracketed tags like [LANG_ZH], register them as special tokens (tokenizer.add_special_tokens) and call model.resize_token_embeddings(len(tokenizer)) so each tag gets its own learned embedding instead of being split into sub-tokens. Use Trainer from Hugging Face:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=multi_lang_dataset,  # your custom dataset
)
trainer.train()

Common Mistakes

Here are pitfalls to avoid:

- Swallowing every exception in the detector: langdetect raises on empty or code-only text, so handle that case explicitly instead of using a bare except.
- Running language detection on very short strings: a few characters are not enough signal, so check a reasonable window of the response.
- Hard-coding token ID ranges for logit bias: vocabulary layouts differ between tokenizers, so always inspect your tokenizer's vocabulary first.
- Forgetting to register language tags as special tokens before fine-tuning, which lets the tokenizer split them into meaningless pieces.

Summary

Language switching in coding assistants happens because code vocabulary merges embedding spaces across languages. By detecting the switch, reinforcing language context, and optionally fine-tuning, you can ensure consistent responses. This tutorial gave you a hands-on path from theory to implementation—now you can debug your AI assistant when it starts replying in Korean to your Chinese prompts.
