Abstract:
How, why, and when do multilingual language models generalize? What does this look like mechanistically? And do they pick up an "accent" when trained across tongues? We begin in Classics with a purpose-built, trilingual model zoo (Ancient Greek, Latin, English): nine matched models across encoder, decoder, and encoder-decoder architectures. This controlled design lets us study multilingual learning cleanly. We outline data pipelines, pre-training, and benchmarks, then use fine-tuning and probes to quantify cross-lingual generalization and to test for stylistic accent in generation, alongside concrete use cases for ancient-language NLP.
Armed with these results, we zoom out to widely used models (e.g., BLOOM) and focus on the learning trajectory over pre-training. We track how representations evolve from an early, language-separated organization toward a later, shared multilingual space, using neuron-level analyses to show the shift from language-specific features to cross-lingual abstractions, consistent with compression dynamics. Text generation provides behavioral evidence for this picture and connects the internal space to observable outputs.
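As a hedged illustration of the kind of probing the abstract describes (the checkpoint name, pooling choice, and probe setup below are illustrative assumptions, not the speaker's actual code): one can train a simple language-ID classifier on hidden states at successive pre-training checkpoints, where falling probe accuracy at later checkpoints would indicate a shift toward a shared multilingual space.

```python
# Minimal sketch of a language-ID probe over pre-training checkpoints.
# Checkpoint names and the mean-pooling choice are illustrative assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModel, AutoTokenizer

def pooled_states(model_name, sentences, layer=-1):
    """Mean-pooled hidden states from one layer, one row per sentence."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    feats = []
    with torch.no_grad():
        for s in sentences:
            out = model(**tok(s, return_tensors="pt"))
            feats.append(out.hidden_states[layer].mean(dim=1).squeeze(0).numpy())
    return np.stack(feats)

def language_separability(model_name, sentences, lang_labels):
    """Probe accuracy: how linearly separable the languages are in this space."""
    X = pooled_states(model_name, sentences)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, lang_labels, test_size=0.3, random_state=0, stratify=lang_labels)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)

# Run the same probe at early vs. late checkpoints (hypothetical list):
# for ckpt in ["bigscience/bloom-560m"]:  # swap in intermediate checkpoints
#     print(ckpt, language_separability(ckpt, sentences, lang_labels))
```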
Giovanni Puccetti
Frederick Riemenschneider (University of Heidelberg)
Abstract:
We present ArTST, a pre-trained Arabic text and speech transformer that extends the unified-modal framework of SpeechT5, originally developed for English. ArTST jointly learns speech and text representations within a single model architecture, enabling it to support multi-modal input and output during pre-training. This unified model can be fine-tuned individually for a variety of downstream tasks, including speech recognition, speech synthesis, speech enhancement, speaker identification, and dialect identification.
Subsequently, we explore the research question “Can we train a single model simultaneously for multiple cross-modal speech-text tasks?”. Speech recognition and speech synthesis models are usually trained separately, each with its own objectives, datasets, and model parameters, resulting in two distinct large networks. We adopt the SpeechT5 framework for unified fine-tuning and report promising results in high-resource (English) and low-resource (Arabic) settings.
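As a hedged sketch of what loading the unified model with a task-specific head looks like in practice, using the SpeechT5 classes available in the transformers library and the public English ASR checkpoint (an ArTST checkpoint would presumably load the same way if it follows the SpeechT5 architecture, but that checkpoint name is not assumed here):

```python
# Sketch: load the unified SpeechT5 model with an ASR head and transcribe.
# Uses the public English checkpoint; loading an ArTST checkpoint the same
# way is an assumption based on its stated SpeechT5 lineage.
import torch
from transformers import SpeechT5ForSpeechToText, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# `waveform` should be a 16 kHz mono audio array (e.g. via soundfile/librosa).
waveform = torch.zeros(16000)  # one second of silence as a stand-in
inputs = processor(audio=waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    ids = model.generate(**inputs, max_length=100)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```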
Giovanni Puccetti
Hawau Olamide Toyin (MBZUAI)
Abstract:
Understanding how models embed tokens in vector space is critical for interpreting their behavior. This talk explores the geometric properties of Large Language Model (LLM) embeddings and Multimodal-LLM (MLLM) embeddings through three studies.
The first study introduces IsoScore, a novel metric for measuring isotropy, i.e., how uniformly variance is distributed in the embedding space. This study finds that LLM representations are dominated by a small set of "outlier dimensions", defined as dimensions with exceedingly high variance and magnitude. We use IsoScore to demonstrate that reducing isotropy correlates strongly with improved LLM classification performance.
Next, this talk examines how LLMs adapt their embeddings to encode task-specific knowledge, showing that outlier dimensions play a central role in storing such information.
Finally, I will present ongoing work on the role of outlier dimensions in storing factual associations in MLLMs. We first propose VisualCounterfact, which consists of tuples that alter specific visual properties (color, size, and texture) of common objects. For instance, given (banana, color, yellow), we create a counterfact image (banana, color, purple) by modifying the object's pixels. Using VisualCounterfact, we locate a mechanism, dominated by outlier dimensions, for reliably controlling whether a model will answer with the counterfactual property present in the image or retrieve the world-knowledge answer from its weights.
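As a hedged sketch, here is the IsoScore recipe as we understand it from the published description (reconstructed from memory; the reference implementation may differ in details such as the normalization constants):

```python
# Sketch of the IsoScore recipe (Rudman et al., 2022), from memory;
# constants may differ from the reference implementation.
import numpy as np

def isoscore(points: np.ndarray) -> float:
    """points: (num_points, dim) embedding matrix.
    Returns a score in [0, 1]; 1 means variance is spread uniformly
    across all dimensions (fully isotropic), 0 means a single axis
    carries all the variance."""
    n = points.shape[1]
    # 1) Reorient the point cloud along its principal components.
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    reoriented = centered @ eigvecs
    # 2) Per-axis variance in the PCA basis (diagonal of the covariance).
    var = reoriented.var(axis=0)
    # 3) Normalize the variance vector to length sqrt(n).
    var_hat = np.sqrt(n) * var / np.linalg.norm(var)
    # 4) Isotropy defect: distance from the all-ones (isotropic) vector.
    delta = np.linalg.norm(var_hat - np.ones(n)) / np.sqrt(2 * (n - np.sqrt(n)))
    # 5) Rescale so the final score covers [0, 1].
    k = (n - delta**2 * (n - np.sqrt(n))) ** 2 / n**2
    return (n * k - 1) / (n - 1)
```

With this normalization, a perfectly isotropic cloud scores 1 and a cloud whose variance lives entirely in one dimension scores 0.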
Giovanni Puccetti
William Rudman (Brown University)
Young Researcher Seminars (27/10/2024)
Abstract:
Transformer-based Language Models (LMs) are known to have outliers: specific dimensions in their hidden representations whose magnitude is high compared to the others. Removing these dimensions can compromise model performance, yet they encode sufficient information to solve specific downstream tasks, such as language inference. This phenomenon is well documented for encoder LMs, but it is less studied for generative Large Language Models (LLMs), where longer training changes the properties of outliers. This ongoing work investigates outliers in LLMs to understand which behaviours are preserved and which change. In particular, we attempt to relate outliers to in-context learning, one of the principal innovations introduced by generative LLMs.
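As a hedged illustration of one common way to operationalize "outlier dimensions" (the exact criterion used in this work is not stated here; the k-sigma threshold below is an assumption in the spirit of prior work on encoder LMs):

```python
# Sketch: flag hidden dimensions whose typical magnitude sits far above
# the rest. The k-sigma criterion is an illustrative choice, not
# necessarily the definition used in this work.
import numpy as np

def outlier_dimensions(hidden_states: np.ndarray, k: float = 3.0) -> np.ndarray:
    """hidden_states: (num_tokens, dim) activations from one layer.
    Returns indices of dimensions whose mean |activation| exceeds the
    across-dimension mean by more than k standard deviations."""
    mags = np.abs(hidden_states).mean(axis=0)   # per-dimension magnitude
    z = (mags - mags.mean()) / mags.std()       # standardize across dimensions
    return np.flatnonzero(z > k)

# Usage: collect activations over a corpus, then e.g. ablate (zero out) the
# flagged dimensions and measure the drop in task or in-context learning
# performance, or train a probe on those dimensions alone.
```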
Fabio Carrara
Giovanni Puccetti