Center for Neurocognitive Modeling Inauguration day (9.1.2025)
Since Landauer and Dumais (1996) introduced Latent Semantic Analysis into psychology, many different language models have been applied for different purposes. I decided to turn towards language modeling in 2008 while preparing my talk at the world congress of psychology in Berlin. I had become annoyed that my former main research themes, such as emotional valence or sublexical effects on reading, led to weak effects with questionable reproducibility. With language models, I was delighted by new questions such as: How much variance can I explain? Or did I use the wrong language model for this task?
Language models provide a probabilistic representation of language. I started with simple co-occurrence statistics, which we built into our localist connectionist model of word recognition as symbolic associative connections (e.g., Hofmann et al., 2011). We successfully tested this model using association ratings, priming and recognition memory (e.g., Hofmann et al., 2018; Hofmann & Jacobs, 2014; Roelke et al., 2018), and also using EEG, fMRI and eye tracking data (e.g., Franke et al., 2016; Hofmann et al., 2022b; Roelke & Hofmann, 2022). It is a really simple language model, which should be intuitively understandable to any psychologist, because it is based on a log likelihood test: Two words are defined as associated when they occur together in the sentences of a large text corpus more frequently than expected from their single word frequencies.
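To make the idea concrete, here is a minimal sketch of such a log likelihood test on sentence-level co-occurrence counts. It is an illustration under stated assumptions, not the implementation used in the cited studies: the function names, the toy counts and the threshold of 6.63 (the chi-square critical value for p = .01 with one degree of freedom) are all chosen for this example.

```python
import math

def log_likelihood_ratio(n_both, n_a, n_b, n_sentences):
    """G^2 statistic for a 2x2 table of sentence counts.

    n_both: sentences containing both words
    n_a, n_b: sentences containing word A / word B
    n_sentences: total number of sentences in the corpus
    """
    # Observed counts: rows = A present / absent, columns = B present / absent.
    observed = [
        [n_both, n_a - n_both],
        [n_b - n_both, n_sentences - n_a - n_b + n_both],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                # Count expected if the two words occurred independently.
                e = row_totals[i] * col_totals[j] / n_sentences
                g2 += o * math.log(o / e)
    return 2.0 * g2

def is_associated(n_both, n_a, n_b, n_sentences, threshold=6.63):
    """Assumed criterion: significant G^2 AND more co-occurrences than expected."""
    expected_both = n_a * n_b / n_sentences
    g2 = log_likelihood_ratio(n_both, n_a, n_b, n_sentences)
    return n_both > expected_both and g2 > threshold

# Hypothetical counts: each word occurs in 100 of 1000 sentences.
print(is_associated(50, 100, 100, 1000))  # co-occur far more often than expected
print(is_associated(10, 100, 100, 1000))  # co-occur exactly as often as expected
```

The extra check that the observed co-occurrence count exceeds the expected one matters because the statistic itself is also large when two words avoid each other.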
While this approach does allow for predictions of words in sentences (Hofmann et al., 2022b), language models with a positional representation of words in sentences, such as n-gram models or recurrent neural networks, are more suitable for predicting cloze completion, eye-movement and EEG data from sentence reading (Hofmann et al., 2017; 2022b); non-positional topic models worked less well in predicting such data. What I really prefer about language models, as opposed to cloze completion performance for defining “predictability” from sentence context, is their explanatory value (Hofmann et al., 2022b; Hofmann & Jacobs, 2014). They estimate semantic structure from a sample of possible human experiences with text, which provides a different level of explanation than human behavioral data. Drawing conclusions from human performance in one task, for instance free association, about human performance in another, e.g. recognition memory, always appeared circular to me (Hofmann et al., 2011). The aforementioned language models are quite simple, so they provide a relatively intelligible explanation for human memory consolidation (Hofmann et al., 2018). We now also take a sample of the language experience of each participant to estimate their individual semantic structure.
Since transformer-based models such as ChatGPT became famous, all psychologists seem to be starting with language modeling, or at least they use prompting. What I really love about the transformer is its computationally defined attention. In psychology, attention often seemed to be a computationally rather undefined process, and the theoretical label “attention” was given to many effects that could not be explained otherwise. My simple verbal definition of what attention is in transformers is: Attention identifies the information most relevant for making predictions. The computational definition is more complex, however.
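As a rough illustration of that computational definition, the sketch below implements scaled dot-product attention for a single query, softmax(q·K / sqrt(d))·V, which is the core operation in the standard transformer architecture. It is deliberately stripped down: real transformers add learned projection matrices, multiple heads and masking, and the toy vectors and function names here are assumptions made up for this example.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each key is compared with the query; the resulting weights decide
    how much of each value vector enters the output.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Toy example: three context positions, two-dimensional vectors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

weights, output = attend(query, keys, values)
print(weights)  # the first position, whose key matches the query, gets the largest weight
print(output)
```

In this reading, "the information most relevant for making predictions" corresponds to the positions that receive the largest attention weights.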
Of course, we use transformers when the psychological task is about larger language contexts. Here, they are quite useful. However, I like to understand as much as possible about the computational mechanisms predicting human performance. Thus, even if simple language models perform a little worse than a large language model, I see them as the better theoretical explanation, following Einstein’s formulation of Ockham’s razor: Make it as simple as possible…
… but not simpler!
… and for many tasks on the semantic similarity of two words, a simple word2vec model seems sufficient to me.
Author: Markus Hofmann
References