Individual text corpora for the language modeling of personality, knowledge and intelligence
This project is part of the long-term priority program "New Data Spaces for the Social Sciences" of the German Research Foundation. The overall aim of this project is to create methods for the safe handling of Big Data, which are primarily combined with survey data from large panel studies. We would like to establish best practices for data protection. One specific concern of our sub-project is to establish safe methods for data re-use. Overall, we would like to establish data ethics.
In particular, our sub-project relies on web search history and web tracking data to generate individual text corpora (ICs). From these, we train language models (or fine-tune large language models) for each participant to obtain a computational model of each individual's semantic structure. First, the similarities of ICs to personality- and knowledge-descriptive terms are used as features for the predictive modeling of the corresponding survey answers. We see this as an implicit measure of personality and knowledge, which can be compared with the explicit survey answers. We also examine the fine-grained interplay of personality, knowledge and specific interests from a longitudinal perspective by relying on sub-corpora reflecting different time frames. Second, we would like to measure the IQ of large language models in similarity-based, analogical reasoning and cloze completion tasks. We compare their performance with human norms for fluid and crystallized intelligence from an IQ test. We then examine whether language models trained on ICs can replace individual intelligence test answers, and we aim to gain a deeper understanding of the mechanisms leading to intelligent behavior.
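The first step, using similarities between an IC and descriptive terms as features for predicting survey answers, can be illustrated with a minimal sketch. It is not the project's actual pipeline: it assumes that each participant's IC and each descriptive term are already represented as vectors in a shared embedding space, uses random toy vectors and synthetic survey scores, and swaps in a plain closed-form ridge regression as the predictive model. The term list and all function names are hypothetical.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_features(ic_vector: np.ndarray, term_vectors: np.ndarray) -> np.ndarray:
    """One feature per descriptive term: similarity of the IC to that term."""
    return np.array([cosine_similarity(ic_vector, t) for t in term_vectors])


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float = 1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X: np.ndarray, w: np.ndarray, b: float) -> np.ndarray:
    """Linear prediction from similarity features."""
    return X @ w + b


# Toy data only: random vectors stand in for real IC and term embeddings.
rng = np.random.default_rng(0)
n_participants, dim = 40, 16
terms = ["curious", "imaginative", "conventional"]  # hypothetical term list
term_vectors = rng.normal(size=(len(terms), dim))
ic_vectors = rng.normal(size=(n_participants, dim))

# Feature matrix: participants x descriptive terms.
X = np.vstack([similarity_features(v, term_vectors) for v in ic_vectors])

# Synthetic "survey scores", constructed to depend on the features plus noise.
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.05, size=n_participants)

# Fit on the first 30 participants, evaluate on the remaining 10.
train, test = slice(0, 30), slice(30, None)
w, b = fit_ridge(X[train], y[train], alpha=0.1)
r = np.corrcoef(predict(X[test], w, b), y[test])[0, 1]
print(f"held-out correlation: {r:.2f}")
```

Because the synthetic scores are built from the features themselves, the held-out correlation here only shows that the code path works. With real data, the correlation between predicted and explicit survey answers would be the quantity of interest.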
Present funding period: 2024-2027 (HO 5139/6-1) for Christoph Wigbels and Florian Grether
Author: Markus J. Hofmann
References:
- Hofmann, M. J., Jansen, M. T., Wigbels, C., Briesemeister, B., & Jacobs, A. M. (2024). Individual Text Corpora Predict Openness, Interests, Knowledge and Level of Education. In Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024 (pp. 14-25).
- Hofmann, M. J., Müller, L., Rölke, A., Radach, R., & Biemann, C. (2020). Individual corpora predict fast memory retrieval during reading. In Proceedings of the 6th Workshop on Cognitive Aspects of the Lexicon (CogALex-VI) (pp. 1-11). Barcelona,