Talkie is an AI language model trained only on pre-1931 texts
by Rob Beschizza · Boing Boing

There are various LLMs that focus on public domain texts, for ethical or experimental reasons, but most also incorporate modern material (such as Wikipedia), so they retain a fully contemporary awareness of history, society, current events and the rest of it. Talkie, however, is a 13B-parameter language model trained exclusively on English texts published before 1931.
talkie-1930-13b-it was finetuned on a novel dataset of instruction-response pairs extracted from pre-1931 reference works, including etiquette manuals, encyclopedias, and letter-writing guides. The model then underwent reinforcement learning (online DPO with an LLM-as-a-judge) to improve its instruction-following ability.
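The post doesn't include training code, but the recipe it names is a standard one. Below is a minimal sketch of a single online-DPO step under that reading: the policy samples two candidate responses, a judge LM picks the better one, and the resulting pair is scored with the DPO objective against a frozen reference model. `sequence_logprob`, `dpo_loss`, `online_dpo_step`, and the `judge` callable are hypothetical helpers, not Talkie's actual implementation.

```python
# A minimal sketch of one online-DPO step, assuming HuggingFace-style
# causal LMs. All helper names here are illustrative stand-ins.
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, response_ids):
    """Sum of log-probabilities the model assigns to response_ids given prompt_ids."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits[0, :-1]            # prediction for each next token
    targets = input_ids[0, 1:]
    token_logps = F.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[prompt_ids.size(-1) - 1:].sum()  # score only the response tokens

def dpo_loss(policy, reference, prompt, chosen, rejected, beta=0.1):
    """Standard DPO objective on a single (chosen, rejected) pair."""
    pi_c = sequence_logprob(policy, prompt, chosen)
    pi_r = sequence_logprob(policy, prompt, rejected)
    with torch.no_grad():                               # reference model stays frozen
        ref_c = sequence_logprob(reference, prompt, chosen)
        ref_r = sequence_logprob(reference, prompt, rejected)
    return -F.logsigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))

def online_dpo_step(policy, reference, tokenizer, judge, prompt_text, optimizer):
    """Sample two candidates, let a judge LM rank them, take one gradient step."""
    prompt = tokenizer(prompt_text, return_tensors="pt").input_ids[0]
    a = policy.generate(prompt.unsqueeze(0), do_sample=True,
                        max_new_tokens=128)[0, prompt.size(-1):]
    b = policy.generate(prompt.unsqueeze(0), do_sample=True,
                        max_new_tokens=128)[0, prompt.size(-1):]
    # `judge` is a hypothetical callable returning True if the first response wins.
    a_wins = judge(prompt_text, tokenizer.decode(a), tokenizer.decode(b))
    chosen, rejected = (a, b) if a_wins else (b, a)
    loss = dpo_loss(policy, reference, prompt, chosen, rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```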
The result is like talking to an educated gentleman of the high imperial era. He is fascinating, amazingly well-read for the time, and yet full of bad ideas. What ideas? Oh, you know the ones.
These models are fascinating conversation partners. But we are also excited by the possibility that the careful study of the behaviors and capabilities of vintage LMs will advance our understanding of AI in general. For example, we can evaluate LMs' ability to predict the future. Inspired by Calcifer Computing's work on Temporal Language Models, we calculated the surprisal (per-token negative log-likelihood) of short descriptions of historical events under a 13B model trained on pre-1931 text (Figure 1). Surprisal rises after the knowledge cutoff, with a particularly pronounced increase for events of the 1950s and 1960s, followed by a plateau. We will continue to develop evals to measure with greater confidence how forecasting performance improves with model size and decays at longer horizons. Training larger vintage language models will allow us to uncover these scaling trends.
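For concreteness, here is a minimal sketch of how such a surprisal number can be computed, assuming a HuggingFace-style causal LM; treating the checkpoint name from the post as a repo id is my assumption, and the example sentences are illustrative.

```python
# Mean per-token surprisal (nats) of a text under a causal LM.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("talkie-1930-13b-it")  # assumed repo id
model = AutoModelForCausalLM.from_pretrained("talkie-1930-13b-it")
model.eval()

def mean_surprisal(text: str) -> float:
    """Average negative log-likelihood per token of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]   # predictions for tokens 1..n-1
    targets = ids[0, 1:]
    return F.cross_entropy(logits, targets, reduction="mean").item()

# Events after the 1930 cutoff should score as more surprising:
print(mean_surprisal("In 1928, Alexander Fleming discovered penicillin."))
print(mean_surprisal("In 1969, American astronauts landed on the Moon."))
```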
Consider MMAcevedo ("For the original human, see Miguel Acevedo"), who might already be well-approximated by an LLMAcevedo, should anyone care to train it.