Vintage chatbot lives in the past like an elderly relative
Talkie's training data stops at the end of 1930, and its creators hope it'll help us better understand how AI thinks
by Brandon Vigliarolo · The Register

If you're tired of interacting with a bot that spews Nazi propaganda or refers to itself as MechaHitler, you could sign off of Elon Musk's xAI. Or, just to be sure, use an LLM whose training data ends in 1930, three years before the Nazis took power in Germany and nine years before World War II started.
A trio of AI researchers has released a 13-billion-parameter "vintage" language model they call Talkie, which has been trained solely on digital scans of English-language books, newspapers, periodicals, scientific journals, patents, and case law published before the end of 1930. Pre-1931 works were chosen because works published through 1930 are currently the most recent to have entered the public domain in the United States.
In other words, if you're looking for information on World War II, the election of Franklin D. Roosevelt, Amelia Earhart's solo Atlantic flight, or how a microwave oven works, you're out of luck. Ask it about Betty Boop, flappers, the state of the US economy as the Great Depression began, or the sociological effects of the introduction of car radios, and you've come to the right place.
This isn't the first vintage AI model to appear, mind you, with others trained on Victorian literature and pre-1900 scientific texts already out in the world. It is, however, the largest its creators are aware of.
"Talkie is the largest vintage language model we are aware of, and we plan to continue scaling significantly," the team behind it noted.
Neat, but … why?
Sure, it could be neat to chat with an AI that would only hallucinate things in the style of a flapper, or the early works of philosopher Bertrand Russell, but with every query to an AI potentially burning up the planet, one really has to ask why an AI with a knowledge cutoff of 1930 is necessary. We put that question to Talkie's creators, and between the project's writeup and their answers to us, there's no shortage of explanations.
"These models are fascinating conversation partners … but we are also excited by the possibility that the careful study of the behaviors and capabilities of vintage LMs will advance our understanding of AI in general," the Talkie team wrote.
As one example, they cite testing an AI's ability to predict the future; as another, they propose undertaking what Google DeepMind co-founder and CEO Demis Hassabis has said would be a good test of AGI: cut a model's knowledge off at 1911, and have it try to come up with general relativity using the same information Einstein had when he developed the theory in 1915.
In other words, can this AI make accurate scientific discoveries using only the knowledge available to people who made them?
It's not clear whether Talkie has been put to a test as tough as coming up with general relativity, but it was pitted against a model with identical architecture, trained on modern data, to see whether it could solve Python programming test problems. It did generate some correct solutions, but with considerable limits.
"All correct solutions generated by the vintage models are simple one-line programs (such as adding two inputs), or small modifications to in-context example programs," the Talkie team said. In other words, "There is still a long way to go before this capability is notable," per the team.
While improving LLM performance is a major part of Talkie's objective, the team behind it told us that it wasn't the only goal.
David Duvenaud, associate professor in computer science and statistics at the University of Toronto and one of the three people behind Talkie, told The Register in an email that he hopes Talkie will also help evaluate long-term forecasting methods, given that anything it "predicts" has, from our vantage point, already happened and can be checked against the record.
Duvenaud also explained that his team is interested in using Talkie to study cultural change. "For instance, we can use these models to try to understand how a law would have been interpreted at the time it was written, based on the implicit assumptions and meaning of language at the time," Duvenaud told us.
"A third motivation is understanding how models form their own self-conception," Duvenaud added. "'How an LLM acts' is a self-fulfilling prophecy in some senses, so we can learn about this by talking to models who don't even know what an LLM is."
Still, Duvenaud admitted that, at just 13 billion parameters, there's a big capability gap between Talkie and AI models trained on modern data.
"As an amateur research effort, we never expect to be able to fully close this gap, in data or compute," the compsci prof told us.
Okay, so aside from those limitations, how does it perform?
Don't blame me, blame your non-digital training data
Returning to that same-architecture model trained on modern data and used to benchmark Talkie, it looks like it's not just scientific discovery or proof of AGI that's lacking from the vintage version.
"On average, talkie underperforms its modern counterpart in standard LM evaluations, even after correcting for question anachronism, despite being trained with the same number of FLOPs," the team wrote. It did do similarly well to the modern model on core language understanding and numeracy tests, Talkie's creators noted, and they suspect they know what's to blame for its subpar performance elsewhere: optical character recognition (OCR).
"Because there was no digital publishing in 1930, all text in our dataset had to be transcribed from a physical source, which introduces a form of noise not seen in natively digital text," team Talkie said.
Anyone who's had to deal with OCR'ed documents knows how often such computer vision tools get things wrong, and that noise can lead an AI like Talkie to regurgitate bad, or even nonsensical, responses.
Through their work on Talkie, the team determined that training a language model on OCR'ed pre-1931 texts gave it only 30 percent of the performance of a model trained on human-transcribed copies of the same documents. Regex data cleansing raised the performance of the OCR'ed texts to 70 percent of the human-transcribed copies, but that's still too large a discrepancy for Talkie's creators, who're working on their own OCR engine for generating more training data for Talkie.
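The writeup doesn't spell out the cleaning rules, but regex-based cleanup of OCR'ed text generally looks something like the minimal sketch below; the specific substitutions are illustrative assumptions on our part, not Talkie's actual pipeline.

```python
import re

# A minimal sketch of regex-based cleanup for OCR'ed text. These rules
# are illustrative assumptions, not Talkie's actual cleansing pipeline.
OCR_FIXES = [
    (re.compile(r"-\n"), ""),                 # rejoin words hyphenated across line breaks
    (re.compile(r"[ \t]+"), " "),             # collapse runs of spaces and tabs
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),  # 'l' or 'I' misread between digits
    (re.compile(r"(?<=\d)O(?=\d)"), "0"),     # 'O' misread as zero between digits
]

def clean_ocr_text(text: str) -> str:
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(clean_ocr_text("The rail-\nway carried 1O5 passengers."))
# -> The railway carried 105 passengers.
```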
Talkie also has a problem with "temporal leakage," said the team: It was able to identify FDR as the president in 1936 and list some of his legislative accomplishments, despite its training data supposedly cutting off at the end of 1930. According to the team, that's just "an example of imperfect filtering of the pre-training corpus" and something they're still working on.
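Filtering by publication date is conceptually simple - something like the minimal sketch below, which assumes every document record carries reliable year metadata (a hypothetical schema, not Talkie's actual corpus format). In practice, scanned archives often have missing or incorrect dates, which is presumably how leaks like the FDR example slip through.

```python
# A minimal sketch of date-based corpus filtering. The "year" field is a
# hypothetical schema, not Talkie's actual pre-training format.
def filter_pre_cutoff(documents, cutoff_year=1931):
    """Keep only documents verifiably published before the cutoff year."""
    return [
        doc for doc in documents
        if doc.get("year") is not None and doc["year"] < cutoff_year
    ]

corpus = [
    {"title": "A 1928 periodical", "year": 1928},
    {"title": "A 1936 newspaper", "year": 1936},  # would leak FDR's presidency
    {"title": "An undated scan", "year": None},   # dropped: date can't be verified
]
print(filter_pre_cutoff(corpus))  # keeps only the 1928 periodical
```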
Talkie is far from a perfect example of a chatbot in a time capsule, in other words, but the team behind it says it's intent on scaling the model in the coming months. Tasks will include moving beyond English-language texts, re-OCR'ing as much of its training data as possible, strengthening anachronism detection methods, and working with historians to curate better post-training data.
If you're wondering whether it takes an accurate view of Nazism, just know that its view is stuck in the 1920s. It knows that the Nazis "are" an antisemitic, authoritarian political party in Germany, but it thinks they are led by someone named Hermann Joseph von Hitler, a person born in 1870 (19 years before the real Adolf Hitler).
If all goes according to plan, a GPT-3-level version of Talkie should be out by this summer.
"A preliminary estimate also suggests we can grow our corpus to well over a trillion tokens of historical text, which should be sufficient to create a GPT-3.5 level model - similar in capability to the original ChatGPT," Talkie's creators added.
In the meantime, the current version of Talkie is available to download from GitHub and Hugging Face, and can also be chatted with via a web interface for those curious.
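The writeup quoted here doesn't include the exact repository ID, but loading a Hugging Face-hosted checkpoint typically follows the standard transformers pattern - a minimal sketch with a placeholder model ID, not Talkie's actual identifier:

```python
# A minimal sketch of loading a Hugging Face-hosted checkpoint with the
# standard transformers API. The repo ID is a placeholder - check the
# project's Hugging Face page for the real one.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "example-org/talkie-13b"  # placeholder, not the actual repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The talk of the town this season is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Just mind the warning, though.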
"Talkie reflects the culture and values of the texts it was trained on … It can produce outputs that are inaccurate or offensive," reads an advisory on Talkie's web client. "Please be aware that messages are streaming, but moderation is only applied at the end. As a result, you may see objectionable content briefly before it is flagged." ®