The Internet's Most Powerful Archiving Tool Is in Peril

by · WIRED

This month, USA Today published an excellent report that revealed how US Immigration and Customs Enforcement delayed disclosing key information about the impacts of its detainment policies. The authors used the Internet Archive’s Wayback Machine to compile and analyze detention statistics from ICE and track how the agency had changed under the Trump administration. The story is one of countless examples of how the Wayback Machine, which crawls and preserves web pages, has helped preserve information for the public good. It was also, Wayback Machine director Mark Graham says, “a little ironic.”

USA Today Co., the publishing conglomerate formerly known as Gannett that runs both its namesake paper and over 200 additional media outlets, bars the Wayback Machine from archiving its work. “They're able to pull together their story research because the Wayback Machine exists. At the same time, they're blocking access,” Graham says.

A number of other major journalism organizations have also recently moved to restrict the Wayback Machine from archiving their stories, including The New York Times. According to analysis by the artificial-intelligence-detection startup Originality AI, 23 major news sites are currently blocking ia_archiverbot, the web crawler commonly used by the Internet Archive for the Wayback project. The social platform Reddit is too. Other outlets are limiting the project in different ways: The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles.

USA Today Co. spokesperson Lark-Marie Anton emphasized that “this effort is not about specifically blocking the Internet Archive” but is instead part of the company’s broader efforts to block all scraping bots. Robert Hahn, the Guardian’s director of business affairs and licensing, says that it has been in conversation with the Archive over “concerns over potential misuse by AI companies of content sets crawled for preservation purposes.”

Now, individual reporters are pushing back on this trend. This week, advocacy organizations including the Electronic Frontier Foundation and Fight for the Future rallied journalists around the Wayback Machine’s cause. The coalition collected more than 100 signatures from working journalists who recognize the tool’s value and presented a letter of support to the Internet Archive. Signatories range from television mainstay Rachel Maddow to independent reporters like Spitfire News’ Kat Tenbarge and User Mag’s Taylor Lorenz. “In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history,” the letter reads. “With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism’s record increasingly falls to the Internet Archive.”

Laura Flynn, a signatory and supervising podcast producer at The Intercept, says that the Internet Archive has been an “essential tool” throughout her career, playing an instrumental role in fact-checking and surfacing audio clips. Another signatory, Chicago Reader writer Micco Caporale, says the Wayback Machine helps when writing about older bands and cultural figures by providing access to old fan sites that would otherwise be lost to time.

Caporale says the tool has also been useful in their role as a union organizer. “I've also been using the Wayback Machine a ton in my union organizing work to find old job listings so we know what the company claimed to hire people for vs. what duties they actually assigned or to see how different positions have been retooled at different points,” Caporale says. “These posts also help us keep track of pay fluctuations across the organization over time.”

Other publishers have justified their decision to block the Wayback Machine by pointing to concerns about how tech companies may use the Internet Archive’s data to train artificial intelligence models. New York Times spokesperson Graham James says that “the issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.” (The Times declined to clarify whether this was something that was actually happening or rather a hypothetical concern.)

Reddit has previously said that concerns about AI also led it to block the Wayback Machine crawler. There’s an ongoing war between publishers and AI companies over the legality of AI tools training on their content without permission; many of the over 100 AI copyright lawsuits in the United States focus on this issue. Tech companies use content from all over the internet, and because the Wayback Machine offers such an extensive trove of material, it is considered a particularly appealing data source.

The Internet Archive has been around for 30 years and has archived over a trillion web pages. The nonprofit has weathered several major legal fights since 2020. Most recently, it settled with a group of major music publishers that had been seeking damages of up to $700 million over the Archive’s Great 78s project, which archived vintage recordings. Although there’s no major financial penalty at stake right now, the growing trend of media outlets blocking the Wayback Machine still poses a serious threat to its mission.

There is no widely available public tool comparable to the Wayback Machine, and if it continues to lose access to major news sources, its preservation efforts could erode to the point where early digital records of history become much harder to access, or are even lost altogether. Notably, the tool has been used in reporting on The New York Times: In 2016, the paper came under scrutiny for editorial changes it made to an article on US senator and then-presidential candidate Bernie Sanders of Vermont. The revisions were first tracked using the Wayback Machine.

If a similar situation arose today, watchdog media reporters might struggle to track older versions of Times articles in the same way. A kneecapped Wayback Machine isn’t just bad news for accountability journalism. It will also be a blow to the legal system, as pages archived by the tool are frequently cited as evidence in litigation across the United States.

The Internet Archive’s Mark Graham hasn’t given up hope that some of the publishers currently blocking its crawlers may eventually change course. He says that the nonprofit is “in conversation” with the Times and other outlets. But for now, Graham says, “there's no question that the general locking-down of more and more of the public web is impacting society’s ability to understand what's going on in our world.”