Internet Archive
from Wikipedia

Since late 2009, the headquarters of the Internet Archive has been the building that formerly housed the Fourth Church of Christ, Scientist in San Francisco, California.


The Internet Archive is an American non-profit library founded in 1996 by Brewster Kahle that runs a digital library website, archive.org.[4][5][6] It provides free access to collections of digitized media including websites, software applications, music, audiovisual, and print materials. The Archive also advocates a free and open Internet. Its stated mission is to provide "universal access to all knowledge".[7]

The Internet Archive allows the public to upload and download digital material to its data cluster, but the bulk of its data is collected automatically by its web crawlers, which work to preserve as much of the public web as possible. Its web archive, the Wayback Machine, contains hundreds of billions of web captures.[8][9] The Archive also oversees numerous book digitization projects, collectively one of the world's largest book digitization efforts.

History

Headquarters in Building 116 of the Presidio of San Francisco in 2008

Brewster Kahle founded the Archive in May 1996, around the same time that he began the for-profit web crawling company Alexa Internet.[10][11] The earliest known archived page on the site, the download page for Internet Explorer, was saved on May 10, 1996, at 14:42 UTC (7:42 am PDT).[12][better source needed] By October of that year, the Internet Archive had begun to archive and preserve the World Wide Web in large amounts.[13] The archived content became more easily available to the general public in 2001, through the Wayback Machine.

In late 1999, the Archive expanded its collections beyond the web archive, beginning with the Prelinger Archives. Now, the Internet Archive includes texts, audio, moving images, and software. It hosts a number of other projects: the NASA Images Archive, the contract crawling service Archive-It, and the wiki-editable library catalog and book information site Open Library. Soon after that, the Archive began working to provide specialized services relating to the information access needs of the print-disabled; publicly accessible books were made available in a protected Digital Accessible Information System (DAISY) format.[14]

According to its website:[15]

Most societies place importance on preserving artifacts of their culture and heritage. Without such artifacts, civilization has no memory and no mechanism to learn from its successes and failures. Our culture now produces more and more artifacts in digital form. The Archive's mission is to help preserve those artifacts and create an Internet library for researchers, historians, and scholars.

In August 2012, the Archive announced[16] that it had added BitTorrent to its file download options for more than 1.3 million existing files, and all newly uploaded files.[17][18] This method is the fastest means of downloading media from the Archive, as files are served from two Archive data centers, in addition to other torrent clients which have downloaded and continue to serve the files.[17][19]

On November 6, 2013, the Internet Archive's headquarters in San Francisco's Richmond District caught fire,[20] destroying equipment and damaging some nearby apartments.[21] According to the Archive, it lost a side-building housing one of its 30 scanning centers; cameras, lights, and scanning equipment worth hundreds of thousands of dollars; and "maybe 20 boxes of books and film, some irreplaceable, most already digitized, and some replaceable".[22] The nonprofit Archive sought donations to cover the estimated $600,000 in damage.[23]

An overhaul of the site was launched as beta in November 2014, and the legacy layout was removed in March 2016.[24][25]

In November 2016, Kahle announced that the Internet Archive was building the Internet Archive of Canada, a copy of the Archive to be based somewhere in Canada. The announcement received widespread coverage because it implied that the decision to build a backup archive in a foreign country was motivated by the upcoming presidency of Donald Trump.[26][27][28]

Since 2017, OCLC and the Internet Archive have collaborated to make the Archive's records of digitized books available in WorldCat.[29]

Since 2018, the Internet Archive visual arts residency, organized by Amir Saber Esfahani and Andrew McClintock, has helped connect artists with the Archive's more than 48 petabytes[30] of digitized materials. Over the course of the yearlong residency, visual artists create a body of work that culminates in an exhibition. The hope is to connect digital history with the arts and create something for future generations to appreciate online or off.[31] Previous artists in residence include Taravat Talepasand, Whitney Lynn, and Jenny Odell.[32]

The Internet Archive acquires most materials from donations,[33] such as hundreds of thousands of 78 rpm discs from Boston Public Library in 2017,[34] a donation of 250,000 books from Trent University in 2018,[35] and the entire collection of Marygrove College's library after it closed in 2020.[36] All material is then digitized and retained in digital storage, while a digital copy is returned to the original holder and the Internet Archive's copy, if not in the public domain, is lent to patrons worldwide one at a time under the controlled digital lending (CDL) theory of the first-sale doctrine.[37]

On June 1, 2020, four large publishing houses – Hachette Book Group, Penguin Random House, HarperCollins, and John Wiley – filed a lawsuit against the Internet Archive before the United States District Court for the Southern District of New York, claiming that the Internet Archive's practice of controlled digital lending constituted copyright infringement. On March 25, 2023, the court found in favor of the publishers. The negotiated judgment of August 11, 2023, barred the Internet Archive from digitally lending books for which electronic copies are on sale.

Also on August 11, 2023, the music industry giants Universal Music Group, Sony Music and Concord (together with their respective labels Capitol Records, Arista Records and CMGI Recorded Music Assets) sued the Internet Archive before the same United States District Court for the Southern District of New York over the Internet Archive's Great 78 Project for $621 million in damages from alleged copyright infringement.[38][39][40] The lawsuit was settled in September 2025.[41]

In September 2024, Google and the Internet Archive announced a collaboration in which links to the Wayback Machine would be included in the "More about this page" menu in Google Search. This collaboration effectively replaced Google's own Google Cache service, which it had retired earlier that year.[42][43]

On July 24, 2025, the Internet Archive was designated as a Federal Depository Library by the U.S. Senate, allowing it to store public access government records.[44][45] It opened a new headquarters for its European branch on September 19, 2025.[46]

Cyberattacks


During the week of May 27, 2024, the Internet Archive suffered a series of distributed denial of service (DDoS) attacks that made its services unavailable intermittently, sometimes for hours at a time, over a period of several days.[47][48][49] The attack was claimed on May 28 by a hacker group called SN_BLACKMETA,[50][51] with possible links to Anonymous Sudan.[52] The incident drew a comparison with the 2023 British Library cyberattack, which affected the UK Web Archive.[53]

Internet Archive main page showing partially available services

Beginning October 9, 2024, the Internet Archive's team, including archivist Jason Scott and security researcher Scott Helme, confirmed DDoS attacks, site defacement, and a data breach. The purported hacktivist group SN_BLACKMETA again claimed responsibility.[54] A pop-up on the defaced site claimed that there was a "catastrophic" security breach, stating "Have you ever felt like the Internet Archive runs on sticks and is constantly on the verge of suffering a catastrophic security breach? It just happened. See 31 million of you on HIBP!"[55][51] It was reported that about 31 million user accounts were affected, with the compromised data contained in a file called "ia_users.sql", dated September 28, 2024.[54][56] The attackers stole users' email addresses and bcrypt-hashed passwords.[57]

On October 11, Kahle said that the data was safe and that service would be back to normal "in days, not weeks."[58][59][60]

On October 13, the Wayback Machine was restored in a read-only format, while archiving web pages was temporarily disabled.[61]

On October 14, Brewster Kahle said "[the Wayback Machine] volume is back to normal: 1,500 requests per second".[62]

On October 15, 2024, the website was still mostly offline, with the Archive "prioritizing keeping data safe at the expense of service availability."[63]

On October 20, threat actors used stolen, unrotated API tokens to breach the Internet Archive's Zendesk email support platform; they also claimed responsibility for the other breaches, stating that SN_BLACKMETA was behind only the DDoS attacks.[64][65] Having been told that these threat actors had leaked some of the stolen data to others in the data-trafficking community, Bleeping Computer posited that they breached the "well-known and extremely popular" Internet Archive not to extort money but to "gain cyber street cred," thus "increasing their reputation."[64][65]

On October 21, Internet Archive went back online in a read-only manner.[66]

On October 22, all Internet Archive services temporarily went offline,[67][68] but later that same day, only the Wayback Machine, Archive-It, and blog.archive.org were resumed.[citation needed]

On October 23, archive.org, the Wayback Machine, Archive-It, and the Open Library services all resumed, although some features, such as logging in, remained unavailable; staff announced that they would return within the next day or two.[69]

On October 25, the login feature was made available and the site has remained active since.[citation needed]

Operations

Mirror of the Internet Archive in the Bibliotheca Alexandrina

The Archive is a 501(c)(3) nonprofit operating in the United States. In 2019, it had an annual budget of $37 million, derived from revenue from its Web crawling services, various partnerships, grants, donations, and the Kahle-Austin Foundation.[70] The Internet Archive also runs periodic funding campaigns. For instance, a December 2019 campaign had a goal of reaching $6 million in donations.[71] Its website servers run the Ubuntu operating system.[72]

Brewster Kahle of the Internet Archive talks about archiving operations.

The Archive is headquartered in San Francisco, California. From 1996 to 2009, its headquarters were in the Presidio of San Francisco, a former U.S. military base. Since 2009, its headquarters have been at 300 Funston Avenue in San Francisco, a former Christian Science Church. At one time, most of its staff worked in its book-scanning centers; as of 2019, scanning is performed by 100 paid operators worldwide.[73] The Archive also has data centers in three Californian cities: San Francisco, Redwood City, and Richmond. To reduce the risk of data loss, the Archive creates copies of parts of its collection at more distant locations, including the Bibliotheca Alexandrina[74][75] in Egypt and a facility in Amsterdam.[76]

As of 2025, the Internet Archive reportedly operates six data centers,[77] mainly in California, with smaller ones in other U.S. states, Canada, and Europe. They have controlled access and fire protection systems, and are monitored for security. All Internet Archive data centers adhere to the ISO/IEC 27001 standard, and some of them meet additional certifications.[78]

In 2016, the Internet Archive began work on a decentralized prototype of the digital library. In 2020, it began storing content on Filecoin.[79] By October 2023, one petabyte of data had been uploaded to the Filecoin network.[80]

The Archive is a member of the International Internet Preservation Consortium[81] and was officially designated as a library by the state of California in 2007.[82][83]

Web archiving


Wayback Machine

Wayback Machine logo, used since 2001

The Wayback Machine is a service that allows archives of the World Wide Web to be searched and accessed.[84] It can be used to see what previous versions of web sites looked like or to visit web sites that no longer exist. The Wayback Machine was created as a joint effort between Alexa Internet (owned by Amazon.com) and the Internet Archive.[15] Hundreds of billions of web sites and their associated data (images, source code, documents, etc.) are saved in a database. As of September 5, 2024, the Internet Archive held over 866 billion web pages, more than 42.5 million print materials, 13 million videos, 3 million TV news reports, 1.2 million software programs, 14 million audio files, 5 million images, and 272,660 concerts in its Wayback Machine.[7] In October 2025, the Internet Archive announced that the Wayback Machine had archived one trillion webpages, equivalent to more than 100,000 terabytes of data.[85]
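As an illustration of how an archived version of a page is addressed, the following minimal Python sketch builds a Wayback Machine replay URL from an original URL and a point in time. It assumes the commonly used address pattern on web.archive.org, in which a 14-digit UTC timestamp (YYYYMMDDhhmmss) precedes the original URL; the function name wayback_url and the example.com address are hypothetical, and the sketch does not check whether any capture actually exists.

```python
from datetime import datetime, timezone

# Assumed replay prefix; captures are addressed as <prefix>/<timestamp>/<original URL>.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL for `original_url` at (or near) the moment `when`.

    `when` must be timezone-aware; it is converted to UTC before formatting.
    """
    # 14-digit timestamp: year, month, day, hour, minute, second.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical example: a page as it might have looked in May 1996.
    moment = datetime(1996, 5, 10, 14, 42, tzinfo=timezone.utc)
    print(wayback_url("http://example.com/", moment))
```

When the requested timestamp does not match a stored capture exactly, the service generally serves the capture closest in time, so the timestamp works as a target rather than an exact key.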

Servers at the Internet Archive headquarters in San Francisco
A purchase of additional storage at the Internet Archive

Archive-It


Created in late 2005, Archive-It[86] is a web archiving subscription service that allows institutions and individuals to build and preserve collections of digital content and create digital archives. Users can customize which web content is captured or excluded when preserving material for cultural heritage reasons. Through a web application, Archive-It partners can harvest, catalog, manage, browse, search, and view their archived collections.[87]

In terms of accessibility, the archived websites are full-text searchable within seven days of capture.[88] Content collected through Archive-It is captured and stored as a WARC file. A primary and a backup copy are stored at the Internet Archive data centers. A copy of the WARC file can be given to subscribing partner institutions for geo-redundant preservation and storage in line with their own best-practice standards.[89] Periodically, the data captured through Archive-It is indexed into the Internet Archive's general archive.

As of March 2014, Archive-It had more than 275 partner institutions in 46 U.S. states and 16 countries that have captured more than 7.4 billion URLs for more than 2,444 public collections.[citation needed] Archive-It partners are universities and college libraries, state archives, federal institutions, museums, law libraries, and cultural organizations, including the Electronic Literature Organization, North Carolina State Archives and Library, Stanford University, Columbia University, American University in Cairo, Georgetown Law Library, and many others.[citation needed]

Internet Archive Scholar


In September 2020, Internet Archive announced a new initiative to archive and preserve open access academic journals, called Internet Archive Scholar.[90][91][92] Its full-text search index includes over 25 million research articles and other scholarly documents preserved in the Internet Archive. The collection spans from digitized copies of eighteenth century journals through the latest open access conference proceedings and pre-prints crawled from the World Wide Web.[citation needed]

General Index


In 2021, the Internet Archive announced the initial version of the General Index, a publicly available index to a collection of 107 million academic journal articles.[93][94]

Items and collections


The Archive stores files inside so-called items, which are similar to directories in that they can contain multiple files, but can have additional metadata such as a description and tags which make them more searchable.

Some file types can be previewed directly on the site, whereas others have to be downloaded in order to be opened. If multiple multimedia files exist in an item, the website generates a playlist for video or audio files, or a slide show for pictures. If an item contains at least one video or picture, the Archive generates a preview thumbnail that can be seen on collection pages and in searches. Items can contain mixed data such as music files with an album cover picture, in which case the picture is used as the thumbnail.[95][96][97][98]

Staff members of the Internet Archive organize items by placing them into so-called collections, which are pages listing multiple items.[99]
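The item model described above can be sketched as a small data structure. The following Python example is only an illustration of the stated rules (an item holds several files plus metadata, multiple playable files yield a playlist, multiple pictures yield a slide show, and a video or picture yields a thumbnail). The class Item, the functions viewer_for and has_thumbnail, and the file-extension sets are all hypothetical and are not taken from the Archive's own software.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Hypothetical extension sets, chosen only for this illustration.
VIDEO = {".mp4", ".ogv", ".avi"}
AUDIO = {".mp3", ".flac", ".ogg"}
IMAGE = {".jpg", ".jpeg", ".png", ".gif"}


@dataclass
class Item:
    """A directory-like container: several files plus searchable metadata."""
    identifier: str
    files: list[str] = field(default_factory=list)
    metadata: dict[str, str] = field(default_factory=dict)


def _suffix(name: str) -> str:
    return PurePosixPath(name).suffix.lower()


def viewer_for(item: Item) -> str:
    """Pick a viewer: several playable files give a playlist, several pictures a slide show."""
    playable = [f for f in item.files if _suffix(f) in VIDEO | AUDIO]
    pictures = [f for f in item.files if _suffix(f) in IMAGE]
    if len(playable) > 1:
        return "playlist"
    if len(pictures) > 1:
        return "slideshow"
    return "single"  # simplification: anything else is treated as a single file


def has_thumbnail(item: Item) -> bool:
    """A preview thumbnail is possible if the item has at least one video or picture."""
    return any(_suffix(f) in VIDEO | IMAGE for f in item.files)


if __name__ == "__main__":
    album = Item(
        identifier="demo-album",  # hypothetical identifier
        files=["01-intro.mp3", "02-song.mp3", "cover.jpg"],
        metadata={"description": "Example album", "tags": "demo;music"},
    )
    print(viewer_for(album), has_thumbnail(album))
```

In the example, the album with two audio files and a cover picture gets a playlist, and the cover picture makes a thumbnail possible, which mirrors the mixed-data case mentioned in the text.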

Book collections


Text collection

Internet Archive "Scribe" book scanning workstation
An Internet Archive in-house scan ongoing

The scanning performed by the Internet Archive is financially supported by libraries and foundations.[100] As of November 2008, when there were approximately 1 million texts, the entire collection was greater than 500 terabytes, which included raw camera images, cropped and skewed images, PDFs, and raw OCR data.[101]

As of July 2013, the Internet Archive was operating 33 scanning centers in five countries, digitizing about 1,000 books a day for a total of more than 2 million books, in a total collection of 4.4 million books – including material digitized by others and fed into the Internet Archive; at that time, users were performing more than 15 million downloads per month.[102]

The material digitized by others includes more than 300,000 books that were contributed to the collection, between about 2006 and 2008, by Microsoft through its Live Search Books project, which also included financial support and scanning equipment directly donated to the Internet Archive.[103] On May 23, 2008, Microsoft announced it would be ending its Live Search Books project and would no longer be scanning books, donating its remaining scanning equipment to its former partners.[103]

Around October 2007, Archive users began uploading public domain books from Google Book Search.[104] As of November 2013, there were more than 900,000 Google-digitized books in the Archive's collection;[105] the books are identical to the copies found on Google, except without the Google watermarks, and are available for unrestricted use and download.[a] Brewster Kahle revealed in 2013 that this archival effort was coordinated by Aaron Swartz, who, with a "bunch of friends", downloaded the public domain books from Google slowly enough and from enough computers to stay within Google's restrictions. They did this to ensure public access to the public domain. The Archive ensured the items were attributed and linked back to Google, which never complained, while libraries "grumbled". According to Kahle, this is an example of Swartz's "genius" to work on what could give the most to the public good for millions of people.[106]

In addition to books, the Archive offers free and anonymous public access to more than four million court opinions, legal briefs, and exhibits uploaded from the United States Federal Courts' PACER electronic document system via the RECAP web browser plugin. These documents had been kept behind a federal court paywall. On the Archive, they had been accessed by more than six million people by 2013.[106]

The Archive's BookReader web app,[107] built into its website, has features such as single-page, two-page, and thumbnail modes; fullscreen mode; page zooming of high-resolution images; and flip page animation.[107][108]

In October 2024, the Internet Archive agreed to accept the paper copies of 400,000 uncatalogued dissertations from the Leiden University Library, from the period 1851–2004, that the library wanted to dispose of. The university had received them from foreign universities as part of a dissertation exchange program that had begun with its foundation in 1575 and continued for nearly 430 years. The Archive plans to digitize them and make them accessible online. The original full collection included theses by Niels Bohr, Marie Curie, Émile Durkheim, Albert Einstein, Otto Hahn, Carl Jung, J. Robert Oppenheimer, Max Planck, Luigi Pirandello, Gustav Stresemann, and Max Weber.[109]

Open Library


The Open Library is another project of the Internet Archive. The project seeks to include a web page for every book ever published: it holds 25 million catalog records of editions. It also seeks to be a web-accessible public library: it contains the full texts of approximately 1,600,000 public domain books (out of the more than five million from the main texts collection), as well as in-print and in-copyright books,[110] many of which are fully readable, downloadable[111][112] and full-text searchable;[113] it offers a two-week loan of e-books in its controlled digital lending program for over 647,784 books not in the public domain, in partnership with over 1,000 library partners from six countries[102][114] after a free registration on the web site. Open Library is a free and open-source software project, with its source code freely available on GitHub.

The Open Library faces objections from some authors and the Society of Authors, who hold that the project is distributing books without authorization and is thus in violation of copyright laws,[115] and four major publishers initiated a copyright infringement lawsuit against the Internet Archive in June 2020 to stop the Open Library project.[116]

Digitizing sponsors for books


Many large institutional sponsors have helped the Internet Archive provide millions of scanned publications (text items).[117] Some sponsors that have digitized large quantities of texts include the University of Toronto's Robarts Library, University of Alberta Libraries, University of Ottawa, Library of Congress, Boston Library Consortium member libraries, Boston Public Library, Princeton Theological Seminary Library, and many others.[118]

In 2017, the MIT Press authorized the Internet Archive to digitize and lend books from the press's backlist,[119] with financial support from the Arcadia Fund.[120][121] A year later, the Internet Archive received further funding from the Arcadia Fund to invite some other university presses to partner with the Internet Archive to digitize books, a project called "Unlocking University Press Books".[122][123]

The Library of Congress created numerous Handle System identifiers that pointed to free digitized books in the Internet Archive.[124] The Internet Archive and Open Library are listed on the Library of Congress website as a source of e-books.[125]

Media collections

Media reader
Microfilms at the Internet Archive
Videocassettes at the Internet Archive

In addition to web archives, the Internet Archive maintains extensive collections of digital media that are attested by the uploader to be in the public domain in the United States or licensed under a license that allows redistribution, such as Creative Commons licenses.[citation needed] Media are organized into collections by media type (moving images, audio, text, etc.), and into sub-collections by various criteria. Each of the main collections includes a "Community" sub-collection (formerly named "Open Source") where general contributions by the public are stored.[citation needed]

Audio


Audio Archive


The Audio Archive includes music, audiobooks, news broadcasts, old time radio shows, podcasts, and a wide variety of other audio files. As of January 2023, there are more than 15,000,000 free digital recordings in the collection. The subcollections include audio books and poetry, podcasts, non-English audio, and many others.[126] The sound collections are curated by B. George, director of the ARChive of Contemporary Music.[127]

Digital Library of Amateur Radio and Communications


The Digital Library of Amateur Radio and Communications is a project to preserve recordings of amateur radio transmissions, with funding from the Amateur Radio Digital Communications foundation.[128][129]

Live Music Archive


The Live Music Archive sub-collection includes more than 170,000 concert recordings from independent musicians, as well as more established artists and musical ensembles with permissive rules about recording their concerts, such as the Grateful Dead, and more recently, The Smashing Pumpkins. Also, Jordan Zevon has allowed the Internet Archive to host a definitive collection of his father Warren Zevon's concert recordings. The Zevon collection ranges from 1976 to 2001 and contains 126 concerts including 1,137 songs.[130]

The Great 78 Project


Launched in 2019, The Great 78 Project aims to digitize 250,000 78 rpm singles (500,000 songs) from the period between 1880 and 1960, donated by various collectors and institutions. It has been developed in collaboration with the ARChive of Contemporary Music and with George Blood Audio, which is responsible for the audio digitization.[127]

Netlabels


The Archive has a collection of freely distributable music that is streamed and available for download via its Netlabels service. The music in this collection generally comes from the Creative Commons-licensed catalogs of virtual record labels.[131][132]

Images collection


This collection contains more than 3.5 million items.[133] Cover Art Archive, Metropolitan Museum of Art – Gallery Images, NASA Images, Occupy Wall Street Flickr Archive, and USGS Maps are among the sub-collections of the Images collection.[citation needed]

Cover Art Archive

Logo of Cover Art Archive

The Cover Art Archive is a joint project between the Internet Archive and MusicBrainz, whose goal is to make cover art images available on the Internet. As of April 2021, this collection contains more than 1,400,000 items.[134]

Metropolitan Museum of Art images


The images in this collection are from the Metropolitan Museum of Art. The collection contains more than 140,000 items.[135]

NASA Images


The NASA Images archive was created through a Space Act Agreement between the Internet Archive and NASA to bring public access to NASA's image, video, and audio collections in a single, searchable resource. The Internet Archive NASA Images team worked closely with all of the NASA centers to keep adding to the ever-growing collection.[136] The nasaimages.org site launched in July 2008 and had more than 100,000 items online at the end of its hosting in 2012.

Occupy Wall Street Flickr archive


This collection contains more than 15,000 Creative Commons-licensed photographs from Flickr related to the Occupy Wall Street movement.[137]

USGS Maps


This collection contains more than 59,000 items from the Libre Map Project.[138]

Machinima Archive


One of the sub-collections of the Internet Archive's Video Archive is the Machinima Archive. This small section hosts many Machinima videos. Machinima is a digital artform in which computer games, game engines, or software engines are used in a sandbox-like mode to create motion pictures, recreate plays, or even publish presentations or keynotes. The archive collects a range of Machinima films from internet publishers such as Rooster Teeth and Machinima.com as well as independent producers. The sub-collection is a collaborative effort among the Internet Archive, the How They Got Game research project at Stanford University, the Academy of Machinima Arts and Sciences, and Machinima.com.[139]

Microfilm collection


This collection contains approximately 160,000 microfilmed items from a variety of libraries including the University of Chicago Libraries, University of Illinois at Urbana-Champaign, University of Alberta, Allen County Public Library, and National Technical Information Service.[140][141]

Moving image collection


The Internet Archive holds a collection of approximately 3,863 feature films.[142] Additionally, the Internet Archive's Moving Image collection includes: newsreels, classic cartoons, pro- and anti-war propaganda, The Video Cellar Collection, Skip Elsheimer's "A.V. Geeks" collection, early television, and ephemeral material from Prelinger Archives, such as advertising, educational, and industrial films, as well as amateur and home movie collections.[citation needed]

Subcategories of this collection include:

  • IA's Brick Films collection, which contains stop-motion animation filmed with Lego bricks, some of which are "remakes" of feature films.[citation needed]
  • IA's Election 2004 collection, a non-partisan public resource for sharing video materials related to the 2004 United States presidential election.[citation needed]
  • IA's FedFlix collection, a joint venture (NTIS-1832) between the National Technical Information Service and Public.Resource.Org that features "the best movies of the United States Government, from training films to history, from our national parks to the U.S. Fire Academy and the Postal Inspectors"[143]
  • IA's Independent News collection, which includes sub-collections such as the Internet Archive's World At War competition from 2001, in which contestants created short films demonstrating "why access to history matters". Among its most-downloaded video files are eyewitness recordings of the devastating 2004 Indian Ocean earthquake.[citation needed]
  • IA's September 11 Television Archive, which contains archival footage from the world's major television networks of the terrorist attacks of September 11, 2001, as they unfolded on live television.[144]

Open Educational Resources


Open Educational Resources is a digital collection at archive.org. This collection contains hundreds of free courses, video lectures, and supplemental materials from universities in the United States and China. The contributors of this collection are ArsDigita University, Hewlett Foundation, MIT, Monterey Institute, and Naropa University.[145]

TV News Search & Borrow

TV tuners at the Internet Archive

In September 2012, the Internet Archive launched the TV News Search & Borrow service for searching U.S. national news programs.[146] The service is built on closed captioning transcripts and allows users to search and stream 30-second video clips. Upon launch, the service contained "350,000 news programs collected over 3 years from national U.S. networks and stations in San Francisco and Washington D.C."[147] According to Kahle, the service was inspired by the Vanderbilt Television News Archive, a similar library of televised network news programs.[148] In contrast to Vanderbilt, which limits access to streaming video to individuals associated with subscribing colleges and universities, the TV News Search & Borrow allows open access to its streaming video clips. In 2013, the Archive received an additional donation of "approximately 40,000 well-organized tapes" from the estate of a Philadelphia woman, Marion Stokes. Stokes "had recorded more than 35 years of TV news in Philadelphia and Boston with her VHS and Betamax machines."[149]

Miscellaneous collections


The Brooklyn Museum collection contains approximately 3,000 items from the Brooklyn Museum.[150] In December 2020, the film research library of Lillian Michelson was donated to the Archive.[151]

Other services and endeavors


Physical media

A vintage wall intercom, an example of another "archived" item

Voicing a strong reaction to the idea of books simply being thrown away, and inspired by the Svalbard Global Seed Vault, Kahle now envisions collecting one copy of every book ever published. "We're not going to get there, but that's our goal", he said. Alongside the books, Kahle plans to store the Internet Archive's old servers, which were replaced in 2010.[152]

Vault


Vault is a digital repository and preservation service provided by Internet Archive to institutions that need to preserve digital collections. Data stored in Vault is kept in at least 2 different Internet Archive datacenters, with at least 2 copies in each of them. Access control, fire protection and monitoring systems are used to protect all content stored in Vault.[77]

Software


The Internet Archive has "the largest collection of historical software online in the world", spanning 50 years of computer history in terabytes of computer magazines and journals, books, shareware discs, FTP sites, video games, etc. The Internet Archive has created an archive of what it describes as "vintage software", as a way to preserve it.[153] The project advocated an exemption from the United States Digital Millennium Copyright Act to permit it to bypass copy protection, which the United States Copyright Office approved in 2003 for a period of three years.[154] The Archive does not offer the software for download, as the exemption is solely "for the purpose of preservation or archival reproduction of published digital works by a library or archive."[155] The Library of Congress renewed the exemption in 2006, and in 2009 indefinitely extended it pending further rulemakings.[156] The Library reiterated the exemption as a "Final Rule" with no expiration date in 2010.[157] In 2013, the Internet Archive began to make select video games playable in the browser via MESS, for instance the Atari 2600 game E.T. the Extra-Terrestrial.[158] Since December 23, 2014, the Internet Archive has presented, via a browser-based DOSBox emulation, thousands of DOS/PC games[159][160][161][162] for "scholarship and research purposes only".[163][164][165] In November 2020, the Archive introduced a new emulator for Adobe Flash called Ruffle, and began archiving Flash animations and games ahead of the December 31, 2020, end-of-life for the Flash plugin across all computer systems.[166]

Table Top Scribe System


A combined hardware and software system has been developed that provides a safe method of digitizing content.[167][168]

Credit Union


From 2012 to November 2015, the Internet Archive operated the Internet Archive Federal Credit Union, a federal credit union based in New Brunswick, New Jersey, with the goal of providing access to low- and middle-income people. Throughout its short existence, the IAFCU experienced significant conflicts with the National Credit Union Administration, which severely limited the IAFCU's loan portfolio and raised concerns over its serving Bitcoin firms. At the time of its dissolution, it consisted of 395 members and was worth $2.5 million.[169][170]

Decentralization


Since 2019,[171] the Internet Archive has organized an event called Decentralized Web Camp (DWeb Camp). It is an annual camp that brings together a diverse global community of contributors in a natural setting. The camp aims to tackle real-world challenges facing the web and co-create decentralized technologies for a better internet. It aims to foster collaboration, learning, and fun while promoting principles of trust, human agency, mutual respect, and ecological awareness.[172]

Wayforward Machine

Screenshot of viewing English Wikipedia on the Wayforward Machine

On September 30, 2021, as a part of its 25th anniversary celebration, the Internet Archive launched the "Wayforward Machine", a satirical, fictional website covered with pop-ups asking for personal information. The site was intended to depict a fictional dystopian timeline of events leading to such a future, such as the repeal of Section 230 of the Communications Decency Act in 2022 and the introduction of advertising implants in 2041.[173][174]

Ceramic archivists collection

Ceramic figures of Internet Archive employees

The Great Room of the Internet Archive features a collection of more than 100 ceramic figures representing employees of the Internet Archive, with the 100th statue immortalizing Aaron Swartz. This collection, inspired by the statues of the Xian warriors in China, was commissioned by Brewster Kahle, sculpted by Nuala Creed, and as of 2014, is ongoing.[175]

Artists in residence


The Internet Archive visual arts residency,[176] organized by Amir Saber Esfahani, is designed to connect emerging and mid-career artists with the Archive's millions of collections and to show what is possible when open access to information intersects with the arts. During this one-year residency, selected artists develop a body of work that responds to and utilizes the Archive's collections in their own practice.[177]

The main hall of the current headquarters

Opposition to national security letters, bills and settlements


A national security letter issued to the Internet Archive demanding information about a user

On May 8, 2008, it was revealed that the Internet Archive had successfully challenged an FBI national security letter asking for logs on an undisclosed user.[183][184]

On November 28, 2016, it was revealed that a second FBI national security letter asking for logs on another undisclosed user was successfully challenged.[185]

The Internet Archive blacked out its website for 12 hours on January 18, 2012, in protest of the Stop Online Piracy Act and the PROTECT IP Act bills, two pieces of legislation in the United States Congress that they argued would "negatively affect the ecosystem of web publishing that led to the emergence of the Internet Archive". This occurred in conjunction with the English Wikipedia blackout, as well as numerous other protests across the Internet.[186]

The Internet Archive is a member of the Open Book Alliance, which has been among the most outspoken critics of the Google Book Settlement. The Archive advocates an alternative digital library project.[187]

Hosting of disputed media


On October 9, 2016, the Internet Archive was temporarily blocked in Turkey after it was used (amongst other file hosting services) by hackers to host 17 GB of leaked government emails.[188][189]

Because the Internet Archive only lightly moderates uploads, it includes resources that may be valued by extremists and the site may be used by them to evade block listing. In February 2018, the Counter Extremism Project said that the Archive hosted terrorist videos, including the beheading of Alan Henning, and had declined to respond to requests about the videos.[190] In May 2018, a report published by the cyber-security firm Flashpoint stated that the Islamic State was using the Internet Archive to share its propaganda.[191] Chris Butler, from the Internet Archive, responded that they regularly spoke to the US and EU governments about sharing information on terrorism.[191] In April 2019, Europol, acting on a referral from French police, asked the Internet Archive to remove 550 sites of "terrorist propaganda".[192] The Archive rejected the request, saying that the reports were wrong about the content they pointed to, or were too broad for the organization to comply with.[192] On July 14, 2021, the Internet Archive held a joint "Referral Action Day" with Europol to target terrorist videos.[193]

A 2021 article said that jihadists regularly used the Internet Archive for "dead drops" of terrorist videos.[194] In January 2022, a former UCLA lecturer's 800-page manifesto, containing racist ideas and threats against UCLA staff, was uploaded to the Internet Archive.[195] The manifesto was removed by the Internet Archive after a week, amidst discussion about whether such documents should be preserved by archivists or not.[195] Another 2022 paper found "an alarming volume of terrorist, extremist, and racist material on the Internet Archive".[196] A 2023 paper reported that Neo-Nazis collect links to online, publicly available resources to be shared with new recruits. As the Internet Archive hosts uploaded texts that are not allowed on other websites, Nazi and neo-Nazi books in the Archive (e.g., The Turner Diaries) frequently appear on these lists. These lists also feature older, public domain material created when white supremacist views were more mainstream.[197]

2020 National Emergency Library


In the midst of the COVID-19 pandemic, which closed many schools, universities, and libraries, the Archive announced on March 24, 2020, that it was creating the National Emergency Library by removing the lending restrictions it had in place for 1.4 million digitized books in its Open Library, while still limiting the number of books each user could check out and enforcing their return. Normally, the site would allow only one digital loan for each physical copy of a book it held, by use of an encrypted file that would become unusable after the lending period ended.[6] The Library would remain in this form until at least June 30, 2020, or until the US national emergency was over, whichever came later.[198] At launch, the Internet Archive allowed authors and rightsholders to submit opt-out requests for their works to be omitted from the National Emergency Library.[199][200][201]
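The difference between ordinary controlled digital lending and the National Emergency Library can be pictured with a small model. This is only an illustrative sketch: the class `LendableTitle`, its methods, and the `waitlist_suspended` flag are hypothetical names, not the Archive's software, and the per-user checkout limit and loan expiry mentioned above are left out.

```python
from collections import deque

class LendableTitle:
    """Hypothetical model of one title; not the Archive's actual system."""

    def __init__(self, copies_owned, waitlist_suspended=False):
        self.copies_owned = copies_owned              # physical copies held
        self.waitlist_suspended = waitlist_suspended  # emergency-library mode
        self.on_loan = set()                          # patrons with an active loan
        self.waitlist = deque()                       # patrons waiting, in order

    def borrow(self, patron):
        """Return True if the patron holds a loan after this call."""
        if patron in self.on_loan:
            return True
        # Normal rule: one digital loan per physical copy owned.
        if self.waitlist_suspended or len(self.on_loan) < self.copies_owned:
            self.on_loan.add(patron)
            return True
        if patron not in self.waitlist:
            self.waitlist.append(patron)
        return False

    def give_back(self, patron):
        """End a loan and pass the freed copy to the next waiting patron."""
        self.on_loan.discard(patron)
        if self.waitlist and len(self.on_loan) < self.copies_owned:
            self.on_loan.add(self.waitlist.popleft())

normal = LendableTitle(copies_owned=1)
normal.borrow("alice")   # gets the only copy
normal.borrow("bob")     # joins the waitlist
normal.give_back("alice")  # bob now holds the loan

emergency = LendableTitle(copies_owned=1, waitlist_suspended=True)
emergency.borrow("alice")
emergency.borrow("bob")  # both hold loans at once
```

With the flag off, loans never exceed the copies owned; with it on, the cap is simply skipped, which is the change the publishers later challenged.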

The Internet Archive said the National Emergency Library addressed an "unprecedented global and immediate need for access to reading and research material" due to the closures of physical libraries worldwide.[202] They justified the move in a number of ways. Legally, they said they were promoting access to those inaccessible resources, which they claimed was an exercise in fair use principles. The Archive continued implementing their controlled digital lending policy that predated the National Emergency Library, meaning they still encrypted the lent copies and it was no easier for users to create new copies of the books than before. An ultimate determination of whether or not the National Emergency Library constituted fair use could only be made by a court. Morally, they also pointed out that the Internet Archive was a registered library like any other, that they either paid for the books themselves or received them as donations, and that lending through libraries predated copyright restrictions.[199][203]

The Archive had already been criticized by authors and publishers for its prior lending approach, and upon announcement of the National Emergency Library, authors, publishers, and groups representing both took further issue with the Archive and its Open Library project, equating the move to copyright infringement and digital piracy and accusing the Archive of using the COVID-19 pandemic as a reason to push the boundaries of copyright.[201][204][205][206] After the works of some of these authors were ridiculed in responses, the Internet Archive's Jason Scott requested that supporters of the National Emergency Library not denigrate anyone's books: "I realize there's strong debate and disagreement here, but books are life-giving and life-changing and these writers made them."[207]

Access blocking in Indonesia


On 27 May 2025, the Ministry of Communication and Digital Affairs of Indonesia (abbreviated as Komdigi) blocked access to the Internet Archive in Indonesia.[208] Alexander Sabar, the Director General of the Supervision of Digital Space, stated that the reason was the presence of pornography and online gambling on the site. He denied a rumor that there was motive to rewrite or hide history. He also acknowledged the importance of the Internet Archive, and claimed that the blocking was temporary and would be rescinded if they removed the offending content, and that they only blocked it after the Internet Archive did not respond to their requests.[209][210]


In November 2005, free downloads of Grateful Dead concerts were removed from the site, following what seemed to be disagreements between some of the former band members. John Perry Barlow identified Bob Weir, Mickey Hart, and Bill Kreutzmann as the instigators of the change, according to an article in The New York Times.[211] Phil Lesh, a founding member of the band, commented on the change in a November 30, 2005, posting to his personal web site:

It was brought to my attention that all of the Grateful Dead shows were taken down from Archive.org right before Thanksgiving. I was not part of this decision making process and was not notified that the shows were to be pulled. I do feel that the music is the Grateful Dead's legacy and I hope that one way or another all of it is available for those who want it.[212]

A November 30 forum post from Brewster Kahle summarized what appeared to be the compromise reached among the band members. Audience recordings could be downloaded or streamed, but soundboard recordings were to be available for streaming only. Concerts have since been re-added.[213]

In February 2016, Internet Archive users had begun archiving digital copies of Nintendo Power, Nintendo's official magazine for their games and products, which ran from 1988 to 2012. The first 140 issues had been collected, before Nintendo had the archive removed on August 8, 2016. In response to the take-down, Nintendo told gaming website Polygon, "[Nintendo] must protect our own characters, trademarks and other content. The unapproved use of Nintendo's intellectual property can weaken our ability to protect and preserve it, or to possibly use it for new projects".[214]

In August 2017, the Department of Telecommunications of the Government of India blocked the Internet Archive along with other file-sharing websites, in accordance with two court orders issued by the Madras High Court,[215] citing piracy concerns after copies of two Bollywood films were allegedly shared via the service.[216] The HTTP version of the Archive was blocked but it remained accessible using the HTTPS protocol.[215]

In 2023, the Internet Archive became a popular site for Indians to watch the first episode of India: The Modi Question,[217] a BBC documentary released on January 17 and banned in India by January 20.[218][219] The video was reported to have been removed by the Archive on January 23.[217] The Internet Archive then stated, on January 27, that they had removed the video in response to a BBC request under the Digital Millennium Copyright Act.[220]

Book publishers' lawsuit


The operation of the National Emergency Library was part of a lawsuit filed against the Internet Archive by four major book publishers—Hachette, HarperCollins, John Wiley & Sons, and Penguin Random House—in June 2020, challenging the copyright validity of the controlled digital lending program.[6][116][221] In response, the Internet Archive closed the National Emergency Library on June 16, 2020, rather than the planned June 30, 2020, due to the lawsuit.[222][223] The plaintiffs, supported by the Copyright Alliance,[224] claimed in their lawsuit that the Internet Archive's actions constituted a "willful mass copyright infringement."[225]

Judge Koeltl ruled on March 24, 2023, against Internet Archive in the case, saying the National Emergency Library concept was not fair use, so the Archive infringed their copyrights by lending out the books without the waitlist restriction. An agreement was then reached for the Internet Archive to pay an undisclosed amount to the publishers.[226] The Internet Archive appealed the ruling.[227][228] On September 4, 2024, the U.S. Court of Appeals for the Second Circuit upheld the district court's ruling, calling the Internet Archive's argument that they were shielded by fair use doctrine "unpersuasive".[229]

Music publishers' lawsuit


In August 2023, the music industry corporations Universal Music Group (UMG), Sony Music and Concord sued the Internet Archive over its Great 78 Project, asserting the project was engaged in copyright infringement. The Great 78 Project stores digitized versions of pre-1972 songs and albums from 78 rpm phonograph records, for "the preservation, research and discovery of 78rpm records." The project had started in 2016, when pre-1972 recordings had not been protected by copyright; in 2018, the U.S. Congress passed the Music Modernization Act (MMA) which enabled legal remedies for unauthorized use of pre-1972 recordings until 2067, thus effectively covering them with copyright.[230]

UMG and Sony had been the two largest companies in this sector for more than a decade, with respective market shares of 31.8% and 22.1% in 2023.[231] Concord was a rapidly expanding music business closely partnered with UMG since its transformation into Concord Music Group in 2004[232] and backed since at least 2000 by J.P. Morgan.[233] It was the first music company to perform an asset-backed securitization, led by Apollo Global Management, in December 2022. Its assets consisted of over 1 million copyrights to music older than 18 months.[234][235] According to its CEO Bob Valentine, Concord derived about 85% of its revenue "from catalog, rather than newly developed, music". As Valentine stated in his first interview, "The phenomenon of artists' IP has never been more liquid; it is now a real and proven asset class. Investment bankers are focused on it, financiers are financing it, and then there's entities like us, that know how to buy rights, but also know how to manage them and have the relationships to do so."[232] The share of catalog music in total album equivalent consumption in the United States rose from 62.8% to 72.6% between 2019 and 2023.[236]

The publishers are seeking statutory damages for 4,142 songs named in the suit, with a maximum possible fine of $621 million.[40] The Internet Archive has argued that, given the primitive sound quality of the original recordings, digitizing them for preservation falls within the doctrine of "fair use", that the number of downloads is so small it has almost no impact on the publishers' revenue, and that over 95% of the collection is not readily available anywhere else.[40] The plaintiffs said in response, "if ever there were a theory of fair use invented for litigation, this is it."[237] According to a legal source at Mayer Brown, the music publishers' case could be challenged as unconstitutional, since the granting of copyright to pre-1972 works in the MMA only benefitted record companies without having a systemic effect.[230]
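The $621 million ceiling matches the US statutory maximum of $150,000 per work for willful infringement (17 U.S.C. § 504(c)) multiplied by the number of recordings in the suit. The source does not spell out this calculation, so treat the per-work figure as an assumption about how the total was derived. The function name is illustrative.

```python
# Assumption: the ceiling is the statutory maximum for willful infringement,
# $150,000 per work under 17 U.S.C. § 504(c), applied to every recording.
WORKS_IN_SUIT = 4_142
MAX_STATUTORY_PER_WORK = 150_000  # US dollars

def max_statutory_damages(works, per_work=MAX_STATUTORY_PER_WORK):
    """Upper bound if every named work received the maximum award."""
    return works * per_work

total = max_statutory_damages(WORKS_IN_SUIT)
print(f"${total:,}")  # $621,300,000
```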

Both parties submitted requests to drop the case on September 15, 2025, after reaching undisclosed settlement terms.[238]

from Grokipedia

The Internet Archive is a 501(c)(3) non-profit organization founded in 1996 by computer engineer Brewster Kahle with the mission of providing universal access to all knowledge through the preservation and free distribution of digital content.
It operates the Wayback Machine, a service that captures historical snapshots of websites, having preserved over 1 trillion web pages by October 2025, alongside extensive collections of digitized books, audio recordings, videos, software, and television broadcasts stored across more than 99 petabytes of data in redundant facilities.
The organization scans approximately 4,400 books daily, partners with over 1,250 institutions via Archive-It for curated web collections, and offers controlled digital lending through Open Library, serving millions of users worldwide and ranking among the top 300 most-visited websites.
Notable achievements include archiving television news since 2000, including pivotal events like the September 11 attacks, and maintaining a congressional designation as a U.S. documents depository, while emphasizing user privacy by avoiding logging.
However, the Internet Archive has encountered major controversies, particularly over copyright infringement claims; in 2023, a federal court ruled its National Emergency Library and controlled digital lending of scanned books violated publishers' rights, a decision affirmed on appeal in 2024 without Supreme Court review, leading to the removal of roughly 500,000 titles.
Additional lawsuits from record labels over digitized historical audio collections, seeking hundreds of millions in damages, culminated in a September 2025 settlement requiring further content restrictions.

History

Founding and Initial Projects (1996–2005)

The Internet Archive was established in 1996 as a 501(c)(3) non-profit organization by Brewster Kahle to systematically preserve digital cultural artifacts, with an initial emphasis on archiving the rapidly evolving World Wide Web, which lacked comprehensive preservation efforts at the time. Kahle, a computer engineer and digital librarian previously involved in projects like Wide Area Information Servers, recognized the ephemerality of online content and sought to create a digital library mirroring the scope of physical institutions like the Library of Congress. In April 1996, Kahle and Bruce Gilliat co-founded Alexa Internet, a web crawling service that collected data on internet usage and donated its crawl archives to the Internet Archive, enabling the initial accumulation of web snapshots starting that year. These early crawls formed the foundation of the web archive, capturing pages without sophisticated tools but prioritizing comprehensive coverage over perfection.

The Wayback Machine, the public interface for accessing these archived web pages, was launched in October 2001, allowing users to view historical versions of websites dating back to 1996 by entering URLs and selecting dates. By its debut, the system had indexed billions of pages, though access was initially limited to non-commercial research use to manage server loads and respect site owners' preferences.

During this period, the Archive expanded beyond web content. In 2000, it initiated television archiving by capturing broadcast signals, with the first public release in 2001 focusing on news coverage of the September 11 attacks. In 2005, the organization began digitizing books through scanning partnerships, marking the start of efforts to preserve print media in digital form for broader accessibility. These projects reflected Kahle's vision of universal access to knowledge while navigating technical constraints and the absence of standardized protocols.

Growth and Expansion (2006–2019)

In 2006, the Internet Archive launched Archive-It, a subscription service enabling libraries, museums, and other institutions to create and manage their own web archives, starting with 18 inaugural partners. By 2016, Archive-It had expanded to over 450 partners and facilitated the capture of 17 billion URLs, supporting targeted archiving of historical events and organizational records. Concurrently, the organization initiated large-scale book digitization efforts, establishing scanning centers worldwide to convert physical volumes into digital formats. The Open Library project, announced on July 16, 2007, aimed to create a comprehensive web-based catalog of books with lending capabilities, building on the growing digital book collection. By 2010, the Internet Archive made one million digitized books available specifically for users with print disabilities, emphasizing accessibility in its expansion. Digitization operations scaled significantly, reaching capacities that supported the addition of millions of volumes to accessible repositories by the mid-2010s.

In 2009, the TV News Archive was established, capturing and preserving broadcasts from major U.S. networks to enable searchable access to historical footage via captions. This initiative expanded in 2012 with the launch of TV News Search & Borrow, providing public tools to query over 350,000 broadcasts and borrow segments for research. Storage growth paralleled these projects; by October 2012, the Archive had stored 10 petabytes of cultural materials, reflecting investments in scalable storage solutions like custom server racks. Further diversification occurred in 2013 with the introduction of the Historical Software Archive, preserving vintage computer programs and emulations to safeguard digital heritage.

By 2019, the organization's collections encompassed hundreds of petabytes across web snapshots, books, audio, video, and software, supported by over 1,250 institutional partners via Archive-It and global digitization sites scanning thousands of items daily. This period marked a shift from web-focused archiving to a multifaceted digital library, driven by technological advancements and collaborative efforts.

Challenges and Milestones (2020–2025)

In March 2020, amid the COVID-19 pandemic, the Internet Archive launched the National Emergency Library, temporarily suspending waitlists for over 1.4 million e-books to facilitate remote access, arguing it mirrored physical library lending under controlled digital lending principles. Publishers including Hachette Book Group, HarperCollins, Penguin Random House, and John Wiley & Sons filed a lawsuit on June 1, 2020, in the U.S. District Court for the Southern District of New York, alleging the program constituted willful mass copyright infringement by enabling simultaneous digital access beyond owned copies. The library ended the initiative two weeks early on June 16, 2020, reverting to traditional one-user-at-a-time lending.

The broader lawsuit challenged the Internet Archive's controlled digital lending of scanned books, with the district court ruling on March 24, 2023, that it did not qualify as fair use, as the reproductions served as market substitutes harming publishers' licensing revenues rather than transformative preservation. The U.S. Court of Appeals for the Second Circuit affirmed this on September 4, 2024, holding that the digital copies were not reasonably necessary for the purposes the Archive cited and competed directly with authorized e-book sales. On December 4, 2024, the Internet Archive opted against Supreme Court review, agreeing to remove approximately 500,000 titles and limit access, marking a significant curtailment of its lending program and raising ongoing questions about public access versus copyright enforcement.

October 2024 brought severe operational disruptions from cyberattacks, beginning with a DDoS attack on October 9 that knocked services offline for hours, followed by a data breach exposing a database of 31 million user emails, usernames, and salted-encrypted passwords. Additional incidents included website defacement via a compromised JavaScript library and a third breach on October 20, prompting read-only mode for the Wayback Machine by October 13 and partial restoration by October 21. These events exposed vulnerabilities in the organization's infrastructure, with no attributed perpetrators but highlighting risks to irreplaceable digital collections.

Amid these setbacks, the Internet Archive achieved a major preservation milestone in October 2025, surpassing 1 trillion web pages archived in the Wayback Machine, encompassing over 100 petabytes of data captured since 1996 and underscoring its role in safeguarding web history despite legal and technical hurdles. This benchmark, celebrated with calls for libraries to recognize web memory's importance, reflects sustained crawling efforts even as access models faced constraints from litigation.

Cyberattacks and Security Breaches

In October 2024, the Internet Archive experienced a series of cyberattacks, including distributed denial-of-service (DDoS) attacks and a significant data breach. The initial DDoS assault began on October 8, 2024, and was claimed by a hacking group, rendering services such as Archive.org and OpenLibrary.org inaccessible for several hours. This attack peaked with sustained traffic volumes that overwhelmed the organization's infrastructure, leading to downtime exceeding three hours on October 9.

Concurrently, on October 9, 2024, a data breach compromised the organization's user authentication database, exposing approximately 31 million records including email addresses, usernames, and salted, encrypted passwords. The breach also involved website defacement through injection into a JavaScript library, though the organization stated that the DDoS and the breach were not believed to be connected. In response, the Internet Archive took sites offline for security assessments, restoring the Wayback Machine in read-only mode by October 13, 2024, while full functionality was gradually reinstated.

Further incidents followed, with a third security breach confirmed on October 20, 2024, amid escalating threats that included additional DDoS waves and exploitation of third-party services to send emails to patrons. By 2024, the organization reported recurring DDoS attacks, prompting adaptations such as enhanced defenses against a more hostile cyber environment. No prior cyberattacks on the Internet Archive on the scale of these 2024 events were publicly documented, highlighting vulnerabilities in its nonprofit operations.

Organizational Structure

Leadership and Governance

The Internet Archive operates as a 501(c)(3) non-profit organization, founded in 1996 by Brewster Kahle, who serves as its Digital Librarian and Chairman of the Board. Kahle, a computer engineer and entrepreneur previously involved in developing the Wide Area Information Servers (WAIS) protocol, established the entity to create a digital library preserving cultural artifacts and providing "universal access to all knowledge." Governance is provided by a board of directors, which oversees strategic direction, financial accountability, and compliance with nonprofit regulations. As of September 2025, the board includes Kahle as chair, alongside David Rumsey, a cartographer and major donor of historical maps to the Archive's collections, and Kathleen Burch, a philanthropist and co-founder of the Wellspring Foundation focused on education and community initiatives. The board's composition emphasizes individuals with expertise in digital preservation, philanthropy, and archival domains, reflecting the organization's mission-driven priorities over commercial interests.

Day-to-day leadership falls under Kahle, who directs core operations including web archiving via the Wayback Machine and expansion of digitized collections. Specialized directors, such as those for open libraries and programs, report into this structure, supporting initiatives like controlled digital lending amid ongoing legal challenges from publishers alleging copyright infringement. The nonprofit status ensures decisions prioritize public access over profit, though critics have questioned governance transparency during lawsuits, such as Hachette v. Internet Archive, where board oversight of lending practices came under scrutiny without evidence of malfeasance.

Funding Sources and Financial Sustainability

The Internet Archive, a 501(c)(3) non-profit organization, derives its funding primarily from contributions including individual donations and foundation grants, as well as revenue from program services such as web archiving and digitization provided to partners. In its 2023 filing, contributions accounted for approximately 68% of total revenue at $16.1 million, while program service revenue contributed 31% or $7.3 million. These streams support operations managing over 175 petabytes of archived data, with funding enabling free public access to collections.

Notable grants have come from foundations including the Hewlett Foundation ($3.15 million across 2003, 2006, and 2017), the Knight Foundation ($1.85 million from 2012 to 2016), and the Andrew W. Mellon Foundation (including $942,000 from 2006 to 2018 and a $750,000 grant in 2024 for community web archiving expansion). Other significant donations include $2 million from the Pineapple Fund in 2017 and $1.93 million from Arnold Ventures in 2015. The organization also benefits from in-kind donations of materials and relies on recurring individual contributions to sustain daily operations serving millions of users.

Financial data from IRS filings reveal fluctuating revenue and rising expenses, with a notable deficit in recent years:
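Year | Total Revenue | Total Expenses | Net Income/(Loss) | Net Assets
2023 | $23,678,074 | $32,674,667 | -$8,996,593 | -$3,530,018
(year missing in source) | $30,547,311 | $25,827,598 | $4,719,713 | $4,212,232
(year missing in source) | $29,414,365 | $25,327,789 | $4,086,576 | $3,099,999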
Expenses surged 26% between the preceding period listed above and 2023, driven by operational scaling and legal costs, eroding prior surpluses and resulting in negative net assets. Financial sustainability faces pressures from escalating storage and preservation costs for vast digital collections, alongside multimillion-dollar lawsuits that have imposed operational restrictions and potential liabilities. In Hachette v. Internet Archive (2023, affirmed 2024), courts ruled the organization's controlled digital lending violated copyright by substituting for licensed e-books, leading to the removal of over 500,000 titles and undermining a core revenue-adjacent model. Ongoing litigation, including a 2025 settlement with music publishers over the Great 78 Project and a separate $700 million claim, further strains resources amid reliance on volatile donations rather than diversified income. These factors, combined with technical demands of replaying archived content, heighten risks to long-term viability without expanded grants or service contracts.
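The percentages quoted in this section can be rechecked from the dollar figures given above. The sketch below uses only numbers stated in the text and table. The contribution and program-service amounts are the rounded values from the prose, so the shares are approximate, and the variable names are illustrative.

```python
# Dollar figures as stated in the text and table above.
revenue_2023 = 23_678_074
expenses_2023 = 32_674_667
contributions_2023 = 16_100_000    # "$16.1 million", rounded in the text
program_revenue_2023 = 7_300_000   # "$7.3 million", rounded in the text
expenses_prior = 25_827_598        # expenses in the row after 2023

def pct(part, whole):
    """Share of `whole` represented by `part`, in percent."""
    return 100 * part / whole

net_2023 = revenue_2023 - expenses_2023
contribution_share = pct(contributions_2023, revenue_2023)
program_share = pct(program_revenue_2023, revenue_2023)
expense_growth = pct(expenses_2023 - expenses_prior, expenses_prior)

print(net_2023)  # -8996593, matching the table's net loss
print(f"{contribution_share:.1f}% {program_share:.1f}% {expense_growth:.1f}%")
```

The shares come out at roughly 68% and 31%, and expense growth at about 26.5%, which the text reports as 26%.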

Technical Operations

Archiving Methodologies

The Internet Archive's archiving methodologies encompass automated web crawling, manual digitization of physical media, and ingestion of user-submitted digital files. Web content is primarily captured using Heritrix, an open-source crawler developed by the organization, which performs web-scale harvests by following hyperlinks, respecting robots.txt directives, and storing snapshots in the Web ARChive (WARC) format to retain metadata, payloads, and structural elements for replayability. Heritrix employs modular components for scheduling, politeness throttling to mitigate server load, and handling of diverse content types, including dynamic elements where feasible, enabling both broad internet-wide crawls and targeted collections via partnerships.

Physical books and texts are digitized through a proprietary, non-destructive scanning system featuring dual overhead cameras, automated V-shaped cradles that minimize spine stress, and software-driven image processing that captures pages at resolutions up to 400 DPI while correcting for artifacts such as finger occlusion. Operators manually turn pages and align books, allowing the operation to process approximately 3,500 volumes daily across global partner sites, with post-processing generating searchable PDFs and derived formats such as DAISY for accessibility.

Audio materials, particularly analog formats like vinyl records, undergo real-time digitization on arrays of synchronized turntables equipped with high-fidelity needles and amplifiers, capturing full sides in 20-minute sessions per LP to preserve the surface noise and dynamic range characteristic of original pressings. This method, scaled across 12 or more units, facilitates batch processing while avoiding acceleration artifacts. It is supplemented by digital uploads, where contributors provide uncompressed source files for automated derivative creation in multiple bitrates.

Television broadcasts are archived via continuous capture of U.S. national feeds from cable and over-the-air sources starting June 2009, employing server-based tuners and encoding pipelines to record programs in their entirety, with closed-caption data extracted for full-text searchability across millions of hours of footage.

These methodologies prioritize fidelity and completeness, integrating integrity checks and metadata to support long-term preservation and utility.
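The crawl loop described above (start from a seed, follow hyperlinks, honor robots.txt, throttle requests) can be sketched in a few lines. This is a simplified illustration using only the Python standard library, not Heritrix itself: the `PAGES` dictionary, the `ROBOTS_TXT` rules, and the `sketch-bot` user agent are all hypothetical stand-ins so that the example runs without network access.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "https://example.org/": '<a href="/about">About</a> <a href="/private/x">X</a>',
    "https://example.org/about": '<a href="/">Home</a> <a href="/news">News</a>',
    "https://example.org/news": "<p>No links here.</p>",
    "https://example.org/private/x": "<p>Excluded by robots.txt.</p>",
}
ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, user_agent="sketch-bot", delay_seconds=0.0):
    """Breadth-first crawl from seed; returns the URLs captured, in order."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT)          # a real crawler would fetch /robots.txt
    frontier, seen, captured = deque([seed]), {seed}, []
    while frontier:
        url = frontier.popleft()
        if not robots.can_fetch(user_agent, url):
            continue                   # excluded by the site operator
        body = PAGES.get(url)          # stand-in for an HTTP GET
        if body is None:
            continue
        captured.append(url)           # a real crawler would write a WARC record
        time.sleep(delay_seconds)      # politeness throttle between requests
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return captured


print(crawl("https://example.org/"))
```

The page under `/private/` is discovered through a link but never captured, which is the behavior the robots.txt rule requests. A production crawler adds per-host queues, retry logic, and WARC output, none of which is modeled here.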

Infrastructure and Scalability

The Internet Archive maintains its core infrastructure across data centers featuring approximately 750 physical servers supporting 1,300 virtual machines, which manage over 30,000 storage devices, including more than 20,000 spinning hard disk drives arranged in 75 racks with configurations such as 36 drives per data node. Data is mirrored across drives and multiple data centers to ensure redundancy and availability, and distributed storage systems spread the archival load.

As of October 2025, the organization's total data holdings exceed 150 petabytes, encompassing web archives, digitized books, audio, video, and software collections. The Wayback Machine alone accounts for over 100 petabytes, having archived one trillion web pages by adding roughly 500 million pages daily. Storage capacity has expanded significantly from 70 petabytes in December 2020, driven by ongoing hardware acquisitions funded primarily through donations. These expansions include modular additions such as containerized data centers, for example a 20-foot unit housing 63 server clusters that provide 4.5 petabytes of initial capacity.

Scalability depends on techniques, including compression, that optimize storage efficiency amid growth in archived content. However, this expansion faces challenges including high operational costs for servers, bandwidth, and power consumption, estimated to require substantial annual funding to sustain petabyte-scale storage. Reliance on donor-supported hardware limits rapid scaling, while the need for continuous mirroring and redundancy increases the complexity of managing data across facilities. Despite these hurdles, the infrastructure supports daily ingestion of millions of items, reflecting adaptive strategies to accommodate the internet's growing volume.

Downloads from the Internet Archive are commonly slow, with users reporting typical speeds of 100–800 KB/s (or 0.8–6.4 Mbps), regardless of their internet connection. This stems from high user demand, limited server bandwidth, and rate limiting to manage load, as the non-profit prioritizes accessibility over high-speed delivery for large files.
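The unit conversion behind the quoted speed range, and what it means for a large download, can be worked through as follows. This is a back-of-the-envelope sketch using decimal units (1 KB = 1,000 bytes); the 10 GB file size is an arbitrary illustration, not a figure from the source.

```python
def kb_per_s_to_mbps(kb_per_s):
    """Convert kilobytes per second to megabits per second (decimal units)."""
    return kb_per_s * 8 / 1000


def download_hours(size_gb, kb_per_s):
    """Hours needed to transfer size_gb gigabytes at a steady kb_per_s."""
    size_kb = size_gb * 1_000_000      # 1 GB = 1,000,000 KB in decimal units
    return size_kb / kb_per_s / 3600


for speed in (100, 800):
    print(f"{speed} KB/s = {kb_per_s_to_mbps(speed)} Mbps, "
          f"10 GB takes about {download_hours(10, speed):.1f} h")
```

At the reported range, a 10 GB item would take roughly 3.5 hours at the fast end and nearly 28 hours at the slow end, which illustrates why large collections are often fetched over long-running or resumable transfers.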

Web Archiving

Wayback Machine

The Wayback Machine is a service provided by the Internet Archive that enables users to access archived versions of web pages from various points in time, preserving a historical record of the web. It operates by systematically crawling the web to capture publicly available content, storing snapshots that can be retrieved by entering a URL and selecting a specific date. Launched publicly in 2001 after initial archiving efforts began in 1996, the service had already accumulated over 10 billion archived pages by its debut, reflecting the rapid growth of web content at the time.

Web crawling for the Wayback Machine relies on open-source software such as Heritrix, an extensible, archival-quality crawler designed for large-scale operations. The process starts with seed URLs, typically popular sites, from which the crawler follows hyperlinks to discover and download additional pages, prioritizing publicly accessible data while respecting robots.txt directives where implemented. Captured content is stored in WARC (Web ARChive) format, which encapsulates the full HTTP transaction, including headers, metadata, and payloads, ensuring fidelity to the original presentation. The system indexes these archives to allow temporal queries, reconstructing pages as closely as possible to their live state, though dynamic elements such as JavaScript-generated content or paywalled material may not fully render in older snapshots.

Users interact with the Wayback Machine through its web interface at web.archive.org, where they can search by URL to view a calendar of available captures or use keyword searches across archived sites. Additional features include "Save Page Now," which allows on-demand archiving of current pages via browser extensions or API calls, and advanced APIs for programmatic access to capture data and availability timelines. The service supports a range of uses, including legal ones, by providing verifiable historical records, with captures often admissible in court under business records exceptions despite occasional challenges.

By October 2025, the Wayback Machine had preserved over one trillion web pages, marking a significant milestone in digital preservation and establishing it as the largest public repository of web history. This scale underscores its role in combating link rot, where an estimated 25% of web pages cited in academic literature become inaccessible within four years. However, archiving activity faced disruptions in 2025, with snapshots of major news site homepages dropping sharply from 1.2 million between January and May to just 148,628 from May to October, attributed to breakdowns in partnered crawling projects rather than technical failures. Legal scrutiny has occasionally targeted Wayback captures, including debates over blocking crawlers to prevent unauthorized archiving or AI training data extraction, though no major shutdowns specific to web archiving operations have occurred.
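Programmatic lookup of a capture closest to a given date can be sketched as below. The example assumes the publicly documented availability endpoint at `https://archive.org/wayback/available` and its `archived_snapshots.closest` response shape; the `SAMPLE_RESPONSE` values are made up for illustration, and no network request is issued here.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint of the Wayback availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build a query URL asking for the capture closest to timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp   # YYYYMMDDhhmmss, may be truncated
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return (replay_url, timestamp) for the closest capture, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


# Illustrative response in the assumed shape; the values are invented.
SAMPLE_RESPONSE = json.dumps({
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com/",
            "timestamp": "20060101064348",
        }
    },
})

print(availability_query("example.com", "20060101"))
print(closest_snapshot(SAMPLE_RESPONSE))
```

To run this against the live service, the query URL would be fetched with any HTTP client and the body passed to `closest_snapshot`; a response with an empty `archived_snapshots` object means no capture was found.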

Specialized Web Collections

The Internet Archive develops specialized web collections through selective, partner-driven crawling efforts that target specific domains, events, organizations, or themes, distinct from the comprehensive, automated snapshots of the Wayback Machine. These collections prioritize curated preservation of culturally significant or institutionally relevant online content, such as government records, non-profit websites, and ephemeral event pages, using tools that capture and index materials on demand.

A primary mechanism for these collections is the Archive-It service, launched in February 2006 as a subscription-based platform enabling libraries, archives, museums, and other entities to build and manage their own web archives. By 2014, Archive-It supported 326 partner organizations in creating 2,700 public collections; more recent figures put it at over 1,200 partners across more than 45 countries and more than 10,000 collections. Partners define "seeds" (starting URLs) for crawls, apply metadata for organization, and access features such as playback interfaces and data export in formats such as WARC files for long-term preservation. This approach addresses gaps in broad crawls, such as dynamic content or sites requiring permissions, while ensuring compliance with legal mandates like records retention for public agencies.

Notable examples include the Community Webs program, which archives local historical and community-oriented sites, with metadata from over 4,800 websites integrated into other platforms as of September 2022. Specialized thematic collections cover crises, capturing more than 21,000 resources related to events like pandemics since 2014; disaster responses, such as wildfire documentation; and institutional records, including university and state agency publications. One special collection, preserved after a web hosting service's shutdown, exemplifies domain-specific rescues, safeguarding nearly 15 years of user-generated personal web pages. These efforts enhance research accessibility, with tools applied to collections to produce analytical datasets.

Archive-It collections often involve collaborative crawls for spontaneous events, such as the 2011 Japanese earthquake response, and educational initiatives like K-12 web archiving programs, fostering a distributed network of preservation. By emphasizing user control and curation, the service mitigates limitations of automated archiving, such as incomplete captures of JavaScript-heavy sites, though it relies on partner subscriptions for sustainability and may exclude paywalled or restricted content without explicit inclusion.

Digital Libraries

Books and Texts

The Internet Archive's Books and Texts collection encompasses over 47 million digitized items, including books, journals, microforms, archival materials, maps, diaries, and photographs, available in more than 184 languages. Launched on December 16, 2004, the collection features over 20 million freely downloadable books, primarily public domain works, alongside 2.3 million modern eBooks available for borrowing with a free account. Digitized books exceed 4 million volumes, sourced through partnerships with over 1,100 libraries and institutions since 2005.

Open Library, a project of the Internet Archive, serves as an open catalog of over 20 million book records, compiling editions and works from institutional catalogs and user contributions to facilitate universal access to published human knowledge. It integrates with the Books and Texts collection to enable searching, borrowing, and metadata enhancement, supporting formats such as PDF and DAISY files for accessibility.

Books are acquired and digitized via non-destructive scanning processes using custom machines, which capture pages one at a time without removing bindings, at over 33 global centers across four continents. The Internet Archive digitizes approximately 3,500 books daily through these efforts, often in collaboration with libraries that send physical copies for conversion into searchable digital texts via optical character recognition (OCR). After scanning, items undergo quality checks and metadata assignment before upload.

The lending model employs Controlled Digital Lending (CDL), where one digital copy circulates at a time for each physical copy owned, with loans lasting 14 days or one hour for in-browser reading, limited to 10 books per user. Following a 2023 federal court ruling in Hachette v. Internet Archive, which found the practice violated copyright for certain titles, over 500,000 books were removed from lending availability in 2024, though millions of public domain and other volumes remain accessible. Publishers argued CDL exceeded fair use by enabling unauthorized reproductions and distributions, a position upheld on appeal in 2024.
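The owned-to-loaned rule described above can be modeled as a small state check: a loan is granted only if a copy is free, the loan expires after 14 days, and a user holds at most 10 loans. The sketch below is a hypothetical illustration of that rule, not the Archive's lending code; the class and method names are invented for the example, and the one-hour in-browser loan is not modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

LOAN_PERIOD = timedelta(days=14)   # loan length cited in the text
MAX_LOANS_PER_USER = 10            # per-user cap cited in the text


@dataclass
class Title:
    owned_copies: int                          # physical copies held
    loans: dict = field(default_factory=dict)  # user -> due date


class LendingDesk:
    """Hypothetical model of a one-copy-per-owned-volume lending check."""

    def __init__(self):
        self.titles = {}

    def add_title(self, title_id, owned_copies):
        self.titles[title_id] = Title(owned_copies)

    def _expire(self, now):
        # Drop every loan whose due date has passed.
        for title in self.titles.values():
            title.loans = {u: due for u, due in title.loans.items() if due > now}

    def loans_held(self, user):
        return sum(1 for title in self.titles.values() if user in title.loans)

    def borrow(self, title_id, user, now):
        """Grant a loan only if a copy is free and the user is under the cap."""
        self._expire(now)
        title = self.titles[title_id]
        if user in title.loans:
            return False                       # already borrowing this title
        if len(title.loans) >= title.owned_copies:
            return False                       # every owned copy is on loan
        if self.loans_held(user) >= MAX_LOANS_PER_USER:
            return False                       # per-user limit reached
        title.loans[user] = now + LOAN_PERIOD
        return True


desk = LendingDesk()
desk.add_title("book-a", owned_copies=1)
start = datetime(2024, 1, 1)
print(desk.borrow("book-a", "alice", start))                     # True
print(desk.borrow("book-a", "bob", start))                       # False
print(desk.borrow("book-a", "bob", start + timedelta(days=15)))  # True
```

The second request fails because the single owned copy is already out; it succeeds on day 15 once the 14-day loan has lapsed. Suspending the `owned_copies` check is, in effect, what the National Emergency Library did.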

Audio and Music Collections

The Internet Archive's Audio Archive encompasses millions of digitized sound recordings, including music, spanning genres from historical 78 rpm discs to contemporary live performances, with over 13 million items stored across 2.7 petabytes as of late 2025. These collections emphasize preservation of public domain and openly licensed materials, alongside user-contributed content, enabling free streaming and downloads in formats including MP3.

A cornerstone of the music holdings is the Live Music Archive, launched in 2002, which curates over 250,000 concert recordings exceeding 250 terabytes, primarily in lossless audio. This ad-free repository features fan-sourced and officially approved live sets, with monthly uploads averaging around 1,000 items and coverage dating to 1959. Contributions rely on permissions from performers or estates, focusing on non-commercial dissemination to document musical history without supplanting studio releases.

The Great 78 Project, a collaborative effort, targets the preservation of approximately 250,000 pre-1964 78 rpm singles (equating to 500,000 songs) from labels like Victor and Columbia, capturing early popular recordings often absent from modern catalogs. Volunteers and partner institutions scanned and processed these discs, retaining original surface noise to maintain authenticity, with thousands made publicly accessible until legal challenges arose. In 2023, major labels filed suit alleging mass copyright infringement via the project's hosting of post-1923 recordings still under copyright protection, prompting the removal of nearly 500 disputed tracks; a September 2025 settlement preserved the initiative's core focus while resolving claims for $621 million in potential damages.

Additional music-oriented subsets include Community Audio, rebranded from Open Source Audio to accommodate user-uploaded original tracks, podcasts, and netlabel releases (electronic and experimental music distributed freely by independent labels), and the 78 RPMs and Cylinder Recordings collection, which archives pre-electric era artifacts like Edison cylinders from the 1890s onward. These efforts collectively prioritize archival integrity over commercial viability, though they have drawn criticism from rights holders for potentially undermining licensing markets, a contention the Archive counters by highlighting gaps in commercial preservation of niche or obsolete formats.

Visual and Moving Image Archives

The Internet Archive's Moving Image Archive, launched on February 26, 2005, hosts over 14 million digital video files encompassing a wide range of content, including classic full-length films, news broadcasts, cartoons, concerts, and user-uploaded videos. This collection spans 23.4 petabytes of storage and includes materials digitized from archival sources as well as contributions from users worldwide, with a focus on public domain works and ephemeral media at risk of loss. Notable sub-collections feature educational films, home movies, and alternative news footage, aimed at preserving visual history for public access and research.

A key component is the TV News Archive, initiated in 2009, which captures and stores U.S. broadcast television programs for non-commercial, educational purposes. As of 2024, it includes over 3 million broadcasts from major networks, searchable via transcripts, totaling millions of hours of footage dating back to the archive's start. The archive employs automated recording through TV tuners to document daily news cycles, enabling researchers to analyze historical events, media trends, and public discourse without relying on potentially selective commercial archives. Specialized subsets, such as the 9/11 TV News Archive with 3,000 hours from 20 international channels covering the attacks and immediate aftermath, highlight its role in event-specific preservation.

Preservation efforts extend to physical media conversion, including videotapes and films, to prevent degradation of analog formats. The archive prioritizes open access, allowing downloads and streaming, though access to some recent TV content requires borrowing privileges to respect broadcaster agreements. These initiatives underscore the Internet Archive's commitment to safeguarding moving images against loss, with cumulative views exceeding 9 billion as of recent counts.

Software and Miscellaneous Holdings

The Internet Archive's software holdings form one of its most comprehensive digital preservation efforts, encompassing the largest collection of vintage and historical programs worldwide, with over 1.3 million items stored across 1.5 petabytes and comprising 28.5 million files. These include demos, applications, utilities, games, and operating systems from platforms spanning the 1980s to the early 2000s, such as the Atari 800. Disk images, ISOs, and other files are archived to preserve original formats, with subcollections like the TOSEC database providing over 450,000 images (3.6 terabytes) for retrocomputing emulation across multiple systems.

Emulation capabilities allow in-browser execution of much of the collection, using emulators such as JSMESS and supporting over 250,000 playable software items as of September 2023. Dedicated subcollections highlight specific eras and genres, including over 4,000 classic titles via DEMU, thousands of entertainment and strategy titles, and curated historical packages selected for cultural or technical significance. Over 2,500 CD-ROMs are preserved as ISO images, reflecting the distribution methods of pre-internet software dissemination.

Miscellaneous holdings complement these efforts with additional digital ephemera, such as dormant FTP site mirrors, real-time captures, high-score replays, and previews from defunct archives. The collection also incorporates repositories, Flash animations and games via the Flash Showcase (curated for historical representation of browser-based media), and video news releases bundled with software artifacts. These items, often sourced from user contributions or recovered mirrors, emphasize preservation of transient media such as early web demos and supplements, totaling additional terabytes integrated into the broader software ecosystem.

Book Scanning and Lending Litigation

The Internet Archive's book scanning and lending practices came under legal scrutiny in June 2020 when four major publishers (Hachette Book Group, HarperCollins Publishers, John Wiley & Sons, and Penguin Random House) filed a copyright infringement lawsuit against the organization in the U.S. District Court for the Southern District of New York (Hachette Book Group, Inc. v. Internet Archive). The suit targeted the Archive's Controlled Digital Lending (CDL) program, under which the nonprofit scans physical books from its collection and lends digital copies to users on a one-to-one basis, mimicking traditional library lending by ensuring only one digital copy circulates at a time for each physical volume owned. It also challenged the National Emergency Library (NEL), a temporary initiative launched in March 2020 amid the COVID-19 pandemic that suspended the one-patron-at-a-time limit, allowing simultaneous borrowing of digital scans until June 2020.

The publishers alleged that the Archive's scanning of over 1.5 million books without permission and their subsequent distribution constituted direct infringement, arguing that CDL does not qualify as fair use because it serves as a market substitute for licensed e-books rather than a transformative purpose. The Internet Archive defended the practice as fair use under Section 107 of the Copyright Act, contending that digital lending preserves access to knowledge akin to physical libraries, adds value through searchability and preservation, and does not harm ebook markets given the limited borrowing periods (typically 14 days) and the predominance of out-of-print titles. Supporters of the Archive emphasized CDL's role in equitable access, particularly for underserved users, while critics highlighted potential lost licensing revenue and unauthorized dissemination.

On March 24, 2023, a U.S. district judge ruled in favor of the publishers, finding that the Archive's activities failed all four fair use factors: the scans were non-transformative copies of creative works, made primarily for commercial substitution rather than for purposes such as criticism; they targeted the core protected elements of books; and they caused cognizable market harm by diverting potential sales and licensing, especially for in-print titles. The court rejected the library-lending analogy, noting that digital copies lack the physical constraints of lending and enable perfect reproductions that compete directly with authorized digital editions. The Internet Archive appealed to the Second Circuit Court of Appeals, which unanimously affirmed the district court's decision on September 4, 2024, in an opinion emphasizing that the lending model undermined publishers' incentives to invest in digital markets without providing new expressive content or functionality.

In December 2024, the Internet Archive announced it would not seek U.S. Supreme Court review, effectively ending the litigation and committing to remove approximately 500,000 commercially available titles from its lending program in accordance with a prior settlement agreement with the Association of American Publishers. The ruling has broader implications for digital lending, prompting libraries to reconsider CDL implementations and reinforcing publishers' control over ebook distribution, though the Archive maintains that it will continue lending public domain and permissively shared works while advocating for legislative reforms to support controlled digital access.

In 2023, major record labels including UMG Recordings, Capitol Records, Concord Musical Group, Sony Music Entertainment, and Arista Music filed a lawsuit against the Internet Archive in the United States District Court for the Southern District of New York, targeting the organization's Great 78 Project. The project, launched to preserve early 20th-century audio by digitizing fragile 78 rpm records (many donated by the public), involves crowdsourced scanning and public streaming of over 250,000 sides from approximately 5,000 artists, with the goal of preventing loss of irreplaceable cultural artifacts not otherwise commercially reissued. The plaintiffs alleged that the Internet Archive operated an "illegal record store" by willfully streaming more than 4,000 pre-1972 recordings without licenses, thereby depriving labels of licensing revenue from modern streaming platforms and violating federal copyright law, including the protection of recordings fixed before February 15, 1972, under state law and subsequent federal extensions.

The Internet Archive defended the project as non-commercial preservation work qualifying under the fair use doctrine, arguing that the recordings (often "orphan works" with unclear ownership or no active market exploitation) posed no substantial harm to labels' incentives, given their rarity in catalogs and the minimal streaming volumes compared to licensed services. The organization emphasized first-come, first-served digitization of public donations, with takedown compliance for verified claims, and contended that the suit threatened broader digital heritage efforts by prioritizing revenue over accessibility of pre-digital era media vulnerable to physical decay. Labels countered that even low-volume streams eroded their exclusive rights, estimating damages at up to $150,000 per infringed work, initially seeking around $400 million and later amending to $621 million across the contested tracks, while dismissing fair use as inapplicable to systematic reproduction and distribution.

In April 2024, the court denied the Internet Archive's motion to dismiss, allowing the infringement claims to proceed on grounds that the pleadings sufficiently alleged unauthorized reproduction and distribution beyond transformative or archival exceptions. The case highlighted tensions between copyright maximalism and cultural preservation, with critics of the labels noting that many 78-era masters remain unremastered or unavailable due to commercial disinterest, potentially justifying access under doctrines such as abandonment, though courts have historically upheld owners' control over pre-1972 recordings absent explicit statutory exemptions. On September 15, 2025, the parties reached a confidential settlement, notifying the court of resolution without admission of liability or disclosure of terms, financial payments, or changes to the project's operations, thereby concluding the litigation amid ongoing debates over orphan works reform and the scope of fair use in nonprofit archiving. No additional major music preservation suits against the Internet Archive have advanced to similar prominence, though the Great 78 settlement underscores persistent challenges in balancing proprietary claims with the need to safeguard obsolete formats against physical decay.
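The damages figures quoted above can be related to one another with simple division. The sketch assumes, purely for illustration, that every contested recording is claimed at the $150,000 per-work ceiling mentioned in the text; actual pleadings may count works differently.

```python
MAX_STATUTORY_PER_WORK = 150_000   # per-work ceiling cited in the text


def implied_works(total_claim):
    """Number of works implied if each is claimed at the per-work ceiling."""
    return total_claim / MAX_STATUTORY_PER_WORK


for label, claim in (("initial", 400_000_000), ("amended", 621_000_000)):
    print(f"{label}: ${claim:,} implies about {implied_works(claim):,.0f} works")
```

Under that assumption the amended $621 million claim corresponds to 4,140 works, in line with the "more than 4,000" recordings named in the complaint, while the initial figure of around $400 million corresponds to roughly 2,667.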

Other Intellectual Property Conflicts

The Internet Archive has encountered intellectual property disputes involving software preservation, where hosting emulated programs and game ROMs has prompted DMCA takedown notices from copyright holders. These notices, issued under the Digital Millennium Copyright Act, compel removal to maintain safe harbor protections, as seen in cases involving vintage video games, resulting in the deletion of hosted emulation files. The Archive relies on periodic DMCA exemptions granted by the U.S. Copyright Office for archiving obsolete software formats requiring original hardware or damaged protection mechanisms, such as dongles, but efforts to expand these exemptions were rejected in October 2024, limiting legal circumvention of access controls.

In the realm of web archiving via the Wayback Machine, the Internet Archive has faced copyright claims asserting that capturing and making available snapshots of copyrighted webpages constitutes infringement, particularly when sites include images, videos, or proprietary content. Website owners can request exclusions via robots.txt directives or submit DMCA notices for specific archived pages, which the Archive processes to avoid liability, though it defends non-commercial preservation as fair use for historical and evidentiary purposes, such as in legal proceedings. A notable early conflict ended when the Archive settled a lawsuit over archived web content, agreeing to undisclosed terms without admitting wrongdoing.

Disputes over visual and moving image holdings, including films and television captures, have similarly triggered DMCA takedowns for non-public domain materials, with the Archive removing content upon valid claims while arguing for research and cultural preservation. These incidents highlight ongoing tensions between the Archive's mission and rights holders' enforcement, often resolved through compliance rather than litigation, but underscoring vulnerabilities in hosting diverse digital artifacts without explicit permissions.

Controversies and Criticisms

The Internet Archive has faced multiple allegations of systematic copyright infringement, primarily centered on its digital lending practices and unauthorized reproduction of protected works. In June 2020, four major publishers (Hachette Book Group, HarperCollins Publishers, John Wiley & Sons, and Penguin Random House) filed a lawsuit in the U.S. District Court for the Southern District of New York, accusing the Archive of willful infringement through its lending program, which scans physical books it owns and lends digital copies on a one-to-one basis via controlled digital lending (CDL). The suit followed the Archive's National Emergency Library initiative, launched in March 2020 amid the COVID-19 pandemic, which temporarily suspended lending waitlists to allow unlimited simultaneous digital checkouts of over 1.4 million scanned books, prompting claims that this model directly competed with authorized e-book sales and licensing markets without permission or compensation.

In March 2023, the district court ruled that the Archive's CDL practices did not qualify as fair use under Section 107 of the Copyright Act, finding they failed the transformativeness and market harm factors by reproducing complete works without adding new expression or insight, thus supplanting publishers' licensing revenues for in-copyright titles. The U.S. Court of Appeals for the Second Circuit affirmed this decision on September 4, 2024, in a 64-page opinion rejecting the Archive's defenses and emphasizing that mass reproduction and lending of entire books harmed the market for digital editions, even if physical copies were owned. The Archive had opted not to seek Supreme Court review by December 2024, leading to a consent judgment requiring removal of scanned copies of the plaintiffs' works from its systems, though it maintained that CDL aligns with traditional library lending principles extended to digital formats.

Separately, in 2023, major record labels including Universal Music Group, Sony Music Entertainment, and Capitol Records (representing the RIAA) sued the Archive in federal court, alleging copyright infringement via the Great 78 Project, which digitized over 4,000 pre-1972 sound recordings from 78rpm shellac discs and made them available for streaming and download without licenses, including works by artists like Frank Sinatra and Chuck Berry. The complaint initially sought statutory damages potentially exceeding $400 million and was later amended to include additional tracks, pushing claims toward $700 million, framing the project as an "illegal record store" that enabled unauthorized public access and distribution. In September 2025, the parties entered a settlement resolving claims over streaming of vintage recordings, with terms undisclosed but requiring the Archive to address unauthorized reproductions, highlighting tensions between preservation efforts and rights holders' control over legacy audio markets.

These cases underscore broader allegations that the Archive's "free digital library" model circumvents copyright law by prioritizing unrestricted access over licensing. Critics, including publishers and labels, argue it undermines incentives for new works by eroding revenue streams (citing the publishers' claims of lost e-book sales during the NEL period), while supporters, including some librarians and advocates, contend it emulates physical library functions without net market harm. No criminal charges have resulted, but the rulings have prompted the Archive to delist thousands of titles and face ongoing scrutiny over its handling of in-copyright materials in other collections, such as software emulation and television captures.

Content Hosting and Access Restrictions

The Internet Archive hosts digitized content including web snapshots, books, audio recordings, and software, making it publicly accessible via platforms such as the Wayback Machine, but implements removal procedures in response to Digital Millennium Copyright Act (DMCA) notices for alleged infringement. Upon receiving a valid DMCA takedown request, the organization expeditiously removes or disables access to the specified material, as outlined in its copyright policy, and terminates accounts of repeat infringers. This compliance has led to the excision of substantial holdings, such as over 500,000 books from its lending library following the 2023 district court ruling in Hachette v. Internet Archive, which rejected the organization's fair use defense for its digital lending of scanned copyrighted works.

Critics from preservation communities argue that such removals, particularly when initiated by copyright holders rather than site owners, undermine the archival mission by selectively erasing the historical record, as evidenced by the Internet Archive's handling of user-uploaded or crawled content without initial proactive restrictions. In contrast, rights holders contend that the platform's hosting of unauthorized copies (often without owned physical originals for all items) facilitates widespread infringement, prompting demands for stricter upfront access controls beyond reactive takedowns. The organization's reliance on fair use claims for hosting has been rejected in federal courts, which held that systematic digital reproduction and distribution exceed transformative or limited-use exceptions.

Access to hosted content is further restricted by adherence to robots.txt directives, which site operators use to exclude pages from crawling and subsequent indexing, effectively preventing archival preservation and public retrieval of those materials. External platforms have imposed blocks, such as Reddit's August 2025 decision to restrict Internet Archive crawlers amid concerns over AI data scraping, limiting future archiving of subreddit content. Controversial cases include the September 2022 removal of a forum's archives from the Wayback Machine, prompted by harassment-related concerns rather than copyright claims, which preservationists criticized as a policy shift toward content-based exclusions inconsistent with prior tolerance for other controversial sites. While the Internet Archive's access model promotes non-discriminatory, open availability, practical limitations arise from legal obligations and partner pressures, balancing preservation against infringement liabilities.

Economic Effects on Creators and Markets

Publishers and authors have argued that the Internet Archive's (IA) controlled digital lending of scanned books undermines revenue from sales and licensing, serving as a direct substitute for paid access. In the 2020 lawsuit Hachette v. Internet Archive, plaintiffs including Hachette, HarperCollins, Penguin Random House, and Wiley claimed that IA's program, which lent digital copies of over 1.5 million books, harmed their primary markets by offering free, unlimited borrowing during the National Emergency Library phase in 2020 and beyond. A federal district court ruled in March 2023 that IA's practices exceeded fair use, explicitly finding market harm to publishers' e-book and print offerings, as the free digital copies competed with licensed e-books. This decision was upheld unanimously by the Second Circuit Court of Appeals on September 4, 2024, affirming that IA's lending model negatively impacts creators' economic incentives by bypassing permission-based revenue streams.

The Authors Guild, representing writers, has contended that IA's model deprives authors of royalties tied to sales and library licensing, where publishers often charge per-circulation fees. It argues that this potentially erodes incomes in an industry where author earnings are already modest, with median advances around $5,000–$10,000 for many titles. While publishers reported surging profits during the period (e.g., U.S. book sales up 20% in 2021 amid pandemic demand), they maintained that IA's unauthorized copies cannibalize potential digital revenue. The courts accepted this claim without requiring precise quantification of lost sales, relying instead on the substitution inherent in unrestricted free access. Empirical studies specifically measuring IA's sales impact remain scarce, though general research on digital piracy indicates substitution rates of 10–30%, suggesting analogous economic displacement for creators reliant on downstream royalties.

In the music sector, major record labels including Universal Music Group, Sony Music Entertainment, and Capitol Records sued IA in 2023 over its Great 78 Project, which digitized and streamed over 4,000 pre-1972 recordings from 78rpm shellac discs without licenses, alleging infringement that deprived them of streaming royalties and licensing fees. The labels sought up to $621 million in statutory damages, calculated at $150,000 per work, arguing that the streams represented lost revenue in active digital markets, even for vintage catalog material still generating income via platforms like Spotify. The case settled confidentially in September 2025, with no admission of liability by IA, but the claims underscored potential market harm to rights holders from unauthorized playback that competes with paid services. IA maintained that such preservation efforts do not supplant modern consumption, yet the dispute highlights tensions where free archival access could diminish incentives for labels to invest in catalog maintenance or reissues, indirectly affecting artist estates and legacy royalties.

Broader market effects include strained library-publisher negotiations, as IA's model pressures commercial pricing models, which already yield publishers higher margins (up to 50–70% on digital vs. 10–15% on physical lending). Critics of IA, including the Association of American Publishers, assert that this fosters a "piracy-like" environment that discourages new works by reducing predictable revenue, though proponents cite traditional physical libraries as precedent without proven sales erosion. The courts' rejection of IA's fair use defense prioritizes demonstrable economic harm to creators over unverified preservation benefits, reflecting the economic reasoning that unauthorized free copies divert paying users.
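The statutory damages figure in the record-label dispute can be checked with simple arithmetic. The sketch below is illustrative only: it uses the $150,000 per-work figure and the $621 million total quoted above, and the function names are hypothetical. The implied number of works is derived from those two figures, not taken from the court filings.

```python
# Figures quoted in the text above; everything else here is an illustrative assumption.
PER_WORK_MAX_USD = 150_000        # statutory damages claimed per work
TOTAL_CLAIM_USD = 621_000_000     # total statutory damages sought


def implied_work_count(total_usd: int, per_work_usd: int) -> int:
    """Number of works implied by a total claim at a fixed per-work amount."""
    return total_usd // per_work_usd


def max_statutory_damages(works: int, per_work_usd: int = PER_WORK_MAX_USD) -> int:
    """Upper bound on statutory damages for a given number of works."""
    return works * per_work_usd


if __name__ == "__main__":
    works = implied_work_count(TOTAL_CLAIM_USD, PER_WORK_MAX_USD)
    print(works)                         # 4140
    print(max_statutory_damages(works))  # 621000000
```

At $150,000 per work, a $621 million claim corresponds to roughly 4,140 recordings.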

Political and Ideological Biases in Archiving

The Internet Archive's archiving practices have drawn criticism for exhibiting left-center ideological biases, particularly in curation and selective preservation decisions, despite its stated mission of universal access to knowledge. One media-bias rating site rated the organization as Left-Center biased in January 2024, citing its greater reliance on sources favoring left-leaning perspectives in curated collections, though it deemed the content mostly factual. These assessments stem from analyses of the Archive's sourcing patterns in thematic collections, such as those on social issues, where progressive viewpoints predominate without equivalent emphasis on conservative counterarguments.

Founder Brewster Kahle has expressed views aligning with progressive priorities, such as advocating for publicly controlled digital access over private corporate models, as articulated in a 2023 interview in which he framed the issue as a political battle between public and private interests. Kahle's support for open-access initiatives, including opposition to proprietary barriers in software, reflects a worldview skeptical of market-driven information control, which critics argue influences prioritization in archiving by favoring anti-corporate or egalitarian narratives over free-market defenses. For instance, Kahle's involvement in preserving the 1996 U.S. presidential election records through partnerships like the one with the Smithsonian demonstrates a commitment to electoral history, but selective emphases in related collections have been noted to underrepresent conservative policy archives from that era.

A prominent example of alleged ideological bias occurred in September 2022, when the Internet Archive removed archives of the controversial forum Kiwifarms from its Wayback Machine, diverging from prior policies that had preserved other contentious sites. Kiwifarms, often criticized by progressive activists over online harassment linked to the forum (including harassment of individuals), faced deplatforming after Cloudflare terminated its services amid threats. The Archive's subsequent purge was justified internally as a response to legal and safety risks, but observers highlighted it as inconsistent with the organization's historical tolerance for fringe content, suggesting acquiescence to external progressive pressure. This action contrasted with the Archive's retention of other ideologically charged materials, such as historical Nazi materials, which it defended as necessary for contextual preservation in 2021 discussions.

Broader studies indicate that web archives like the Internet Archive's exhibit structural biases favoring content from powerful or English-dominant entities, potentially amplifying mainstream (often left-leaning institutional) narratives while marginalizing alternative viewpoints. A 2004 analysis found significant national imbalances in coverage, with U.S.-centric crawling disadvantaging non-Western conservative perspectives. Additionally, fringe communities, including those promoting right-wing conspiracy theories, have misused the Archive for ideological dissemination, as documented in a 2018 study, but the organization's responses, such as content takedowns, appear more responsive to left-activist complaints than to symmetric threats. Critics read these patterns as evidence that founder ideology and external pressures lead to non-neutral outcomes in what is purportedly comprehensive preservation.

Impact and Evaluation

Preservation Achievements

The Internet Archive's Wayback Machine has archived over 1 trillion web pages as of October 2025, marking a significant milestone in preserving web history spanning nearly three decades since its founding in 1996. This collection captures snapshots of websites at various points in time, allowing researchers and the public to access content that has since been deleted, altered, or lost due to site shutdowns, with studies indicating that approximately 25% of web pages from 2013 to 2023 have vanished from the live web. The Archive collaborates with over 1,250 partner libraries and organizations via services like Archive-It to curate specialized collections, ensuring comprehensive coverage of events, publications, and cultural artifacts.

In book preservation, the Internet Archive operates scanning centers worldwide, digitizing around 4,400 books per day since 2005, resulting in millions of texts available for download or borrowing, particularly works predating 1929. This effort has made rare and out-of-print materials accessible, including over 11,000 digitized books from 1923 alone that were released into the public domain in 2019. The organization's Open Library initiative further enhances preservation by cataloging scanned volumes and providing controlled digital lending of them, supporting scholarly access to historical literature.

The Archive has also amassed extensive audiovisual collections, including the TV News Archive, which holds over 3.5 million searchable U.S. broadcasts with closed captioning, enabling analysis of news coverage dating back to 2009. Audio preservation includes 13 million recordings, such as live concerts, while software emulation efforts maintain executable historical programs. These initiatives are supported by redundant storage exceeding 175 petabytes, with at least two copies of all data maintained to mitigate loss risks. Additionally, the Archive has archived at-risk federal government data in collaboration with partner institutions, safeguarding data vulnerable to policy changes.

Shortcomings and Failures

The Internet Archive has faced significant cybersecurity vulnerabilities, exemplified by a series of cyberattacks in October 2024 that exposed systemic weaknesses in its infrastructure. In the initial breach, hackers compromised the organization's database, resulting in a data breach affecting approximately 31 million users, including the theft of usernames, email addresses, and salted, hashed passwords. This breach was compounded by DDoS attacks that disrupted services for several days, rendering the Wayback Machine and other collections inaccessible to millions of users. Further incidents on October 20 involved additional breaches through compromised credentials, forcing the site into read-only mode and highlighting inadequate protections against persistent threats. These events not only interrupted access to preserved content but also undermined trust in the Archive's ability to safeguard sensitive user data long-term.

Archival completeness remains a persistent shortcoming, with empirical analyses revealing substantial gaps in coverage. Research indicates that 25% of web pages published between 2013 and 2023 have vanished entirely, and the Archive's crawls fail to capture much dynamic or paywalled content, contributing to "blind spots" in historical records. Between May and October 2025, snapshots of major news site homepages plummeted by 87% across 100 publications, a drop attributed to breakdowns in automated archiving projects and resource constraints. Studies of large-scale archived data, such as Twitter records from 2009–2012 covering major events, show decay and incompleteness, with imperfect captures limiting utility for researchers. These gaps stem from the Archive's reliance on periodic crawls rather than continuous, exhaustive preservation, exacerbating the broader challenge of digital ephemerality.

The policy of honoring robots.txt directives has drawn criticism for enabling retroactive content erasure, functioning as a de facto censorship mechanism. When websites update their robots.txt files to disallow access, the Wayback Machine removes previously archived snapshots from public view, allowing site owners to retroactively hide historical versions despite their prior public availability. This practice, rooted in respect for site owners' intent, contrasts with archival principles of permanence and has led to the disappearance of significant portions of the web record, such as when squatters or new owners block unrelated historical content. Although the Internet Archive adjusted its approach in 2017 to limit some retroactive effects, the policy persists in blocking visibility of pre-existing crawls, prioritizing current permissions over historical fidelity and hindering comprehensive preservation. Critics argue that this voluntary compliance undermines the Archive's mission, as it cedes control to transient site policies rather than safeguarding knowledge.
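The retroactive effect of robots.txt described above can be illustrated with a minimal sketch using Python's standard `urllib.robotparser` module. This is not the Archive's actual implementation: the user-agent name, the example URLs, and the `visible_snapshots` helper are hypothetical, and the sketch assumes a policy in which the site's current robots.txt decides whether older snapshots are shown.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used only for illustration.
HYPOTHETICAL_AGENT = "example-archiver"


def is_allowed(robots_txt: str, url: str, user_agent: str = HYPOTHETICAL_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def visible_snapshots(snapshot_urls: list[str], current_robots_txt: str) -> list[str]:
    """Assumed retroactive policy: show a stored snapshot only if the site's
    CURRENT robots.txt allows the URL, regardless of the rules at capture time."""
    return [url for url in snapshot_urls if is_allowed(current_robots_txt, url)]


# robots.txt at capture time: an empty Disallow permits everything.
OLD_RULES = "User-agent: *\nDisallow:\n"
# robots.txt after a new owner takes over the domain: everything is blocked.
NEW_RULES = "User-agent: *\nDisallow: /\n"

snapshots = ["https://example.com/", "https://example.com/history.html"]

if __name__ == "__main__":
    print(visible_snapshots(snapshots, OLD_RULES))  # both URLs remain visible
    print(visible_snapshots(snapshots, NEW_RULES))  # [] (hidden retroactively)
```

Under this assumed policy the snapshots are still stored but disappear from public view as soon as the live site's rules change, which is the behavior critics describe as retroactive erasure.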

Broader Implications for Digital Heritage

The ephemerality of web content poses significant risks to digital heritage, with estimates indicating that approximately 25% of web pages cited in academic works become inaccessible within a few years due to link rot and site deletions. The Archive's Wayback Machine has captured over 900 billion web pages since 1996, providing a critical snapshot of online history that would otherwise vanish, as evidenced by its role in preserving defunct sites like personal blogs and early forums. This preservation effort counters the inherent instability of digital platforms, where content removal by private entities, such as content purges or corporate data policies, erodes collective memory without public recourse.

Legal rulings against the Internet Archive, particularly the September 4, 2024, U.S. Court of Appeals decision upholding the district court ruling in the Hachette case, underscore tensions between preservation and intellectual property rights. The court rejected the Archive's argument that controlled digital lending is fair use, mandating removal of over 500,000 scanned books from circulation, which has already reduced access to out-of-print titles and prompted similar scrutiny of digital libraries. Such precedents may deter nonprofit archiving by increasing liability risks, potentially shifting reliance to permission-based models that favor rights holders and exclude orphaned or low-value works lacking commercial interest.

These developments highlight a trade-off: while copyright enforcement protects creators' incentives, as reflected in publishers' arguments that unauthorized lending displaces sales, overly restrictive interpretations could exacerbate digital loss, as physical libraries face obsolescence without viable digital equivalents. Independent archives like the Internet Archive fill gaps left by underfunded public institutions, but recent suits, including the record labels' claim seeking up to $700 million that settled in 2025, signal a broader chilling effect on scalable preservation efforts. Without policy reforms, such as expanded exemptions for non-commercial archiving or mandatory deposits akin to print-era laws, digital heritage risks fragmentation, privileging monetizable content over comprehensive historical records.
