LAION
LAION (acronym for Large-scale Artificial Intelligence Open Network) is a German non-profit organization that develops open-source artificial intelligence models and datasets.[1] It is best known for releasing several large datasets of images and captions scraped from the web, which have been used to train high-profile text-to-image models including Stable Diffusion and Imagen.[2][3]
Key Information
In February 2023, LAION was named in the Getty Images lawsuit against Stable Diffusion as a non-party.[4] In April 2023, LAION was directly sued by a German photographer who wanted to have his images removed from the training set.[5] In September 2024, the Regional Court of Hamburg dismissed the lawsuit, in what was described as a "landmark ruling on TDM [Text and data mining] exceptions for AI training data" in Germany and the EU more generally.[6]
On April 15, 2023, LAION and contributors publicly released an open source AI assistant chatbot called OpenAssistant.
Image datasets
LAION has publicly released a number of large datasets of image-caption pairs that have been widely used by AI researchers. The data is derived from Common Crawl, a dataset of scraped web pages: the developers searched the crawled HTML for <img> tags and treated their alt attributes as captions, then used CLIP to identify and discard images whose content did not appear to match their captions.[7] LAION does not host the scraped images themselves; rather, the dataset contains URLs pointing to images, which researchers must download on their own.[8]
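The alt-text extraction step can be sketched with Python's standard html.parser. This is a minimal illustration of the idea only, not LAION's production pipeline, which operates on Common Crawl metadata at web scale:

```python
from html.parser import HTMLParser

class ImgAltExtractor(HTMLParser):
    """Collect (src, alt) pairs from <img> tags, skipping images without alt text."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            src, alt = attr_map.get("src"), attr_map.get("alt")
            if src and alt:
                self.pairs.append((src, alt))

def extract_pairs(html: str):
    """Return candidate (image URL, caption) pairs found in an HTML string."""
    parser = ImgAltExtractor()
    parser.feed(html)
    return parser.pairs
```

In practice the candidate pairs would then be passed to CLIP-based filtering, as the article describes.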
The first such dataset, LAION-400M, was released in August 2021 and consisted of 400 million image-caption pairs. The pairs were extracted from a random subset of webpages scraped by Common Crawl between 2014 and 2021.[9] It was an attempt to recreate the process OpenAI used to collect the 400 million image-caption pairs on which the CLIP model was trained; the company had chosen to open-source the model's code and weights, but not its training dataset.[7] Imagen, a text-to-image model announced by Google Brain in 2022, was trained on LAION-400M in combination with private internal datasets.[10]
A successor of more than 5 billion pairs, LAION-5B, was released in March 2022.[11] At release, it was the largest freely available dataset of image-caption pairs in existence.[7] Its creation was funded by Doodlebot, Hugging Face and Stability AI, the AI company that funded development of the Stable Diffusion text-to-image model, which was trained on it.[12]
Criticism
Several studies have shown that LAION-5B contains problematic image-text pairs depicting rape and pornography, along with malign stereotypes, racist and ethnic slurs, and other extremely problematic content.[13][14]
An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data harvested from public websites.[15]
In December 2023, the Stanford Internet Observatory released a report on LAION-5B that found 3,226 suspected instances of links to child sexual abuse material, 1,008 of which were externally validated. In response, LAION temporarily removed LAION-5B and LAION-400M, citing its "zero tolerance policy for illegal content" and "an abundance of caution".[16] In August 2024, LAION released a cleaned dataset called Re-LAION-5B.[17]
OpenAssistant
| OpenAssistant | |
|---|---|
| *Screenshot of the data collection web portal* | |
| Developers | LAION and contributors |
| Initial release | 15 April 2023 |
| License | Apache License 2.0 |
| Website | open-assistant |
OpenAssistant was an open-source, chat-based artificial intelligence (AI) assistant that could understand tasks, interact with third-party systems and dynamically retrieve information to do so. The project was developed by a group of volunteers in collaboration with LAION. One of its development goals was free access to large language models that can be run locally on consumer hardware.[18][19] The project was backed by a worldwide crowdsourcing effort involving over 13,500 volunteers, who created over 600,000 human-generated data points.[19][20] The project has since been shut down; however, the datasets and models remain available on Hugging Face.
References
[edit]- ^ "About". LAION.ai. Retrieved 26 September 2022.
- ^ Edwards, Benj (15 September 2022). "Have AI image generators assimilated your art? New tool lets you check". Ars Technica.
- ^ Newman, Marissa; Cantrill, Aggi (24 April 2023). "The Future of AI Relies on a High School Teacher's Free Database". Bloomberg News. Retrieved 24 April 2023.
- ^ "Getty Images (US), Inc. v. Stability AI, Inc., 1:23-cv-00135". CourtListener. Retrieved 2023-02-08.
- ^ "A Photographer Tried to Get His Photos Removed from an AI Dataset. He Got an Invoice Instead". Vice. 28 April 2023. Retrieved 2023-05-04.
- ^ Goldstein, Paul; Stuetzle, Christiane; Bischoff, Susan (2024-11-13). "Kneschke vs. LAION - Landmark Ruling on TDM exceptions for AI training data – Part 1". Kluwer Copyright Blog. Retrieved 2024-11-25.
- ^ a b c Alford, Anthony (17 May 2022). "LAION Releases Five Billion Image-Text Pair Dataset LAION-5B". InfoQ.
- ^ Edwards, Benj (21 September 2022). "Artist finds private medical record photos in popular AI training data set". Ars Technica.
- ^ Schuhmann, Christoph (8 August 2021). "LAION-400-Million Open Dataset". LAION blog. Retrieved 26 September 2022.
- ^ Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad (23 May 2022). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].
- ^ Beaumont, Romain (3 March 2022). "LAION-5B: A New Era of Open Large-Scale Multi-Modal Datasets". LAION blog.
- ^ Wiggers, Kyle (12 August 2022). "This startup is setting a DALL-E 2-like AI free, consequences be damned". TechCrunch.
- ^ Birhane, Abeba; Prabhu, Vinay Uday; Kahembwe, Emmanuel (2021). "Multimodal datasets: misogyny, pornography, and malignant stereotypes". arXiv:2110.01963.
- ^ Birhane, Abeba; Prabhu, Vinay; Han, Sang; Boddeti, Vishnu Naresh; Luccioni, Alexandra Sasha (6 November 2023). "Into the LAIONs Den: Investigating Hate in Multimodal Datasets". arXiv:2311.03449.
- ^ Brunner, Katharina; Harlan, Elisa (2023-06-07). "We Are All Raw Material for AI". Bayerischer Rundfunk.
- ^ Cole, Samantha (20 December 2023). "Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material". 404 Media. Retrieved 22 December 2023.
- ^ Belanger, Ashley (2024-08-30). "Nonprofit scrubs illegal content from controversial AI training dataset". Ars Technica. Retrieved 2024-08-31.
- ^ "Open-Assistant". LAION AI. 9 March 2023. Retrieved 9 March 2023.
- ^ a b Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew (2023-04-14). "OpenAssistant Conversations -- Democratizing Large Language Model Alignment". arXiv:2304.07327 [cs.CL].
- ^ "Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development". KDnuggets. Retrieved 2023-05-05.
Founding and History
Establishment and Early Milestones
LAION, the Large-scale Artificial Intelligence Open Network, was founded in the summer of 2021 in Germany as a non-profit organization aimed at democratizing access to large-scale machine learning resources through open datasets, tools, and models.[10] The initiative was led by Christoph Schuhmann, a high school teacher with a master's degree in physics, who coordinated a global team of volunteers working remotely to address the lack of openly available data for training multimodal AI systems like OpenAI's CLIP model.[11][12] Established without initial corporate funding, LAION relied on community contributions and public grants to scale its efforts, emphasizing efficient data curation over proprietary alternatives.[1]

One of the organization's first major milestones was the release of the LAION-400M dataset on August 20, 2021, comprising 400 million English-language image-text pairs filtered using CLIP embeddings and derived from Common Crawl web scrapes.[13][14] This non-curated dataset, accompanied by k-nearest neighbors indices for similarity search, was the largest openly accessible multimodal resource at the time, enabling researchers to replicate and extend CLIP-like training without relying on closed-source data.[14] Despite containing some not-safe-for-work content, LAION-400M prioritized research utility and transparency, with explicit warnings against commercial deployment.[13]

Building on this foundation, LAION rapidly expanded its scope in early 2022 by releasing LAION-5B on March 31, 2022: a dataset of 5.85 billion CLIP-filtered image-text pairs, 14 times larger than its predecessor and sourced from over 12 trillion tokens in Common Crawl archives.[15] This milestone facilitated breakthroughs in open-source generative models, including Stability AI's Stable Diffusion, trained on a subset of LAION-5B, and underscored LAION's role in accelerating accessible AI development amid concerns over data centralization in big tech.[16] These early releases established LAION's methodology of web-scale data acquisition, aesthetic and semantic filtering, and public dissemination under permissive licenses.

Expansion and Key Developments
Following the initial release of LAION-400M on August 20, 2021 (a dataset comprising 400 million English-language image-text pairs scraped from Common Crawl and processed via distributed computing), LAION scaled its operations dramatically, leveraging volunteer contributions and open-source tools to produce larger, more refined resources.[13] This early milestone enabled broader experimentation in multimodal AI training, with the dataset's non-curated nature highlighting both its accessibility and the raw scale of web-derived data.[13]

A pivotal expansion occurred with the March 2022 launch of LAION-5B, which grew to 5.85 billion CLIP-filtered image-text pairs, incorporating multilingual captions, aesthetic scoring via CLIP models, and quality heuristics to prioritize high-relevance content for vision-language tasks.[15] The dataset's influence extended to commercial applications: Stability AI employed a 2-billion-pair subset to train Stable Diffusion, an open text-to-image model released in August 2022 that achieved state-of-the-art performance while relying on computationally efficient latent diffusion techniques.[16] The release underscored LAION's role in accelerating open-source generative AI, though it also drew scrutiny for unfiltered web data containing copyrighted or sensitive material.[17]

In response to identified risks, including the presence of child sexual abuse material (CSAM) and toxic content verified through external audits, LAION iterated on its methodology with the August 30, 2024, release of Re-LAION-5B. This refined version retained 2.7 billion pairs after applying advanced deduplication, watermark detection, and hash-based removal of over 400,000 known harmful URLs, reducing ethical liabilities while maintaining utility for model training.[18]

Organizationally, LAION formalized as a German e.V. non-profit, expanded its volunteer-driven team into structured collaborations with researchers, and developed tools like img2dataset for scalable image downloading, supporting further growth in dataset curation efficiency.[19]

Mission and Organizational Overview
Core Objectives and Non-Profit Model
LAION operates as a 100% non-profit organization with the core mission to democratize machine learning research and its applications, asserting that these fields hold substantial potential for positive global impact and thus warrant broad accessibility.[1] The organization seeks to liberate machine learning development by making large-scale datasets, models, tools, and related code freely available to the public, thereby enabling researchers, educators, and developers worldwide to advance AI without proprietary barriers.[19] This approach emphasizes public education on large-scale machine learning practices, including data management techniques, while prioritizing the reuse of existing computing resources to minimize the environmental costs of training computationally intensive models.[1]

Central to LAION's objectives is the provision of open resources that facilitate reproducible and scalable AI research, countering the trend of closed datasets controlled by commercial entities. By focusing on high-quality, ethically filtered multimodal datasets derived from public web sources, LAION aims to foster innovation in areas such as computer vision and natural language processing, ultimately promoting equitable access to foundational AI infrastructure.[1] The organization also commits to advancing sustainable AI practices, advocating for energy-efficient methodologies in dataset curation and model training to mitigate the carbon footprint of large-scale machine learning.[19]

As a non-profit entity registered in Germany with a global membership base, LAION sustains its operations through donations and public research grants, eschewing revenue models that could compromise data openness or introduce commercial incentives.[1] This funding structure ensures that all outputs remain freely accessible under permissive licenses, aligning with the organization's goal of rendering cornerstone advancements in large-scale AI publicly available to any interested community, without reliance on venture capital or corporate partnerships that might prioritize proprietary outcomes.[1] Governance emphasizes collaborative, volunteer-driven contributions from international experts, maintaining transparency in project development while avoiding the conflicts of interest inherent in for-profit alternatives.[1]

Team, Governance, and Collaborations
LAION e.V. is structured as a registered non-profit association (eingetragener Verein) under German law, operating as a community-driven open network dedicated to advancing open-source AI research.[1][20] As a non-profit, it emphasizes democratic access to machine learning resources without commercial motives, relying on volunteer contributions and memberships rather than hierarchical corporate governance.[19] Governance is decentralized, with decisions influenced by a core group of founders and researchers, though formal board structures typical of e.V. associations (such as member assemblies and elected executives) guide operations, prioritizing transparency and public benefit over profit.[1]

The founding team, established around 2021, includes nine key individuals who initiated LAION's efforts to create large-scale open datasets. Christoph Schuhmann serves as Organizational Lead and Founder, holding a master's in physics and computer science, with experience in educational initiatives like Schools of Trust. Jenia Jitsev acts as Scientific Lead and Founder; a senior researcher at the Jülich Supercomputing Centre leading the SLAMPAI Lab, he holds a PhD in computer science with expertise in neuroscience and machine learning. Richard Vencu is Engineering Lead and Founder, an AI engineer with 28 years of industry experience in automation and electronics. Other founders include Romain Beaumont (open-source scaling specialist), Robert Kaczmarczyk (community and operational lead with an epidemiological research background), Theo Coombes (big data programmer), Mehdi Cherti (core researcher on generative models), Aarush Katta (AI programmer), and Jan Ebert (software engineer at Helmholtz AI).[12]

Beyond the founders, LAION's extended team comprises approximately 22 members, including senior researchers affiliated with institutions such as Stanford University (Ludwig Schmidt), Université de Montréal (Irina Rish), Tokyo Institute of Technology (Rio Yokota), and the University of Hamburg. Huu Nguyen heads safety policy as a computer scientist and lawyer with over 15 years of experience. This distributed structure fosters expertise in areas like diffusion models, scaling laws, and multimodal learning, with members contributing to open-source tools and datasets.[12]

LAION collaborates with academic and research entities to scale its projects, including partnerships with Intel's AI Center of Excellence for AI-assisted education tools like BUD-E 1.0 and the German Research Center for Artificial Intelligence (DFKI).[21] Members' affiliations enable ties to networks like the European Laboratory for Learning and Intelligent Systems (ELLIS) and Helmholtz AI, supporting joint efforts in foundation models and supercomputing applications.[12] The organization also engages in international advocacy, petitioning bodies like the European Parliament for open AI policies and forming ad-hoc collaborations across universities to promote responsible AGI research.[22][23]

Datasets and Projects
Image and Multimodal Datasets
LAION's image and multimodal datasets primarily consist of large-scale collections of image-text pairs derived from web crawls, designed to support training of vision-language models such as CLIP and diffusion-based generators. These datasets provide alignments between images (via URLs) and associated textual descriptions (e.g., alt-text or captions), enabling zero-shot classification, text-to-image synthesis, and other multimodal tasks without proprietary labeled data. They emphasize openness, with metadata, embeddings, and tools released under permissive licenses to facilitate reproducible research.[15][3]

The inaugural dataset, LAION-400M, was released on August 20, 2021, containing approximately 400 million English-language image-text pairs sourced from Common Crawl archives (2014–2021). Pairs were filtered using OpenAI's CLIP model with a cosine similarity threshold of 0.3 to prioritize high-quality alignments, alongside exclusions for NSFW content via metadata flags, short texts, low-resolution images, and duplicates. Available in formats such as 50 GB of Parquet metadata files and 10 TB of webdataset archives (with 256x256-pixel images), it includes CLIP ViT-B/32 embeddings and supports tools for downloading and visualization, such as img2dataset and clip-retrieval. Intended for research rather than production, LAION-400M marked a shift toward scalable open alternatives to closed datasets, demonstrating the viability of training models comparable to CLIP on public web data.[13]

Building on this, LAION-5B was announced on March 31, 2022, scaling to 5.85 billion CLIP-filtered image-text pairs, 14 times larger than its predecessor, with 2.32 billion in English (Laion2B-en subset), 2.2 billion multilingual (Laion2B-multi), and 1 billion language-unassignable (Laion1B-nolang). From over 50 billion candidate pairs, filtering applied CLIP ViT-L/14 cosine similarity thresholds (0.28 for English, 0.26 otherwise), minimum text lengths (5 words), minimum image resolutions (≥128 pixels on the smaller side), and deduplication, yielding roughly 3% NSFW content and 5–6% watermarked images per subset. Accompanied by 28 billion CLIP embeddings, k-nearest-neighbor indices, and safety scores, the dataset powers open reproduction efforts for multimodal models like OpenCLIP and applications like Stable Diffusion training, and includes a web interface for exploration. Its multilingual scope and subdataset curation capabilities have advanced out-of-distribution robustness and task-specific fine-tuning in vision-language research.[15][3]

In response to safety concerns, Re-LAION-5B was released on August 30, 2024, as a 5.53 billion-pair subset of LAION-5B with targeted removals of 2,236 links matching suspected child sexual abuse material hashes from sources including the Internet Watch Foundation, Project C3P, and Stanford's 2023 report. Offered in "research" (core cleaned version) and "research-safe" (with added NSFW filtering) variants on Hugging Face, it prioritizes full reproducibility using 100% open-source web data and tools, addressing prior opacity in dataset iteration while maintaining scale for language-vision model development.[18]

These datasets function as indexes rather than stored media, linking to original web-hosted images to minimize storage demands and respect potential intellectual property constraints, though users must handle downloading and ethical usage independently. Derived subsets, such as LAION-Aesthetics (filtered for higher aesthetic scores via CLIP-based predictors), further enable specialized applications like improved image-generation quality.[15]

Language Models and Assistants
Open Assistant is an open-source project developed by LAION to create a chat-based large language model accessible on consumer-grade hardware, such as a single high-end GPU.[24] The initiative emphasizes human-centered design, task understanding, interaction with third-party systems, and dynamic information retrieval to foster innovation in language models.[25] Key components include community-driven data collection via crowdsourcing for high-quality instruction-fulfillment samples and the application of reinforcement learning from human feedback (RLHF) alongside preference modeling to align models as helpful assistants.[26][25]

Central to the project is the Open Instruction Generalist (OIG) dataset, released on March 10, 2023, comprising approximately 43 million instructions derived from 30 constituent datasets.[27] This resource, blending 75% academic sources like P3 and FLAN with diverse synthetic and augmented data for tasks including dialogue, coding, and creative writing, facilitates converting pre-trained language models into instruction-following systems through continued pre-training and fine-tuning.[27] The final oasst2 dataset, hosted on Hugging Face, aggregates over 50,000 human-generated samples refined via ranking processes.[25] These datasets underpin supervised fine-tuning and RLHF stages modeled after InstructGPT methodologies.[25]

LAION released OpenAssistant publicly on April 15, 2023, following an early preview of the supervised fine-tuned (SFT) 12-billion-parameter model on March 12, 2023.[28][29] Iterative alpha and beta versions, such as v0.0.1-beta48 on February 23, 2023, supported ongoing refinements until project completion was announced on October 25, 2023.[30] The resulting models prioritize efficiency for local deployment while aiming for proficiency comparable to proprietary systems.[28]

Extending to multilingual capabilities, LAION's Anh project builds on the OIG and Open Assistant frameworks to develop chatbots supporting diverse languages, with initial emphasis on Vietnamese as part of broader open-chat ecosystems.[31] Similarly, LeoLM, introduced on September 28, 2023, is LAION's suite of linguistically enhanced foundation models optimized for German-language tasks.[32]

In voice-assisted applications, BUD-E (Buddy for Understanding and Digital Empathy) emerged as an open-source framework announced in February 2024, designed for natural, empathic interactions on consumer hardware without internet dependency.[33] Version 1.0, released January 20, 2025, integrates privacy-compliant AI for educational use via browser-based interfaces and self-hosted APIs, incorporating fine-tuned speech recognition, language understanding, and text-to-speech models.[21] These efforts align with LAION's calls for open multi-modal personal assistants capable of processing audio alongside text.[34]

Specialized and Emerging Datasets
LAION has produced specialized datasets that apply targeted filtering or curation to subsets of its core collections, enhancing utility for niche applications such as aesthetic evaluation, logo recognition, and instruction-following in language models. These efforts address limitations in general-purpose datasets by emphasizing quality metrics, domain-specific content, or safety refinements, though they remain non-curated at scale and inherit web-scraped data risks like duplication or bias amplification.[35][27]

The LAION-Aesthetics dataset, introduced in August 2022, derives from LAION-5B by scoring image-text pairs with a linear estimator trained atop CLIP embeddings, inspired by the Aesthetic Visual Analysis (AVA) dataset's human-rated benchmarks. It prioritizes pairs with predicted aesthetic scores above thresholds (e.g., >7 for core subsets, with watermark and unsafe-content probabilities filtered below 0.8 and 0.5, respectively), yielding hundreds of millions of high-visual-quality examples suitable for training generative models less prone to low-effort outputs. An updated LAION-Aesthetics V2 incorporates refined predictors for broader applicability, while a companion LAION-Logos subset comprises 15,000 pairs focused on branded imagery with 1-10 aesthetic ratings to bolster object detection in commercial contexts.[35][36] These filters demonstrably improve downstream model performance in image-synthesis tasks, as evidenced by reduced artifacts in evaluations, though reliance on proxy predictors introduces estimation errors relative to human judgments.[35]

In language domains, the Open Instruction Generalist (OIG) dataset, released in March 2023, aggregates approximately 43 million synthetic instructions across categories like role-playing, reasoning, and coding, generated via templating from existing texts to simulate diverse prompts without proprietary data. Designed for fine-tuning open assistants, it emphasizes generality over specialization, with variants like OIG-moderation targeting safety alignment by curating adversarial examples. Empirical tests show OIG-trained models achieving competitive benchmarks in instruction adherence, outperforming smaller closed datasets in zero-shot tasks, yet analyses reveal persistent gaps in factual accuracy due to synthetic origins.[27]

Emerging multimodal extensions include DataComp, launched in April 2023 as a benchmark rather than a raw dataset, evaluating dataset-construction pipelines across 12.8 million candidates filtered for quality, fairness, and licensing via modular scorers. It highlights causal trade-offs in scaling, with top pipelines yielding models rivaling proprietary ones on image-retrieval metrics like zero-shot ImageNet accuracy (up to 75%). More recent efforts venture into audio with LAION-DISCO-12M (November 2024), linking 12 million YouTube audio clips for music information retrieval, enabling cross-modal training absent in prior image-centric releases. Similarly, LAION POP (November 2023) curates 600,000 high-resolution images with granular captions for advanced generation research, while synthetic strategic-game datasets (October 2023) generate procedural scenarios to hone AI planning without real-world biases. These nascent efforts underscore LAION's pivot toward diverse modalities, though their small scale relative to the flagships limits immediate impact, and procedural generation risks overfitting to artificial structures.[37][38]

Technical Approach
Data Acquisition and Processing
LAION's data acquisition begins with sourcing from Common Crawl, a non-profit web archive containing petabytes of crawled web data from snapshots spanning 2014 to 2021.[39][40] Researchers parse WAT metadata files derived from Common Crawl's WARC format to efficiently extract metadata, focusing on HTML <img> tags paired with associated text such as alt attributes or surrounding captions.[3] This yields billions of candidate image-text pairs; for instance, processing Common Crawl snapshots like CC12 and CC13 produced an initial 12.8 billion pairs for LAION-5B.[3]
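As a rough illustration, the per-record extraction can be sketched as below. The nested key layout and the "IMG@/src" link path follow the Common Crawl WAT format as commonly documented, but field names should be checked against the current WAT specification before relying on them:

```python
import json

def img_alt_pairs(wat_record: str):
    """Pull candidate (image URL, alt text) pairs from one WAT JSON record.

    Assumes the Common Crawl WAT layout, where HTML metadata lists each
    outgoing link; <img> links carry the path "IMG@/src" and, when the
    page supplied one, an "alt" field.
    """
    record = json.loads(wat_record)
    links = (record.get("Envelope", {})
                   .get("Payload-Metadata", {})
                   .get("HTTP-Response-Metadata", {})
                   .get("HTML-Metadata", {})
                   .get("Links", []))
    return [(link["url"], link["alt"])
            for link in links
            if link.get("path") == "IMG@/src" and link.get("alt")]
```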
The extraction process employs distributed computing to handle the scale, utilizing asynchronous libraries like Trio and Asks for batch downloads of image URLs, typically processing 10,000 links per batch on low-resource nodes (1-2 vCPUs, 0.5-1 GB RAM, 5-10 Mbps bandwidth).[39] Pairs are stored in databases such as PostgreSQL via bulk COPY operations, with language detection via tools like cld3 to categorize subsets (e.g., 2.3 billion English pairs in LAION-5B).[39] This pipeline evolved from LAION-400M, released in August 2021, which processed similar Common Crawl data to yield 400 million English pairs at a rate of 25 million per day using 100 CPU workers and one GPU.[40]
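The batched downloading step can be approximated with a thread pool from the Python standard library. This is a simplified stand-in for the async Trio/Asks pipeline described above; the worker count and timeout here are illustrative, not LAION's settings:

```python
import concurrent.futures
import urllib.request

def fetch(url, timeout=10):
    """Download one image URL; return (url, bytes) or (url, None) on failure.

    Dead links are common in web-scale URL lists, so failures are
    recorded rather than raised.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception:
        return url, None

def download_batch(urls, workers=64):
    """Fetch one batch of URLs concurrently (the pipeline above processed
    roughly 10,000 links per batch on small worker nodes)."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch, urls))
```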
Filtering prioritizes relevance and quality using the CLIP ViT-L/14 model, computing cosine similarity between image and text embeddings with thresholds of 0.28 for English and 0.26 for other languages, which discards approximately 90% of candidates and retains 5.85 billion pairs for LAION-5B, released in March 2022.[39][3] An additional aesthetic predictor scores images on a 1-10 scale, requiring scores above 5 to ensure visual appeal. Deduplication follows via perceptual hashing (e.g., dHash for images) and text similarity (e.g., MinHash), supplemented by Bloom filters on URLs, reducing redundancy while preserving 2.32 billion English pairs.[3]
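The similarity cutoff itself is simple to express. The sketch below applies the published thresholds to precomputed (and here hypothetical) embedding vectors; the real pipeline computes CLIP embeddings on GPUs at scale:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_filter(pairs, threshold=0.28):
    """Keep only pairs whose image/text embeddings clear the threshold.

    `pairs` is an iterable of (image_embedding, text_embedding, metadata);
    LAION-5B used 0.28 for English and 0.26 for other languages.
    """
    return [meta for img, txt, meta in pairs
            if cosine_similarity(img, txt) >= threshold]
```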
Post-filtering steps include bulk image downloads via tools like img2dataset, achieving 5.85 billion samples in one week on 10 nodes, followed by computation of ViT-L/14 embeddings on 32 A100 GPUs at 312 samples per second per GPU.[39] Safety classifiers tag NSFW content and watermarks, though these are not fully removed in the base dataset to maintain openness. The resulting datasets emphasize scalability and openness, enabling downstream AI training without proprietary curation.[39]
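The quoted embedding figures can be sanity-checked with back-of-envelope arithmetic: 32 GPUs at 312 samples per second each give roughly 10,000 samples per second in aggregate, so 5.85 billion pairs take on the order of a week, consistent with the download timeline above:

```python
# Back-of-envelope check on the embedding throughput quoted above.
gpus = 32
rate_per_gpu = 312                    # samples per second per A100
total_pairs = 5_850_000_000           # LAION-5B

cluster_rate = gpus * rate_per_gpu    # aggregate samples per second
days = total_pairs / cluster_rate / 86_400
print(f"{cluster_rate} samples/s -> {days:.1f} days")
```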