68,000 Objects, No Keywords: The NGA Open Collection Explained
The National Gallery of Art released its open-access collection data as CC0: no rights reserved, machine-readable, free to use for any purpose. Here's what's actually in it, what makes it unusual, and why it's the right dataset to build semantic search on.

In 2019, the National Gallery of Art released its collection data under a Creative Commons Zero license. CC0 means no rights reserved — not just free to use, but explicitly placed in the public domain. The dataset includes structured metadata for every object in the open-access collection, high-resolution image access through the NGA's IIIF server, and a GitHub repository that is updated as the collection grows.
Most people who visit retrievals.app don't know this exists. This post explains what the dataset is, what's in it, and why it's the right foundation for semantic art search.
What CC0 Actually Means
Most museum open-access licenses come with conditions: attribution required, non-commercial use only, no derivative works. CC0 removes all of those. The NGA's data can be used in any project, for any purpose, without attribution, without restriction.
For a machine learning project, this matters enormously. Training embedding models on copyrighted images can create legal exposure. Building a search index on a collection whose access might be revoked creates operational exposure. CC0 removes both: the dataset is stable, legally clean, and explicitly intended for reuse.
The NGA also publishes the data on GitHub, which means changes are versioned. New acquisitions appear in the dataset. Metadata corrections are trackable. The index Retrievals builds from this data can be regenerated from a known, public source.
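One way to make that regeneration concrete is to record a fingerprint of the exact data snapshot an index was built from. The sketch below is illustrative only: the column names (`objectid`, `title`, `medium`) and the two sample rows are placeholders, not the NGA's confirmed schema, and a real build would read the CSV files from a checkout of the repository instead of an inline string.

```python
import csv
import hashlib
import io

# Hypothetical sample standing in for a collection CSV export.
# Column names and rows are assumptions for illustration only.
SAMPLE_CSV = """objectid,title,medium
1,Example Title A,oil on canvas
2,Example Title B,engraving
"""


def load_objects(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into one dict per object row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def fingerprint(csv_text: str) -> str:
    """Return a SHA-256 hex digest that identifies this exact snapshot."""
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()


objects = load_objects(SAMPLE_CSV)
snapshot_id = fingerprint(SAMPLE_CSV)
print(f"{len(objects)} objects, snapshot {snapshot_id[:12]}")
```

Storing the digest next to the built index would let a later rebuild confirm that it started from the same source data, or show that the source has changed since.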
What's in the Collection
The dataset covers 68,816 objects as of the current index build. These span roughly seven centuries, from 13th-century Italian panel paintings to 20th-century American photographs.
The collection has particular depth in several areas:
Dutch and Flemish Golden Age (17th century)
One of the NGA's strongest holdings. Rembrandt, Vermeer, de Hooch, Rubens, van Dyck, Hals, Leyster, Steen. Both paintings and an extensive print and drawing collection. This is the period where Retrievals' semantic search tends to surface the most surprising connections: the tradition is large enough and internally coherent enough that mood-based queries reliably find related works.
French Impressionism and Post-Impressionism
Monet, Manet, Degas, Renoir, Cézanne, Seurat, van Gogh. The NGA's Impressionist holdings are among the best in the United States. Landscape queries in particular tend to surface strong results from this period.
American Art (18th–20th century)
The NGA has an exceptional American collection: Hudson River School landscapes (Cole, Church, Bierstadt), Sargent portraits, Winslow Homer, Mary Cassatt, Edward Hopper. Queries about American light, the particular quality of 19th-century American landscape painting, reliably surface this material.
Works on Paper
A significant fraction of the 68,816 objects are prints, drawings, and watercolors rather than oil paintings. This gives the collection unusual range. A query about "precise line, botanical subject" surfaces Dutch botanical engravings. "Gestural ink sketch" finds drawings that a painting-only collection wouldn't have.
Italian Renaissance and Baroque
Raphael, Leonardo (studies), Titian, Caravaggio, Tiepolo. Not as deep as the Dutch holdings, but well-represented.
What's Not in the Collection
Understanding the gaps matters for knowing what Retrievals can and can't find.
Modern and contemporary works under copyright. The CC0 release covers public-domain works. Anything created after roughly 1927 by a living or recently deceased artist is typically absent. The NGA holds works by Rothko, de Kooning, and others that don't appear in the open-access dataset.
Decorative arts and three-dimensional objects. The NGA's collection includes furniture, textiles, and sculpture, but these are underrepresented in the image-indexed portion of the open-access dataset. The semantic index is strongest on two-dimensional works.
Non-image records. Some objects in the dataset have incomplete image records — the metadata exists but no usable image was available at index build time. These objects are excluded from search.
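The exclusion of non-image records can be pictured as a simple join between object metadata and image records. This is a minimal sketch, not the actual Retrievals code: the field names `object_id` and `iiif_url`, the sample rows, and the `example.org` URL are all hypothetical.

```python
def first_usable_image(images):
    """Map each object id to its first image record with a non-empty URL."""
    by_object = {}
    for image in images:
        url = (image.get("iiif_url") or "").strip()
        if url:
            # setdefault keeps the first usable image seen for an object
            by_object.setdefault(image["object_id"], url)
    return by_object


def searchable_objects(objects, images):
    """Keep only objects that have a usable image, attaching its URL."""
    by_object = first_usable_image(images)
    return [
        {**obj, "image_url": by_object[obj["object_id"]]}
        for obj in objects
        if obj["object_id"] in by_object
    ]


# Hypothetical sample data: object 2 has an empty URL, object 3 has no record.
objects = [
    {"object_id": "1", "title": "Example A"},
    {"object_id": "2", "title": "Example B"},
    {"object_id": "3", "title": "Example C"},
]
images = [
    {"object_id": "1", "iiif_url": "https://iiif.example.org/iiif/img-1"},
    {"object_id": "2", "iiif_url": ""},
]

kept = searchable_objects(objects, images)
print([obj["object_id"] for obj in kept])  # ['1']
```

Objects 2 and 3 keep their metadata in the source dataset; they simply never reach the embedding step.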
flowchart TD
    A["NGA collection\n~150,000 total objects"]
    B["Open-access dataset\n68,816 objects"]
    C["Works with\nusable images"]
    D["Retrievals\nsearch index"]
    A --"CC0 public domain"--> B
    B --"image available"--> C
    C --"Qwen3-VL\nembedded"--> D
    classDef full fill:#f5f0e8,stroke:#d1c7b7,color:#2c2926;
    classDef open fill:#ebe5d9,stroke:#6f685f,color:#2c2926;
    classDef image fill:#fcf9f2,stroke:#a86845,color:#2c2926;
    classDef index fill:#fffaf0,stroke:#2c2926,color:#2c2926,stroke-width:2px;
    class A full;
    class B open;
    class C image;
    class D index;
Why This Dataset Is Unusual
Most museum open-access releases are partial. A selection of highlights. A sample of the collection. The NGA released everything in its public-domain holdings, which means the dataset includes not just famous works but the full depth of each tradition: the studies, the lesser-known artists, the prints that never make it onto gallery walls but are part of the visual culture that the famous paintings came from.
That depth is what makes semantic search interesting. A query about "Baroque chiaroscuro" doesn't just return Caravaggio. It returns the northern European painters who worked in the same tradition, the prints that disseminated those techniques, and the later artists who responded to them. The famous works are there, but so is the context around them.
The NGA also maintains IIIF endpoints for high-resolution image access. Retrievals stores thumbnails on Cloudflare R2 for fast search results, but every result links back to the NGA's own IIIF server for the full-resolution image. The collection's canonical home stays with the institution; Retrievals provides a new way into it.
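The split between thumbnails and full-resolution images maps onto how IIIF Image API URLs are built: a base identifier followed by `{region}/{size}/{rotation}/{quality}.{format}`. The sketch below shows that URL pattern only. The base URL is a made-up placeholder, the helper names are hypothetical, and the default `full` size keyword follows version 2 of the Image API (version 3 uses `max`), so check which version a given server implements.

```python
def iiif_image_url(base_url, region="full", size="full",
                   rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API URL from a base identifier URL."""
    return f"{base_url.rstrip('/')}/{region}/{size}/{rotation}/{quality}.{fmt}"


def thumbnail_url(base_url, max_side=256):
    """Request the largest image that fits in a max_side x max_side box.

    The "!w,h" size form preserves the aspect ratio.
    """
    return iiif_image_url(base_url, size=f"!{max_side},{max_side}")


# Placeholder identifier, not a real NGA image.
base = "https://iiif.example.org/iiif/abc123"

print(thumbnail_url(base))
# https://iiif.example.org/iiif/abc123/full/!256,256/0/default.jpg
print(iiif_image_url(base))
# https://iiif.example.org/iiif/abc123/full/full/0/default.jpg
```

Because the size is part of the URL, a search result can show a small cached thumbnail while linking to the same identifier at full size on the institution's server.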
The Index Retrievals Builds
From the 68,816-object dataset, the offline pipeline:
- Normalizes each record into a compact metadata object (title, artist, date, medium, dimensions, image source, artwork URL)
- Fetches the image from the NGA's IIIF server or Cloudflare R2
- Embeds the image through Qwen3-VL-Embedding-2B, producing a 1024-dimensional vector
- Stores the vector alongside the metadata in a FAISS IVFFlat index
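The normalization step in the list above might look something like the following. This is a sketch under stated assumptions rather than the pipeline's real code: the output fields come from the list, but the `ArtworkRecord` name, the raw input keys (`attribution`, `displaydate`, `image_url`, and so on), and the "Untitled" fallback are all hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ArtworkRecord:
    """Compact metadata stored alongside each vector."""
    title: str
    artist: Optional[str]
    date: Optional[str]
    medium: Optional[str]
    dimensions: Optional[str]
    image_source: str
    artwork_url: str


def _clean(value) -> Optional[str]:
    """Collapse runs of whitespace; return None for missing or empty values."""
    if value is None:
        return None
    text = " ".join(str(value).split())
    return text or None


def normalize_record(raw: dict) -> ArtworkRecord:
    """Turn a raw row (assumed key names) into a compact record."""
    return ArtworkRecord(
        title=_clean(raw.get("title")) or "Untitled",  # assumed fallback
        artist=_clean(raw.get("attribution")),
        date=_clean(raw.get("displaydate")),
        medium=_clean(raw.get("medium")),
        dimensions=_clean(raw.get("dimensions")),
        image_source=raw["image_url"],   # required: no image, no record
        artwork_url=raw["artwork_url"],  # required: link back to the museum
    )


record = normalize_record({
    "title": "  Example   Title ",
    "attribution": "",
    "medium": "oil on canvas",
    "image_url": "https://iiif.example.org/iiif/abc123",
    "artwork_url": "https://museum.example.org/objects/1",
})
print(asdict(record))
```

Making the image and artwork URLs required keys means a row without them fails loudly with a `KeyError` instead of producing a search result that cannot be displayed or traced back.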
The result is a semantic map of the NGA's public-domain holdings: every object with a usable image is represented as a point in a space where nearby points are visually and conceptually similar.
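To give a feel for how an inverted-file (IVF) flat index searches such a space, here is a toy pure-Python version. It is not FAISS and not the Retrievals implementation: the `ToyIVFFlat` class is hypothetical, the centroids are fixed by hand (FAISS learns them with k-means during training), and the vectors are two-dimensional stand-ins for 1024-dimensional embeddings.

```python
import math


def unit(v):
    """Scale a non-zero vector to length 1 so dot product equals cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyIVFFlat:
    """Toy inverted-file index: vectors are bucketed by nearest centroid."""

    def __init__(self, centroids):
        self.centroids = [unit(c) for c in centroids]
        self.lists = [[] for _ in centroids]  # one inverted list per centroid

    def _closest_cells(self, v, n):
        """Return indices of the n centroids most similar to v."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: dot(v, self.centroids[i]),
                       reverse=True)
        return order[:n]

    def add(self, item_id, vector):
        v = unit(vector)
        cell = self._closest_cells(v, 1)[0]
        self.lists[cell].append((item_id, v))

    def search(self, query, k=3, nprobe=1):
        """Scan only the nprobe closest cells, then rank by cosine."""
        q = unit(query)
        candidates = []
        for cell in self._closest_cells(q, nprobe):
            candidates.extend(self.lists[cell])
        scored = sorted(((dot(q, v), item_id) for item_id, v in candidates),
                        reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


index = ToyIVFFlat(centroids=[[1.0, 0.0], [0.0, 1.0]])
index.add("a", [0.9, 0.1])
index.add("b", [0.8, 0.3])
index.add("c", [0.1, 0.9])

near = index.search([1.0, 0.05], k=2, nprobe=1)
print([item_id for item_id, _ in near])  # ['a', 'b']
```

The trade-off is the usual one for IVF indexes: a small `nprobe` scans fewer vectors and is faster, but a relevant item that landed in an unprobed cell (here, "c") will not be returned.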
The architecture post covers this pipeline in detail. What matters here is the source: every vector in the index corresponds to a public-domain object from a public institution, available under a license that places no restrictions on use. The index is as open as the collection it was built from.