Behind the Canvas: The Embedding Architecture of Retrievals
How Retrievals turns the National Gallery of Art image collection into a searchable embedding index using Modal L40S GPUs, Qwen3-VL-Embedding-2B, and FAISS vector search.

Traditional artwork search is usually bounded by catalog fields: title, artist, medium, year, and a small amount of descriptive metadata. Retrievals is built around a different primitive. It embeds images and queries into the same semantic space, then uses vector search to find artworks by visual meaning rather than exact keywords.
The current system is intentionally direct. Search quality comes from three pieces: the precomputed embeddings of the image dataset, the query embedding generated at request time, and a FAISS index tuned for fast nearest-neighbor retrieval. The API returns the query vector with the first page of results, so the app can keep paginating without recomputing the embedding.
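Because images and queries share one embedding space, "searching by visual meaning" reduces to geometry. After L2-normalization, the inner product of two vectors equals their cosine similarity, and nearest-neighbor search returns the highest-scoring rows. The sketch below is a dependency-free stand-in for an exact inner-product index such as FAISS `IndexFlatIP`. The article does not say which FAISS index type Retrievals uses, and the toy 3-dimensional vectors replace the real 1024-dimensional embeddings.

```python
import heapq
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query: list[float], corpus: list[list[float]], k: int) -> list[tuple[int, float]]:
    """Exact nearest-neighbor search by cosine similarity, best match first.

    Brute-force illustration of what an exact FAISS inner-product index
    computes; this is not Retrievals' own code.
    """
    q = l2_normalize(query)
    scores = (
        (row_id, sum(a * b for a, b in zip(q, l2_normalize(vec))))
        for row_id, vec in enumerate(corpus)
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy 3-D "embeddings" standing in for 1024-D Qwen3-VL vectors.
corpus = [
    [0.9, 0.1, 0.0],  # row 0
    [0.0, 1.0, 0.0],  # row 1
    [0.7, 0.7, 0.1],  # row 2
]
print([row_id for row_id, _ in top_k([1.0, 0.2, 0.0], corpus, k=2)])  # [0, 2]
```

Normalizing both sides means vector magnitude never affects ranking. Only direction in the shared space matters.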
The System Blueprint
Retrievals has two major paths. The offline path builds the artwork embedding index once from the National Gallery of Art image dataset. The online path normalizes each text query, embeds it into a 1024-dimensional vector, and searches that prebuilt index.
```mermaid
flowchart LR
subgraph Offline["Offline index build"]
A["NGA images"]
B["Qwen3-VL image embeddings"]
C[("FAISS index")]
end
subgraph Online["Live search"]
D["Text query"]
E["Qwen3-VL query embedding"]
F["FAISS match"]
G["64-result window"]
H["16-result pages"]
end
A --> B
B --> C
D --> E
E --> F
C --> F
F --> G
G --> H
classDef source fill:#f5f0e8,stroke:#a86845,color:#2c2926;
classDef model fill:#ebe5d9,stroke:#6f685f,color:#2c2926;
classDef index fill:#fcf9f2,stroke:#2c2926,color:#2c2926,stroke-width:2px;
classDef result fill:#fffaf0,stroke:#a86845,color:#2c2926;
class A,D source;
class B,E model;
class C,F index;
class G,H result;
```
This split keeps the expensive image embedding work out of the request path. The live app embeds the user's query once, compares it against the stored vectors, and returns matching artwork metadata and image URLs. Follow-up pages reuse the returned query vector.
1. Embedding the Image Dataset
The dataset pipeline starts with open artwork records and image assets from the National Gallery of Art. Each eligible artwork is normalized into a compact metadata record and paired with a usable image source. The image is then embedded with Qwen3-VL-Embedding-2B, producing a vector representation of its visual content.
Those vectors become the search corpus, and metadata stays attached to each vector so the app can return a title, artist, date, thumbnail, normalized IIIF image URL, and artwork URL without asking the model to do more work during search.
```mermaid
flowchart TD
A["Artwork<br/>record"]
B["IIIF<br/>image"]
C["Clean<br/>input"]
D["Qwen3-VL<br/>image vector"]
E[("1024D vector +<br/>metadata")]
F[("FAISS<br/>corpus")]
A --> B
B --> C
C --> D
D --> E
A -. "metadata" .-> E
E --> F
classDef record fill:#fcf9f2,stroke:#a86845,color:#2c2926;
classDef image fill:#f5f0e8,stroke:#d1c7b7,color:#2c2926;
classDef model fill:#ebe5d9,stroke:#6f685f,color:#2c2926;
classDef artifact fill:#fffaf0,stroke:#2c2926,color:#2c2926,stroke-width:2px;
class A record;
class B,C image;
class D model;
class E,F artifact;
```
The important design choice is that image understanding happens before users search. The live application does not need to inspect every image in real time; it compares against a precomputed semantic map of the collection.
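The offline build can be sketched as a loop over artwork records. It pairs each usable record with its image vector and keeps metadata in a separate object map, where position `i` matches row `i` of the vector corpus. Everything here is hypothetical and illustrates the idea, not the project's schema:

- the record field names (`object_id`, `title`, `artist`, `date`, `iiif_url`);
- the `embed_image` callable standing in for Qwen3-VL-Embedding-2B;
- the rule that records without an image URL are skipped.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ArtworkMeta:
    """Compact metadata kept next to each vector (field names are hypothetical)."""
    object_id: str
    title: str
    artist: str
    date: str
    image_url: str


def build_corpus(
    records: Iterable[dict],
    embed_image: Callable[[str], list[float]],
) -> tuple[list[list[float]], list[ArtworkMeta]]:
    """Embed each eligible record once; vectors[i] belongs to object_map[i]."""
    vectors: list[list[float]] = []
    object_map: list[ArtworkMeta] = []
    for rec in records:
        url = (rec.get("iiif_url") or "").strip()
        if not url:
            continue  # assumption: records without a usable image are skipped
        vec = embed_image(url)  # expensive model call, paid offline only
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            continue  # assumption: degenerate embeddings are dropped
        vectors.append([x / norm for x in vec])  # unit length for cosine search
        object_map.append(ArtworkMeta(
            object_id=str(rec["object_id"]),
            title=rec.get("title") or "Untitled",
            artist=rec.get("artist") or "Unknown",
            date=rec.get("date") or "",
            image_url=url,
        ))
    return vectors, object_map


# Usage with a placeholder embedder in place of Qwen3-VL-Embedding-2B.
def fake_embed(url: str) -> list[float]:
    return [3.0, 4.0]


records = [
    {"object_id": 7, "title": "Seascape", "artist": "A", "iiif_url": "https://example.org/iiif/7"},
    {"object_id": 8, "title": "Sketch", "artist": "B", "iiif_url": ""},
]
vectors, object_map = build_corpus(records, fake_embed)
print(len(vectors), object_map[0].title)  # 1 Seascape
```

Keeping the object map positionally aligned with the vectors is what lets a search hit, which is just a row id, be turned into a title, artist, and image URL without touching the model again.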
2. Modal GPU Runtime
The live search runtime runs on Modal using NVIDIA L40S GPUs. That matters because the live path still has to host the Qwen3-VL embedding model, embed each user query, and keep the FAISS artifacts available for fast lookup.
Modal gives the project a Python-defined runtime for GPU work without keeping fixed infrastructure online all the time. The search container mounts the serving artifacts, loads the Qwen3-VL embedder, reads the FAISS index and object map, and serves authenticated FastAPI endpoints for the Next.js app.
```mermaid
flowchart LR
A["Modal<br/>app image"]
B{{"L40S GPU"}}
C["Qwen3-VL<br/>Embedding 2B"]
D[("FAISS index +<br/>object map")]
E["FastAPI<br/>/search"]
F["FastAPI<br/>/search-more"]
A --> B
B --> C
D --> E
D --> F
C --> E
classDef runtime fill:#f5f0e8,stroke:#6f685f,color:#2c2926;
classDef gpu fill:#fffaf0,stroke:#a86845,color:#2c2926,stroke-width:2px;
classDef model fill:#ebe5d9,stroke:#a86845,color:#2c2926;
classDef artifact fill:#fcf9f2,stroke:#2c2926,color:#2c2926,stroke-width:2px;
class A,E,F runtime;
class B gpu;
class C model;
class D artifact;
```
This keeps the architecture practical. In the live path, GPU work is concentrated where it matters most: generating the query embedding. The rest of the request is a vector lookup over already-built artifacts.
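The split between per-container and per-request work can be sketched in plain Python. The model and index are loaded once when the container starts, and each request only embeds the query and runs a lookup. On Modal, that loading would typically sit in a method marked `@modal.enter()` inside a class declared with `@app.cls(gpu="L40S")`. The sketch leaves Modal, FastAPI, and the real model out so the lifecycle is visible on its own. The `load_embedder` callable, the string metadata, and the brute-force scoring (in place of FAISS) are hypothetical.

```python
from typing import Callable

Embedder = Callable[[str], list[float]]


class SearchService:
    """Per-container state for the live search path (illustrative sketch)."""

    def __init__(self, load_embedder: Callable[[], Embedder],
                 vectors: list[list[float]], object_map: list[str]) -> None:
        # Expensive setup runs once, when the container starts.
        self.embed = load_embedder()
        self.vectors = vectors        # precomputed, L2-normalized image vectors
        self.object_map = object_map  # row i -> metadata for vectors[i]

    def search(self, query: str, k: int) -> dict:
        """Embed the query once, score every row, and return hits plus the vector."""
        query_vector = self.embed(query)  # the only model call per request
        scores = [
            (sum(a * b for a, b in zip(query_vector, vec)), row)
            for row, vec in enumerate(self.vectors)
        ]
        scores.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
        return {
            "results": [self.object_map[row] for _, row in scores[:k]],
            "query_vector": query_vector,  # lets the client page without re-embedding
        }
```

Any number of requests served by the same instance pay the model-loading cost only once.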
3. Query Embedding and Vector Search
When someone searches Retrievals, the query is embedded by the same Qwen3-VL model family used for the artwork images, so text and images land in the same vector space. Take this query:
"a lonely boat in stormy seas under a dark moonlit sky"
That text is embedded into a 1024-dimensional vector. FAISS returns a window of top candidates, and the API returns up to 64 initial results together with the query vector. The UI can then request additional 16-result pages, drawn from a raw window of up to 256 results.
```mermaid
flowchart TD
A["User query"]
B["1024D query vector"]
C[("FAISS top 80")]
D["Initial 64 results"]
E["More pages: 16 at a time"]
A --> B
B --> C
C --> D
B --> E
C --> E
classDef query fill:#f5f0e8,stroke:#a86845,color:#2c2926;
classDef index fill:#fcf9f2,stroke:#2c2926,color:#2c2926,stroke-width:2px;
classDef result fill:#ebe5d9,stroke:#6f685f,color:#2c2926;
class A,B query;
class C index;
class D,E result;
```
As described here, the current ranking is driven by vector search: embed the query, search FAISS, return the nearest artworks, and reuse the query vector for pagination.
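The pagination contract can be sketched as follows. The constants 64, 16, and 256 come from the article; the rest is an assumption:

- A follow-up page is modeled as re-ranking with the returned query vector and slicing the raw window at a hypothetical `offset`.
- A brute-force dot product stands in for FAISS.

The article does not say whether the server caches the window or re-queries the index, and the 80-candidate first search shown in the diagram is not modeled here.

```python
INITIAL_RESULTS = 64  # size of the first response (from the article)
PAGE_SIZE = 16        # size of each follow-up page (from the article)
RAW_WINDOW = 256      # maximum raw result window (from the article)


def rank_rows(query_vector: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Row ids of the k best matches by inner product (stand-in for FAISS)."""
    scored = [
        (sum(a * b for a, b in zip(query_vector, vec)), row)
        for row, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [row for _, row in scored[:k]]


def first_page(query_vector: list[float], vectors: list[list[float]]) -> list[int]:
    """Initial response: the top INITIAL_RESULTS rows."""
    return rank_rows(query_vector, vectors, INITIAL_RESULTS)


def more_page(query_vector: list[float], vectors: list[list[float]], offset: int) -> list[int]:
    """Follow-up page: PAGE_SIZE rows starting at offset, never past RAW_WINDOW.

    Takes the vector returned by the first call, so no embedding is recomputed.
    """
    if offset >= RAW_WINDOW:
        return []
    window = rank_rows(query_vector, vectors, RAW_WINDOW)
    return window[offset:min(offset + PAGE_SIZE, RAW_WINDOW)]
```

If follow-up pages start where the first response ends, the window supports (256 - 64) / 16 = 12 follow-up pages before `more_page` returns an empty list.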
Why This Matters
Retrievals works because the heavy visual interpretation is moved into the dataset layer. Once the image collection has been embedded, the live app can be small and responsive while still searching by visual meaning.
- Image-first retrieval: Artwork images are embedded directly, so the index can capture visual composition, style, color, and mood.
- Fast online search: The request path embeds the query once, searches the FAISS index, and returns the query vector for pagination.
- GPU where it counts: Modal L40S containers handle model-heavy query embedding without turning the web app into GPU infrastructure.
- Predictable pagination: The first response can return up to 64 results; additional pages use the same query vector in 16-result slices up to the 256-result raw window.
Explore the collection yourself on the Retrievals search page.