Behind the Canvas: The Multi-Modal Architecture of Semantic Art
An in-depth look at how we built a state-of-the-art visual discovery engine for the National Gallery of Art using Next.js, serverless GPU containers on Modal, FAISS vector search, and Qwen3-VL multi-modal reranking.

In traditional museum archives, searching for artwork is a frustrating exercise in matching tags. If you search for "a lonely boat in stormy seas under a dark, moonlit sky," you are completely at the mercy of whether a cataloger manually typed those exact keywords into a database years ago. If they simply labeled the painting "Marine view," it remains forever lost to your query.
Semantic Art was built to solve this. By marrying state-of-the-art Vision-Language Models (VLMs) with serverless GPU containerization, we built a visual discovery engine that "sees" art the way humans do—capturing style, medium, mood, color palette, and complex visual compositions.
In this post, we’ll take you behind the scenes of our multi-modal architecture, from the React front-end down to the high-performance GPU clusters running FAISS and Qwen3-VL.
The System Blueprint
At a high level, the application is designed as a split-responsibility architecture. A lightweight web client coordinates the user experience, while an autonomous serverless GPU backend acts as the "AI Brain."
graph TD
subgraph Client ["Next.js React Client Workspace"]
A["components/search-page-client.tsx<br>(User Search UI)"] -->|1. user query + AbortSignal| B["components/search-results-grid.tsx<br>(Infinite Scroll Render)"]
end
subgraph Gateway ["Next.js Secure API Middleware"]
C["app/api/search/route.ts<br>(Secure Rate-Limited Proxy)"]
end
subgraph Modal ["Modal Serverless GPU Brain"]
D["modal_apps/search.py<br>(Modal FastAPI App)"]
E["qwen3_vl_embedding.py<br>(Qwen3-VL Embedder)"]
F["faiss_index<br>(FAISS Vector Index)"]
G["qwen3_vl_reranker.py<br>(Qwen3-VL Reranker Model)"]
end
subgraph Museum ["U.S. National Gallery of Art"]
H["Museum IIIF Servers<br>(High-Res Artwork Images)"]
end
B -->|2. POST /api/search| C
C -->|3. POST /search (with AbortSignal propagation)| D
D -->|4. Get query vector| E
E -->|5. Vector search (K=64)| F
F -->|6. Retrieve candidate IDs| D
D -->|7. Fetch visual assets| H
H -->|8. High-Res Image stream| D
D -->|9. Cross-Attention Reranking| G
G -->|10. Blended scores (Top N=24)| D
D -->|11. JSON Results| C
C -->|12. Rendered Paintings| B
This divide-and-conquer strategy keeps our web server lightweight, inexpensive, and free of GPU credentials on the client side, while reserving expensive GPU compute solely for active queries.
1. The Middleman: Next.js & Abort-Aware Proxying
The frontend is built on Next.js (App Router) and styled using a curated cream and sepia palette (#f5f0e8) designed to mirror the physical texture of premium print journals.
When a user types a query, the React search interface talks to a Next.js API Route (/api/search). Rather than exposing our backend GPU credentials directly to the browser, the Next.js API acts as an intelligent proxy.
Defeating the "Detached Request" Double-GPU Bug
In multi-modal AI setups, GPU containers take time to process requests. If a user double-clicks the search button or closes their laptop mid-search, a naive proxy leaves the backend GPU running in a "detached" state—processing a query that has no listener, which wastes valuable compute budget.
To solve this, our proxy is fully Abort-Aware. When the browser cancels a request, the proxy picks up the resulting AbortSignal and propagates it down the fetch chain to Modal, preventing redundant GPU billing:
========================================================================================
ABORT-AWARE CLIENT-PROXY LIFECYCLE
========================================================================================
[User Click 1] ===(Request A)====> [Next.js Gateway] ===(Fetch A)===> [Modal GPU Worker]
|
[User Click 2] ===(Request B)====> [Next.js Gateway] ===(Fetch B)===> [Modal GPU Worker]
|
{Abort Signal A Heard!}
|
v
[Connection A Terminated]
X <=================== [Modal GPU A Aborted]
* Result: Request A is canceled at the API level, and the downstream
GPU worker is told to abort, so no further GPU time is spent on the abandoned query.
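The production proxy is a Next.js route, but the same cancellation pattern can be sketched in Python with asyncio. This is a minimal simulation under stated assumptions, not the production handler: `gpu_worker`, `gateway`, and `demo` are hypothetical names, and `asyncio.sleep` stands in for the embedding and reranking work. Cancelling the gateway task plays the role of the aborted browser request.

```python
import asyncio


async def gpu_worker(query: str, log: list[str]) -> list[str]:
    """Hypothetical stand-in for the Modal GPU worker."""
    log.append(f"start:{query}")
    try:
        await asyncio.sleep(0.2)  # placeholder for embedding + reranking
    except asyncio.CancelledError:
        # The abort reached the worker: record it and stop immediately.
        log.append(f"aborted:{query}")
        raise
    log.append(f"done:{query}")
    return [f"result for {query}"]


async def gateway(query: str, log: list[str]) -> list[str]:
    """Hypothetical proxy. Awaiting the worker directly means that
    cancelling this coroutine also cancels the downstream call."""
    return await gpu_worker(query, log)


async def demo() -> tuple[list[str], list[str]]:
    log: list[str] = []
    request_a = asyncio.create_task(gateway("A", log))
    await asyncio.sleep(0.01)  # Request A is now in flight
    request_a.cancel()         # second click: abort Request A
    request_b = asyncio.create_task(gateway("B", log))
    results = await request_b
    try:
        await request_a        # Request A ends as cancelled, not completed
    except asyncio.CancelledError:
        pass
    return log, results
```

Running `asyncio.run(demo())` returns a log in which Request A is marked as aborted and never as done, while Request B completes normally. The key design point is that the proxy never detaches the downstream call from the incoming request, so an abort has a path to travel along.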
2. The AI Brain: Serverless GPU Orchestration on Modal
At the core of the backend is Modal, an ultra-fast cloud container platform. Modal lets us define our infrastructure as Python code and scales up dedicated NVIDIA GPU instances in under 5 seconds to handle incoming search spikes, scaling back down to zero when idle.
Our Modal search application runs two stages in sequence: dense vector retrieval (Phase A, below) and vision-language reranking (Phase B, covered in section 3).
Phase A: Dense Vector Retrieval (FAISS)
We pre-processed the entire U.S. National Gallery of Art collection, generating high-dimensional multi-modal embeddings for each artwork. These embeddings are stored in a FAISS (Facebook AI Similarity Search) vector index, which performs high-speed nearest-neighbor search over L2-normalized vectors in microseconds.
========================================================================================
PHASE A: HIGH-SPEED VECTOR DISCOVERY
========================================================================================
+-----------------------+
| User Text Query |
| "winter storm, trees" |
+-----------------------+
|
v
+-----------------------+
| Qwen3-VL-2B Embedder |
| (Extracts Semantics) |
+-----------------------+
|
v
+-----------------------+
| 1536-Dimensional |
| Normalized Vector |
+-----------------------+
|
v
+-----------------------+
| FAISS Index Search |
| (L2 Distance Nearest) |
+-----------------------+
|
+---------------+
| |
v v
[Candidate 1] [Candidate 2] ... [Candidate 64]
Score: 0.89 Score: 0.87 Score: 0.72
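To make the retrieval step concrete, here is a minimal pure-Python sketch of exhaustive nearest-neighbor search over unit vectors. It is a stand-in for the kind of search a flat index performs, not FAISS itself, and the artwork IDs and three-dimensional vectors are made up for illustration. It also assumes that the similarity score is derived from the L2 distance: for unit vectors, the squared distance d² equals 2 − 2·cos, so the cosine similarity is 1 − d²/2.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(query: list[float], index: dict[str, list[float]], k: int):
    """Return the k closest (artwork_id, similarity) pairs, best first.

    Assumes every vector in `index` is already unit length, so that
    similarity = 1 - d^2 / 2 is the cosine similarity.
    """
    q = normalize(query)
    scored = [(art_id, 1.0 - squared_l2(q, vec) / 2.0)
              for art_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; the real vectors have 1536 dimensions.
index = {
    "storm-at-sea": normalize([0.9, 0.1, 0.0]),
    "winter-forest": normalize([0.1, 0.9, 0.2]),
    "still-life": normalize([0.0, 0.1, 0.9]),
}
hits = search([0.2, 0.8, 0.1], index, k=2)
```

Here `hits` ranks "winter-forest" first and "storm-at-sea" second. In the pipeline described above, the same step would return K=64 candidates rather than two.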
3. The Magic: Vision-Language Reranking
Standard vector search is incredibly fast, but it has a blind spot: it relies entirely on a single pre-computed vector. It struggles with fine-grained visual constraints, negative exclusions (e.g., "red painting but no blue"), or subtle atmospheric nuances.
To bridge this gap, we introduced a Multi-Modal Reranking Stage using the state-of-the-art Qwen/Qwen3-VL-Reranker-2B model.
========================================================================================
PHASE B: MULTI-MODAL CROSS-ATTENTION
========================================================================================
[ User Text Query ] ===+
|
v
+-----------------------+ Cross- +-----------------------+
| Qwen3-VL-Reranker VLM | <===Attention==> | Actual Art Image |
| (Text Reasoning) | Layers | (High-Res IIIF URL) |
+-----------------------+ +-----------------------+
|
v
[ Alignment Score ]
"Are the visual features of
winter and snow present?"
|
v
[ Reranking Score ]
Instead of evaluating pre-computed vectors, the VLM reranker takes the user's text query and the actual high-resolution artwork image (retrieved dynamically via the museum's high-speed IIIF image server API) and assesses them together.
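The reranking loop can be sketched as follows. This is an illustration under stated assumptions rather than the production code: `iiif_image_url`, `rerank`, and `toy_scorer` are hypothetical names, the base URL is a placeholder, and the scorer is a dummy standing in for the Qwen3-VL reranker. The URL layout follows the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`.

```python
from typing import Callable
from urllib.parse import quote


def iiif_image_url(base: str, identifier: str, max_px: int = 512) -> str:
    """Build a IIIF Image API URL for the full image, scaled to fit
    inside max_px x max_px ('!w,h' means best fit within that box)."""
    safe_id = quote(identifier, safe="")  # identifiers must be URL-encoded
    return f"{base.rstrip('/')}/{safe_id}/full/!{max_px},{max_px}/0/default.jpg"


def rerank(
    query: str,
    candidate_ids: list[str],
    score_pair: Callable[[str, str], float],
    base: str,
) -> dict[str, float]:
    """Score each (query, image) pair jointly, one candidate at a time."""
    return {
        cid: score_pair(query, iiif_image_url(base, cid))
        for cid in candidate_ids
    }


def toy_scorer(query: str, image_url: str) -> float:
    # Placeholder only: a real scorer would fetch the image and run the VLM.
    return 0.95 if "winter" in image_url else 0.10


scores = rerank(
    "winter storm, trees",
    ["winter-forest", "still-life"],
    toy_scorer,
    base="https://iiif.example.org/iiif",  # hypothetical server
)
```

Passing the scorer in as a callable keeps the loop independent of the model, which makes it easy to swap in a different reranker or to test the pipeline without a GPU.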
The Score-Blending Formula
To get both the speed of vector retrieval and the visual precision of the reranker, we blend the FAISS vector similarity score with the VLM reranker's score using a configurable blending coefficient ($\alpha$):
$$\text{Final Score} = \alpha \cdot \text{Reranker Score} + (1 - \alpha) \cdot \text{FAISS Normalized Similarity}$$
========================================================================================
SCORE BLENDING & RERANK INTEGRATION
========================================================================================
[ FAISS Sim Score: 0.89 ] =======( Weight: 40% )=======\
|====> [ Blended Score: 0.926 ]
[ VLM Reranker Score: 0.95 ] =====( Weight: 60% )=======/
* Note: These weights correspond to alpha = 0.6, so the VLM score carries the larger share
of the final ranking and the strongest visual and compositional matches rise to the top of the search grid.
This hybrid scoring keeps results anchored in embedding similarity across the whole collection, while the VLM's per-image judgment of the actual artwork refines the final order.
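As a worked check of the formula, the sketch below blends the two scores and keeps the top N results. It assumes both scores are already on a comparable 0 to 1 scale and uses α = 0.6, inferred from the 60/40 weights in the diagram. The function names and artwork IDs are hypothetical, and only the 0.89 and 0.95 pair comes from the diagram; the other reranker scores are invented for illustration.

```python
def blend(reranker_score: float, faiss_similarity: float,
          alpha: float = 0.6) -> float:
    """Final = alpha * reranker + (1 - alpha) * FAISS similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * reranker_score + (1.0 - alpha) * faiss_similarity


def top_n(candidates: dict[str, tuple[float, float]],
          alpha: float = 0.6, n: int = 24) -> list[tuple[str, float]]:
    """candidates maps artwork_id -> (faiss_similarity, reranker_score).
    Returns the n best (artwork_id, blended_score) pairs, best first."""
    blended = {
        art_id: blend(rerank_score, similarity, alpha)
        for art_id, (similarity, rerank_score) in candidates.items()
    }
    ranked = sorted(blended.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


candidates = {
    "storm-at-sea": (0.89, 0.95),   # the pair shown in the diagram
    "winter-forest": (0.87, 0.40),  # strong vector match, weak visual match
    "still-life": (0.72, 0.80),     # weaker vector match, strong visual match
}
ranking = top_n(candidates, alpha=0.6, n=2)
```

The first entry reproduces the diagram: 0.6 × 0.95 + 0.4 × 0.89 = 0.926. The second slot goes to "still-life" (0.768) rather than "winter-forest" (0.588), which shows how the reranker can promote a candidate that vector search alone ranked lower.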
Why This Matters
By building a multi-modal bridge between Next.js and Modal-hosted VLMs, Semantic Art delivers search quality that tag matching alone cannot reach:
- Visual Literacy: The engine successfully matches abstract concepts like "melancholy mood," "sharp high-contrast chiaroscuro," or "fluid brushstrokes."
- Zero-Cold-Starts: The system stays fast and highly responsive with proactive warming and aggressive client-side caching.
- Serverless Cost-Efficiency: We get the power of multi-billion parameter deep learning models on state-of-the-art GPUs, but pay only for the time in which users are actively searching.
Explore the collection yourself and see our multi-modal brain in action on the Semantic Art Search Page!