The Reranker: Why the First Vector Match Isn't Always Right
FAISS finds the neighborhood fast. The cross-encoder reranker finds the right result within it. Here's why two-stage retrieval produces better search quality than a single nearest-neighbor pass — and how Retrievals implements it.

The architecture post described Retrievals' two major paths: an offline index build and a live search runtime. What it mentioned but didn't fully explain is the reranking pass that sits between the FAISS search and the final results.
Reranking is not decoration. It's the part of the pipeline that makes the difference between a search that finds similar works and one that finds the right works.
The Problem with Pure Vector Search
FAISS is fast. It can search 68,000 vectors in milliseconds using approximate nearest-neighbor algorithms. But "nearest in vector space" is not the same as "most relevant to your query."
The embedding model — Qwen3-VL-Embedding-2B — produces 1024-dimensional vectors that encode a compressed representation of each image or text input. Compression means information loss. Two artworks might be close in vector space because they share a dominant color, a compositional structure, or a subject category, even if one is much better aligned with the specific nuance of your query.
Consider the query: "old man reading by candlelight."
FAISS will return a window of candidates that are geometrically close: figures with books, figures in warm light, aged subjects in interior settings. Some of those will be excellent matches. Others will be compositionally similar but wrong for the query: a young scholar in bright daylight, or an old man who happens to be holding a candle in a painting about something else entirely. The vectors are close; the relevance isn't.
flowchart TD
A["Query: 'old man reading by candlelight'"]
B["Query embedding\n1024D vector"]
C[("FAISS ANN search")]
D["Top 80 candidates\ngeometric neighbors"]
E["Some excellent matches\nSome false positives"]
A --> B
B --> C
C --> D
D --> E
classDef query fill:#f5f0e8,stroke:#a86845,color:#2c2926;
classDef model fill:#ebe5d9,stroke:#6f685f,color:#2c2926;
classDef index fill:#fcf9f2,stroke:#2c2926,color:#2c2926,stroke-width:2px;
classDef result fill:#fffaf0,stroke:#a86845,color:#2c2926;
class A,B query;
class C index;
class D,E result;
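To make the gap between "nearest" and "relevant" concrete, here is a minimal sketch of what stage 1 computes. It is not Retrievals' code: the three-dimensional vectors, the artwork titles, and the `top_k` helper are invented for illustration, and a brute-force cosine scan stands in for FAISS.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, index, k):
    """Brute-force nearest neighbors: the k titles with the highest cosine."""
    ranked = sorted(index, key=lambda title: cosine(query_vec, index[title]),
                    reverse=True)
    return ranked[:k]

# Hypothetical axes: (old subject, reading, candlelight).
index = {
    "old man reading by candle": (0.9, 0.9, 0.9),
    "young scholar in daylight": (0.1, 0.9, 0.2),
    "old man holding a candle": (0.9, 0.1, 0.9),
    "harbor at noon": (0.0, 0.0, 0.1),
}
query = (1.0, 1.0, 1.0)

neighbors = top_k(query, index, k=3)
print(neighbors)
```

The unrelated harbor scene is filtered out, but both partial matches stay in the window next to the true match. Distance alone narrows the field; it does not say which neighbor actually answers the query.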
What a Cross-Encoder Does
A cross-encoder is a model that takes two inputs simultaneously — in this case, the query and a candidate image — and produces a relevance score for that specific pair. Unlike the embedding model, which encodes query and image independently and then measures their distance, the cross-encoder can attend to the relationship between them.
This is slower. You can't precompute cross-encoder scores for every query-image pair the way you can precompute image embeddings. But you don't need to run it over the entire collection — only over the candidate window that FAISS already narrowed down.
flowchart LR
A["Query"]
B[("FAISS\ntop 80 candidates")]
C["Cross-encoder\nQwen3-VL image mode"]
D["Relevance scores\nper candidate"]
E["Reranked top 64"]
A --> B
B --> C
A --> C
C --> D
D --> E
classDef query fill:#f5f0e8,stroke:#a86845,color:#2c2926;
classDef index fill:#fcf9f2,stroke:#2c2926,color:#2c2926,stroke-width:2px;
classDef reranker fill:#ebe5d9,stroke:#a86845,color:#2c2926,stroke-width:2px;
classDef result fill:#fffaf0,stroke:#6f685f,color:#2c2926;
class A query;
class B index;
class C reranker;
class D,E result;
The two-stage approach gets the best of both: FAISS gives you speed and scale, the cross-encoder gives you precision within the candidate window.
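The structural difference between the two scoring styles can be sketched in a few lines. Everything here is illustrative: `embed` and `score_pair` are toy stand-ins (word-overlap functions over text captions rather than a neural model over images), chosen only to show which work can be done ahead of time and which cannot.

```python
def embed(text, vocab):
    """Toy bi-encoder: encode one input on its own as a bag-of-words vector."""
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocab]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def score_pair(query, caption):
    """Toy cross-encoder: sees both inputs at once, so it can reward
    adjacent query words that are also adjacent in the caption."""
    q, c = query.lower().split(), caption.lower().split()
    caption_bigrams = set(zip(c, c[1:]))
    return sum(1.0 for bigram in zip(q, q[1:]) if bigram in caption_bigrams)

vocab = ["old", "man", "young", "reading", "candlelight"]
captions = [
    "young man reading near an old candlelight lamp",
    "old man reading by candlelight",
]

# Offline: caption vectors are computed once, before any query arrives.
caption_vecs = [embed(c, vocab) for c in captions]

# Query time: one query encoding, then cheap dot products.
query = "old man reading by candlelight"
bi_scores = [dot(embed(query, vocab), v) for v in caption_vecs]

# Query time: one scoring call per (query, caption) pair; nothing to precompute.
pair_scores = [score_pair(query, c) for c in captions]

print(bi_scores, pair_scores)
```

In this toy setup the independent encodings tie, because both captions contain the same query words. The pair scorer separates them because it looks at how the words relate. A real cross-encoder is far richer than bigram matching, but the shape of the trade-off is the same.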
Retrievals' Reranking Implementation
Retrievals uses Qwen3-VL in image mode as the cross-encoder. The embedding pass (Qwen3-VL-Embedding-2B) and the reranking pass come from the same model family, so the two stages should judge images and text in broadly consistent ways.
The reranker runs on the same Modal L40S GPU container as the query embedding. After FAISS returns the top-80 candidate window, each candidate image is scored against the original query. The candidates are then re-sorted by that score, and the top 64 are returned as the first page of results.
flowchart TD
A["User query"]
subgraph Retrieval["Stage 1 — Retrieval"]
B["Qwen3-VL-Embedding-2B\nquery embedding"]
C[("FAISS IVFFlat\ntop 80 candidates")]
end
subgraph Reranking["Stage 2 — Reranking"]
D["Qwen3-VL\ncross-encoder scoring"]
E["Reorder by\nrelevance score"]
end
F["Final results\ntop 64, reranked"]
A --> B
B --> C
C --> D
A --> D
D --> E
E --> F
classDef query fill:#f5f0e8,stroke:#a86845,color:#2c2926;
classDef stage1 fill:#ebe5d9,stroke:#6f685f,color:#2c2926;
classDef stage2 fill:#fcf9f2,stroke:#a86845,color:#2c2926;
classDef result fill:#fffaf0,stroke:#2c2926,color:#2c2926,stroke-width:2px;
class A query;
class B,C stage1;
class D,E stage2;
class F result;
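Put together, the control flow is small. The sketch below assumes two hypothetical callables, `search_index(query, k)` for the FAISS lookup and `score_pair(query, candidate)` for the reranker. Neither name comes from the Retrievals codebase; only the window of 80 and the page size of 64 are taken from the description above.

```python
CANDIDATE_WINDOW = 80  # candidates requested from stage 1 (figure from the post)
PAGE_SIZE = 64         # results returned after reranking (figure from the post)

def two_stage_search(query, search_index, score_pair,
                     window=CANDIDATE_WINDOW, page_size=PAGE_SIZE):
    """Retrieve a bounded window, score every candidate against the query,
    and return the best `page_size` candidates by reranker score."""
    candidates = search_index(query, window)                   # stage 1
    scored = [(score_pair(query, c), c) for c in candidates]   # stage 2
    # list.sort is stable, so tied scores keep their stage-1 order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:page_size]]

# Stand-ins so the sketch runs without a model or an index.
def fake_index(query, k):
    return list(range(k))      # candidate ids 0..k-1, in stage-1 order

def fake_scorer(query, candidate):
    return candidate % 10      # arbitrary placeholder score

results = two_stage_search("old man reading by candlelight",
                           fake_index, fake_scorer)
print(len(results), results[:3])
```

Keeping the retriever and the scorer as plain callables means either stage can be swapped or mocked in tests without touching the control flow.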
When Reranking Changes the Results
The effect is most visible on queries with semantic nuance that the vector distance doesn't fully capture.
Compositional mimics. Some artworks look similar at the vector level — same lighting style, same subject category — but differ in the specific quality the query is asking for. "Old woman with a melancholy expression" and "old woman in a domestic interior" might surface overlapping candidates from FAISS; the cross-encoder separates them because it can evaluate the expression directly against the query.
Multi-condition queries. "Blue and gold, Byzantine, devotional" combines color, period, and function. FAISS finds the geometric neighborhood of that combination. The reranker scores each candidate on how well it satisfies all three conditions simultaneously, not just how close it sits to a single vector that blends them.
Abstract mood queries. "The weight of grief" is abstract enough that vector distance is only a rough signal. The reranker, attending to the full query-image relationship, can surface the works in the candidate window where the visual evidence of that mood is strongest.
In practice, the first two to four positions shift most noticeably. Results that were geometrically close but relevance-wrong fall back; results that were slightly further in vector space but semantically precise move forward.
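One way to see this effect on a given query is to compare each result's position before and after reranking. A minimal sketch with made-up artwork ids; `rank_shifts` is a hypothetical helper, not part of Retrievals:

```python
def rank_shifts(stage1_order, reranked_order):
    """Map each reranked item to (old position - new position).
    Positive means the item moved up; items the reranker dropped are omitted."""
    old_pos = {item: i for i, item in enumerate(stage1_order)}
    return {item: old_pos[item] - new_pos
            for new_pos, item in enumerate(reranked_order)}

stage1 = ["a", "b", "c", "d", "e"]   # order returned by the vector search
reranked = ["c", "a", "b", "d"]      # order after reranking, truncated to 4

shifts = rank_shifts(stage1, reranked)
print(shifts)
```

Here "c" climbs two places while "a" and "b" each slip one. Logging these deltas over real queries is a cheap way to check where the reranker is actually changing the page.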
The Cost Trade-off
Reranking adds latency. Running Qwen3-VL over 80 image-query pairs takes more time than a single FAISS lookup. On Modal L40S hardware with the model already loaded, the reranking pass adds roughly 0.8–1.2 seconds to the search request.
For an application where search quality is the point — where "close enough" isn't the standard — that trade-off is worth it. The alternative is faster results that more often miss what the query was actually asking for.
The architecture is designed to make this cost predictable: FAISS bounds the candidate window, the reranker operates on a fixed set, and the final result set is capped at 64. There's no feedback loop that could cause the reranker to run unboundedly long.
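A back-of-the-envelope check on the figures above: dividing the quoted 0.8 to 1.2 seconds by the 80 candidates gives an amortized per-pair cost. This is arithmetic only, not a measurement, and it says nothing about how the pairs are batched on the GPU.

```python
def per_pair_ms(total_seconds, pairs):
    """Amortized reranking cost per query-image pair, in milliseconds."""
    return total_seconds * 1000 / pairs

low = per_pair_ms(0.8, 80)
high = per_pair_ms(1.2, 80)
print(f"{low:.0f} to {high:.0f} ms per pair")
```

That works out to roughly 10 to 15 ms per pair. Because the window is fixed at 80, the added latency has a ceiling that does not grow with the size of the collection.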
The full pipeline — embedding, FAISS, reranking — produces a search system where what you describe and what you get are as closely aligned as the current state of multimodal models allows.