Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction

1University of Maryland College Park, 2University of California San Diego,
3Qualcomm AI 4Yale University 5Meta AI
ICLR 2026

PaPSP uses an external representation model and the CLIP score to enable selective prediction for VLM tasks, such as captioning, without training. MA-PaPSP augments this model with an external dataset, which is leveraged to estimate more stable proxy embeddings and better-calibrated contrastive scores. The figure shows an example where PaPSP fails but MA-PaPSP succeeds at rejecting an incorrect caption for the image shown. Also shown is the CIDEr score between the predicted and ground-truth captions.

Abstract

Selective prediction aims to endow predictors with a reject option, so as to avoid low-confidence predictions. However, the existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for vision-language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek low-complexity, training-free approaches applicable to any foundation model, and consider methods based on external vision-language embeddings, such as CLIP. We denote this as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, which leads to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory-augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs, and is complemented by contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification.
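For concreteness, the following is a minimal sketch of the plug-and-play idea, assuming CLIP from the Hugging Face transformers library as the external vision-language model; the model name and the threshold value are illustrative assumptions, not the paper's exact configuration.

# Minimal PaPSP-style sketch: accept a caption only if its CLIP image-text
# similarity clears a confidence threshold; otherwise abstain (reject).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def papsp_accepts(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    score = (img @ txt.T).item()
    return score >= threshold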

Approach

MA-PaPSP complements PaPSP with retrieval augmentation, as shown above. The corresponding image-text projections are leveraged to address the two problems mentioned in the abstract. Overall, the model retrieves relevant image-text pairs from a large retrieval set using the query (input) image, computes a proxy embedding, and contrasts it with the embeddings of negative captions. It comprises two blocks (see the sketch after this list):
  1. Proxy embedding: MA-PaPSP first estimates the ground-truth embedding as a weighted average of the retrieved sample embeddings, with weights given by the cosine similarity between the query image and the retrieved samples. The result, an embedding of the same dimension as the input image embedding, is called the proxy embedding.
  2. Contrastive scores: MA-PaPSP then contrasts the proxy embedding with negative captions via a softmax operation. For prediction tasks like classification and image-text matching, negative captions are readily available; for generation tasks like captioning, they are not, so MA-PaPSP curates them using either a simple WordNet-based procedure or a small language model. The result is a score that can be thresholded to choose between prediction and abstention.
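A minimal sketch of the two blocks, operating on precomputed, L2-normalized CLIP embeddings, follows. The top-k retrieval, the averaging of each retrieved pair's image and text embeddings, and the temperature are illustrative assumptions of this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def proxy_embedding(query_img: torch.Tensor,       # (d,) query image embedding
                    retrieval_imgs: torch.Tensor,  # (N, d) retrieval-set image embeddings
                    retrieval_txts: torch.Tensor,  # (N, d) paired caption embeddings
                    k: int = 16) -> torch.Tensor:
    # Block 1: similarity-weighted average of the k nearest retrieved pairs.
    sims = retrieval_imgs @ query_img              # cosine similarities (inputs normalized)
    vals, idx = sims.topk(k)
    w = vals.clamp(min=0)
    w = w / w.sum()                                # weights = normalized cosine similarities
    # Averaging image and text embeddings of each pair is an assumption here.
    pairs = 0.5 * (retrieval_imgs[idx] + retrieval_txts[idx])
    return F.normalize((w.unsqueeze(-1) * pairs).sum(dim=0), dim=-1)

def contrastive_score(proxy: torch.Tensor,         # (d,) proxy embedding
                      candidate: torch.Tensor,     # (d,) predicted-caption embedding
                      negatives: torch.Tensor,     # (M, d) negative-caption embeddings
                      temperature: float = 0.01) -> float:
    # Block 2: softmax the candidate's similarity to the proxy against the
    # negatives; the resulting probability is thresholded to predict or abstain.
    sims = torch.cat([candidate.unsqueeze(0), negatives]) @ proxy
    return F.softmax(sims / temperature, dim=0)[0].item()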
The top figure illustrates the two problems mentioned in the abstract: 1) instability of the visual-language representations (shown in a, b) and 2) poor calibration of similarity scores (shown in c, d). MA-PaPSP addresses both with the blocks described above. The bottom figure shows AURC scores for different thresholds; a lower curve indicates better performance.
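For reference, a minimal sketch of how AURC (area under the risk-coverage curve) can be computed from per-sample confidences and errors; the error definition (e.g., a thresholded captioning metric) is an assumption of this sketch.

import numpy as np

def aurc(confidences: np.ndarray, errors: np.ndarray) -> float:
    # Rank samples from most to least confident, then accumulate risk
    # (average error on the covered subset) as coverage grows.
    order = np.argsort(-confidences)
    sorted_err = errors[order]
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(sorted_err) / np.arange(1, n + 1)
    return float(np.trapz(risk, coverage))

# Example: a well-calibrated selector assigns low confidence to its mistakes,
# so errors only enter at high coverage and the area stays small.
conf = np.array([0.9, 0.8, 0.3, 0.2])
err = np.array([0.0, 0.0, 1.0, 1.0])
print(aurc(conf, err))  # ~0.15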

Qualitative Results

Example in which both MA-PaPSP and PaPSP accept the input.

Example in which MA-PaPSP rejects and PaPSP accepts the input.

Example in which both MA-PaPSP and PaPSP reject the input.

Example in which MA-PaPSP accepts and PaPSP rejects the input.

Disclaimer

MA-PaPSP is purely a research project. We currently have no plans to incorporate MA-PaPSP into a product or to expand access to the public. We will also put United States AI principles into practice when further developing the models. Our research paper accounts for the ethical concerns associated with language generation research. To mitigate issues with the test data, we implemented a rigorous filtering process that purges the data of inappropriate content, such as explicit imagery and offensive language, minimizing the likelihood of generating such content.

BibTeX

@article{ma-papsp,
  author = {Sarkar, Aditya and Li, Yi and Cheng, Jiacheng and Mishra, Shlok and Vasconcelos, Nuno},
  journal = {ArXiv preprint},
  title = {Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction},
  url = {https://arxiv.org/abs/2601.22570},
  volume = {abs/2601.22570},
  year = {2026}
}