Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction

1University of Maryland College Park, 2University of California San Diego,
3Qualcomm AI 4Yale University 5Meta AI
ICLR 2026

PaPSP uses an external representation model and the CLIP score to enable selective prediction for VLM tasks, such as captioning, without training. MA-PaPSP augments this model with an external dataset, which is leveraged to estimate more stable proxy embeddings and better-calibrated contrastive scores. The figure shows an example where PaPSP fails but MA-PaPSP succeeds at rejecting an incorrect caption for the image shown. Also shown is the CIDEr score between the predicted and ground-truth captions.

Abstract

Selective prediction aims to endow predictors with a reject option, so as to avoid low-confidence predictions. However, the existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for vision-language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek low-complexity, training-free approaches applicable to any foundation model, and consider methods based on external vision-language embeddings, such as CLIP. We denote this as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, which leads to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory-augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs, and is complemented by contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification.
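For concreteness, the following is a minimal sketch of the plug-and-play idea, assuming CLIP from the Hugging Face transformers library as the external vision-language model; the model name and the threshold value are illustrative assumptions, not the paper's exact configuration.

# Minimal PaPSP-style sketch: accept a caption only if its CLIP image-text
# similarity clears a confidence threshold; otherwise abstain (reject).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def papsp_accepts(image: Image.Image, caption: str, threshold: float = 0.25) -> bool:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the projected image and text embeddings.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    score = (img @ txt.T).item()
    return score >= threshold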

Approach

MA-PaPSP complements PaPSP with retrieval augmentation, as shown above. The corresponding image-text projections are leveraged to address the two problems mentioned in the abstract. Overall, the model retrieves relevant image-text pairs from a large retrieval set using the query (input) image, computes a proxy embedding, and contrasts it with the embeddings of negative captions. It comprises two blocks (see the sketch after this list):
  1. Proxy embedding: MA-PaPSP first estimates the ground-truth embedding as a weighted average of the retrieved sample embeddings, with weights given by the cosine similarity between the query image and the retrieved samples. The result, an embedding of the same dimension as the input image embedding, is called the proxy embedding.
  2. Contrastive scores: MA-PaPSP then contrasts the proxy embedding with negative captions via a softmax operation. For prediction tasks like classification and image-text matching, negative captions are readily available; for generation tasks like captioning, they are not, so MA-PaPSP curates them using either a simple WordNet-based procedure or a small language model. The result is a score that can be thresholded to choose between prediction and abstention.
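A minimal sketch of the two blocks, operating on precomputed, L2-normalized CLIP embeddings, follows. The top-k retrieval, the averaging of each retrieved pair's image and text embeddings, and the temperature are illustrative assumptions of this sketch, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def proxy_embedding(query_img: torch.Tensor,       # (d,) query image embedding
                    retrieval_imgs: torch.Tensor,  # (N, d) retrieval-set image embeddings
                    retrieval_txts: torch.Tensor,  # (N, d) paired caption embeddings
                    k: int = 16) -> torch.Tensor:
    # Block 1: similarity-weighted average of the k nearest retrieved pairs.
    sims = retrieval_imgs @ query_img              # cosine similarities (inputs normalized)
    vals, idx = sims.topk(k)
    w = vals.clamp(min=0)
    w = w / w.sum()                                # weights = normalized cosine similarities
    # Averaging image and text embeddings of each pair is an assumption here.
    pairs = 0.5 * (retrieval_imgs[idx] + retrieval_txts[idx])
    return F.normalize((w.unsqueeze(-1) * pairs).sum(dim=0), dim=-1)

def contrastive_score(proxy: torch.Tensor,         # (d,) proxy embedding
                      candidate: torch.Tensor,     # (d,) predicted-caption embedding
                      negatives: torch.Tensor,     # (M, d) negative-caption embeddings
                      temperature: float = 0.01) -> float:
    # Block 2: softmax the candidate's similarity to the proxy against the
    # negatives; the resulting probability is thresholded to predict or abstain.
    sims = torch.cat([candidate.unsqueeze(0), negatives]) @ proxy
    return F.softmax(sims / temperature, dim=0)[0].item()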
The top figure illustrates the two problems mentioned in the abstract: 1) instability of the visual-language representations (shown in a, b) and 2) poor calibration of similarity scores (shown in c, d). MA-PaPSP addresses both with the blocks described above. The bottom figure shows AURC scores for different thresholds; a lower curve indicates better performance.
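For reference, a minimal sketch of how AURC (area under the risk-coverage curve) can be computed from per-sample confidences and errors; the error definition (e.g., a thresholded captioning metric) is an assumption of this sketch.

import numpy as np

def aurc(confidences: np.ndarray, errors: np.ndarray) -> float:
    # Rank samples from most to least confident, then accumulate risk
    # (average error on the covered subset) as coverage grows.
    order = np.argsort(-confidences)
    sorted_err = errors[order]
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(sorted_err) / np.arange(1, n + 1)
    return float(np.trapz(risk, coverage))

# Example: a well-calibrated selector assigns low confidence to its mistakes,
# so errors only enter at high coverage and the area stays small.
conf = np.array([0.9, 0.8, 0.3, 0.2])
err = np.array([0.0, 0.0, 1.0, 1.0])
print(aurc(conf, err))  # ~0.15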

Qualitative Results

Example in which both MA-PaPSP and PaPSP accept the input.

Example in which MA-PaPSP rejects and PaPSP accepts the input.

Example in which both MA-PaPSP and PaPSP reject the input.

Example in which MA-PaPSP accepts and PaPSP rejects the input.

Disclaimer

MA-PaPSP is purely a research project. We currently have no plans to incorporate MA-PaPSP into a product or to expand access to the public. We will also put United States AI principles into practice when further developing the models. Our research paper accounts for the ethical concerns associated with language generation research. To mitigate issues with the test data, we implemented a rigorous filtering process that purges the data of inappropriate content, such as explicit imagery and offensive language, minimizing the likelihood of generating such content.

BibTeX

@article{ma-papsp,
  author = {Sarkar, Aditya and Li, Yi and Cheng, Jiacheng and Mishra, Shlok and Vasconcelos, Nuno},
  journal = {ArXiv preprint},
  title = {Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction},
  url = {https://arxiv.org/abs/2601.22570},
  volume = {abs/2601.22570},
  year = {2026}
}