Mitigating Hallucinations in Large Vision-Language Models (LVLMs) via Language-Contrastive Decoding (LCD)

Avshalom Manevich
Bar Ilan University
avshalomman@gmail.com
&Reut Tsarfaty
Bar Ilan University
reut.tsarfaty@biu.ac.il

Abstract

Large Vision-Language Models (LVLMs) are an extension of Large Language Models (LLMs) that facilitate processing both image and text inputs, expanding AI capabilities. However, LVLMs struggle with object hallucinations due to their reliance on text cues and learned object co-occurrence biases. While most research quantifies these hallucinations, mitigation strategies are still lacking. Our study introduces a Language Contrastive Decoding (LCD) algorithm that adjusts LVLM outputs based on LLM distribution confidence levels, effectively reducing object hallucinations. We demonstrate the advantages of LCD in leading LVLMs, showing up to 4% improvement in POPE F1 scores and up to 36% reduction in CHAIR scores on the COCO validation set, while also improving captioning quality scores. Our method effectively improves LVLMs without needing complex post-processing or retraining, and is easily applicable to different models. Our findings highlight the potential of further exploration of LVLM-specific decoding algorithms for improved multimodal performance.



[Figure 1]

1 Introduction

Large Vision-Language Models (LVLMs) are a multimodal extension of Large Language Models (LLMs), transforming textual prompts and image inputs into text. However, they frequently produce object hallucinations, where absent objects are mentioned in the output (Li et al., 2023b; Lovenia et al., 2023).

While hallucination-mitigation techniques in LLMs are actively researched, specific strategies for LVLMs are less developed. Current methods involve model-specific adjustments, additional training, or auxiliary models for post-hoc correction, and often prove inefficient, costly, or limited by training data and model biases (Wang et al., 2023; Zhou et al., 2023; Gunjal et al., 2023; Yin et al., 2023). Conversely, LVLM hallucination evaluation has seen progress with object hallucination benchmarks like NOPE (Lovenia et al., 2023) and POPE (Li et al., 2023b), and with recent works that aim for more holistic LVLM hallucination evaluation, such as FaithScore (Jing et al., 2023) and HallusionBench (Guan et al., 2023).

A key reason for LVLM hallucinations is their tendency to over-rely on linguistic information, as first observed by Guan et al. (2023). Based on this insight, we propose to intervene in the LVLM decoding phase so that model outputs are less informed by language biases. Specifically, we propose to use Contrastive Decoding (Li et al., 2023a; O'Brien and Lewis, 2023) to alter LVLM output probabilities with respect to the internal LLM's probabilities, guided by a dynamic weighting mechanism based on the LLM distribution's entropy.

Our experiments show that our proposed method, Language Contrastive Decoding (LCD), improves hallucination scores on POPE (Li et al., 2023b) and CHAIR (Rohrbach et al., 2018) for InstructBLIP variants based on Vicuna and Flan-T5 (Dai et al., 2023), LLAVA 1.5 (Liu et al., 2023), and mPLUG-Owl2 (Ye et al., 2023). We assess LCD's overall generation quality by reporting captioning metrics and conducting a GPT4-V (OpenAI et al., 2023) assisted evaluation. LCD, as a decoding strategy, can be applied to other models without additional training or output modifications, emphasizing its utility for broader LVLM use.

The contributions of this paper are thus threefold. First, we introduce a novel decoding method tailored for LVLMs to mitigate object hallucinations. Next, we present a dynamic weighting strategy based on entropy, applicable to various CD scenarios. Finally, we share our code to encourage further research into LVLM-specific decoding strategies, a promising avenue for future research.

2 Motivation and Background

The integration of vision capabilities into LLMs has led to the development of Large Vision-Language Models, merging LLMs' textual understanding with vision-text encoders. This trend towards multimodal systems is exemplified in commercial platforms such as GPT4-V (OpenAI et al., 2023) and Google's Gemini (Team et al., 2023).

Large Vision-Language Models

combine LLMs and vision-text encoders to generate text from textual prompts and visual inputs. An LVLM generally comprises three main components: a vision-text encoder like CLIP (Radford et al., 2021), an LLM such as LLAMA (Touvron et al., 2023) or Flan-T5 (Chung et al., 2022), and a cross-modal alignment module linking the vision-text encoder output with the LLM.

Initially, LVLMs were fine-tuned for specific tasks (Li et al., 2022; Wang et al., 2022). However, advancements in LLMs have led to a shift towards general-purpose, instruction-tuned LVLMs. These models are designed to handle a wide range of tasks based on instructions, making them more versatile. Despite these advancements, LVLMs grapple with hallucinations of different types.

LVLMs Hallucinations and their Mitigation

Hallucinations in LVLMs, particularly object hallucinations where nonexistent entities are mentioned, are often attributed to LVLMs' reliance on spurious correlations and language biases, as demonstrated by Li et al. (2023c) and Zhou et al. (2023). Moreover, Guan et al. (2023) highlight LVLMs' tendency to prioritize language over visual data, leading to hallucinations.

Mitigation strategies proposed by Gunjal et al. (2023) and Wang et al. (2023) involve further model training with augmented datasets or reward models. Zhou et al. (2023) and Yin et al. (2023) developed auxiliary models to correct outputs post-generation. These solutions often require dataset-specific work or additional model training, potentially leading to overfitting or new biases, and are not easily transferable across LVLMs.

In concurrent work, Leng et al. (2023) develop an LVLM-specific decoding algorithm for mitigating hallucinations, using a noisy copy of the input image as a contrastive input. While their approach uses visual noise to guide the decoding process, LCD leverages the language modality to mitigate hallucinations. These approaches are orthogonal and can potentially be combined into a unified Language-Visual contrastive decoding algorithm, a direction we leave for future work. (Favero et al. (2024) propose a method with a high resemblance to ours; however, our work predates theirs. https://openreview.net/forum?id=aReb-02mhR)

3 Language Contrastive Decoding (LCD)

Before presenting LCD, we briefly introduce the essentials of decoding in LVLMs (Section 3.1), followed by our formal proposal (Section 3.2) and research hypothesis (Section 3.3).

3.1 Decoding Techniques and Contrastive Decoding: Essential Preliminaries

Decoding in auto-regressive generative models is the stage that transforms an input representation into a sequence of output tokens. In LVLMs, this process involves a model $M$, an image $I$, a textual prompt $X$, and a particular timestamp $t$ during generation. It can be described as a series of selections from the model's probability distribution, producing a token sequence $T$, as formalized in Eq. (1).

$$T_t \sim P(\cdot \mid I, X, T_{<t}; M) \qquad (1)$$

Greedy decoding, selecting the most probable token at each step (or the top $k$ tokens in a beam search with beam size $k$), is the simplest approach. However, high-likelihood sequences do not necessarily align with human preferences, leading to the "likelihood trap" (Zhang et al., 2021). This has led to the use of sampling-based methods, such as top-k sampling, nucleus sampling (Holtzman et al., 2020), and locally typical sampling (Meister et al., 2023), which either truncate the set of candidate tokens or adjust the model's distribution, e.g. through temperature scaling.
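To make these sampling-based methods concrete, the following minimal sketch (not taken from the paper's code; the function name and defaults are illustrative) shows a single decoding step that applies temperature scaling and nucleus (top-p) truncation to next-token logits:

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_p: float = 0.9) -> int:
    # logits: 1-D tensor of next-token logits at the current step.
    probs = torch.softmax(logits / temperature, dim=-1)           # temperature scaling
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    keep = cumulative - sorted_probs < top_p                      # smallest prefix reaching mass top_p
    keep[0] = True                                                # always keep the most probable token
    kept = torch.where(keep, sorted_probs, torch.zeros_like(sorted_probs))
    kept = kept / kept.sum()                                      # renormalize over the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice])

Greedy decoding corresponds to always returning sorted_ids[0] instead of sampling.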

Contrastive Decoding (CD) has been introduced for LLMs as a method to penalize the outputs of an expert model with those from a less powerful model (Li et al., 2023a). CD can be applied to any two probability distributions with the same support and has been adapted as a sampling strategy, improving various text generation tasks (O'Brien and Lewis, 2023; Chuang et al., 2023; Sennrich et al., 2024). CD uses both truncation and reshaping of probability distributions. The truncation phase ("adaptive plausibility") is described by Eq. (2), where $\alpha$ is a hyper-parameter, $\mathcal{V}$ and $\mathcal{V}_t'$ are the original and truncated token vocabularies at time $t$, and $P$ is the conditional distribution on the prefix $T_{<t}$.

$$\mathcal{V}_t' = \{v \in \mathcal{V} : P(v \mid T_{<t}) \geq \alpha \max_{w} P(w \mid T_{<t})\} \qquad (2)$$
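As a reference point, the truncation in Eq. (2) can be written as a vocabulary mask in a few lines; this is an illustrative sketch, not the authors' implementation:

import torch

def adaptive_plausibility_mask(log_probs: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # Keep tokens whose probability is at least alpha times that of the best token,
    # i.e. log P(v) >= log(alpha) + max_w log P(w).
    threshold = torch.log(torch.tensor(alpha)) + log_probs.max()
    return log_probs >= threshold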

Finally, the CD formula, as suggested by O'Brien and Lewis (2023), given here in general form for two conditional distributions $P$ and $P'$ over a variable $x$ with the same support, conditioned on a prefix sequence $X$, is presented in Eq. (3).

$$CD_t(x, X, P, P') = \begin{cases} (1+\beta)\log P(x \mid X) - \beta \log P'(x \mid X), & \text{if } x \in \mathcal{V}_t' \\ -\infty, & \text{otherwise} \end{cases} \qquad (3)$$

Here $\beta$ is a fixed weight hyper-parameter. Our proposed method, detailed shortly, alters CD by introducing an entropy-based dynamic weighting scheme.
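Putting Eqs. (2) and (3) together, a fixed-beta CD step can be sketched as follows (reusing adaptive_plausibility_mask from above; both distributions are assumed to be over the same vocabulary, and the function name is illustrative):

import torch

def contrastive_scores(expert_log_probs: torch.Tensor,
                       amateur_log_probs: torch.Tensor,
                       alpha: float = 0.1,
                       beta: float = 0.5) -> torch.Tensor:
    # Tokens outside the truncated vocabulary V_t' receive a score of -inf.
    keep = adaptive_plausibility_mask(expert_log_probs, alpha)
    scores = (1.0 + beta) * expert_log_probs - beta * amateur_log_probs
    return torch.where(keep, scores, torch.full_like(scores, float("-inf")))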

3.2 Proposed Method

Our intuition, based on previous findings by Guan et al. (2023), Rohrbach et al. (2018), and Li et al. (2023b), is that an LVLM can be "misled" by its constituent LLM during the generation process.

Consider, for example, an LVLM that is describing an image (see Figure 1). Mid-generation, given the text "An image of a man walking his," it may predict "dog" due to language biases, even if the image actually shows a bear. A 'plain' LLM, without seeing the image, reinforces these biases by rating "dog" highly. Our method builds on this insight to guide an LVLM towards more accurate predictions using Contrastive Decoding.

Our method operates as follows. At each generation step $t$, for each token $x$, we first determine the next-token probabilities from the LVLM, $P_{LVLM}$, based on the current token sequence $T_{<t}$, text $X$, and image $I$. We then obtain a second distribution, $P_{LLM}$, by inputting all data except the image into the LLM. The LLM's conditional entropy $\mathrm{H}_{LLM}$ informs the dynamic weight as per Eq. (4). We then adjust token $x$'s logits using the LCD formula in Eq. (5).

$$\beta_t = \frac{\beta}{\mathrm{H}_{LLM}(x \mid X, T_{<t})} \qquad (4)$$

$$LCD_t(x, T_{<t}, I, P_{LVLM}, P_{LLM}) = (1+\beta_t)\log P_{LVLM}(x \mid I, X, T_{<t}) - \beta_t \log P_{LLM}(x \mid X, T_{<t}) \qquad (5)$$

In our experiments, we generate text completions by sampling from the next token probabilities, which are obtained by applying the softmax function to the logits produced by the LCD algorithm.
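The following is a minimal sketch of a single LCD decoding step implementing Eqs. (4) and (5). It assumes the LVLM and its internal LLM share a vocabulary, and it applies the adaptive-plausibility mask of Eq. (2) to the LVLM distribution (an assumption about where truncation is applied, not a description of the released code):

import torch

def lcd_step(lvlm_logits: torch.Tensor,
             llm_logits: torch.Tensor,
             alpha: float = 0.1,
             beta: float = 3.0) -> int:
    lvlm_log_probs = torch.log_softmax(lvlm_logits, dim=-1)
    llm_log_probs = torch.log_softmax(llm_logits, dim=-1)

    # Eq. (4): dynamic weight from the entropy of the LLM's next-token distribution.
    llm_probs = llm_log_probs.exp()
    entropy = -(llm_probs * llm_log_probs).sum()
    beta_t = beta / entropy

    # Eq. (2): adaptive plausibility over the LVLM distribution.
    keep = lvlm_log_probs >= torch.log(torch.tensor(alpha)) + lvlm_log_probs.max()

    # Eq. (5): LCD logits, with implausible tokens removed.
    lcd_logits = (1.0 + beta_t) * lvlm_log_probs - beta_t * llm_log_probs
    lcd_logits = torch.where(keep, lcd_logits, torch.full_like(lcd_logits, float("-inf")))

    # Sample the next token from the softmax of the LCD logits, as described above.
    return int(torch.multinomial(torch.softmax(lcd_logits, dim=-1), num_samples=1))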

Table 1: Image detailed-description results. METEOR, WMD, and ROUGE-L are captioning metrics (higher is better); Acc and Det are the GPT4-V-assisted Accuracy and Detailedness scores (higher is better); CHAIRs and CHAIRi are hallucination rates (lower is better).

| Model | Method | METEOR↑ | WMD↑ | ROUGE-L↑ | Acc↑ | Det↑ | CHAIRs↓ | CHAIRi↓ |
| InstructBLIP_F | Baseline | .157 | .367 | .161 | 4.92 | 4.02 | .662 | .146 |
| InstructBLIP_F | LCD | .159 | .370 | .168 | 5.40 | 4.01 | .566 | .131 |
| InstructBLIP_V | Baseline | .178 | .423 | .291 | 3.70 | 3.51 | .274 | .126 |
| InstructBLIP_V | LCD | .199 | .480 | .380 | 4.59 | 3.83 | .174 | .107 |
| LLAVA 1.5 | Baseline | .163 | .357 | .169 | 4.77 | 4.56 | .672 | .182 |
| LLAVA 1.5 | LCD | .171 | .352 | .181 | 5.39 | 4.54 | .610 | .161 |
| mPLUG-Owl2 | Baseline | .162 | .357 | .163 | 4.68 | 4.70 | .660 | .190 |
| mPLUG-Owl2 | LCD | .177 | .372 | .184 | 5.11 | 4.69 | .614 | .145 |

3.3 Research Hypothesis

Our hypothesis is that contrasting LVLM outputs with LLM outputs conditioned only on the textual data can mitigate language biases, thereby reducing hallucinations in LVLMs.

4 Experiments and Results

We set out to assess the effect of LCD on object hallucinations in LVLM outputs against popular decoding settings. Additionally, we verify that LCD does not degrade output quality. To this end, we assess LCD on the POPE benchmark (Li et al., 2023b) and on an image detailed-description task, where we report hallucination and captioning metrics and conduct a GPT4-V-assisted evaluation.

Polling-based Object-Probing Evaluation

POPE consists of object-presence binary questions on 500 COCO dataset images (Lin et al., 2015), with questions equally divided between present and absent objects. Absent objects are chosen based on three criteria: random, popular (common in COCO), and adversarial (commonly co-occurring with present objects). POPE's drawback is its one-word response structure, which limits the influence of decoding strategies and does not evaluate open-ended generation capabilities.

Image Detailed-Descriptions

To complement POPE, we introduce a long-form text generation task called "Image Detailed-Descriptions," inspired by findings from Zhou et al. (2023) that more extensive context increases the likelihood of hallucinations. In this task, the input consists of an image from the COCO dataset and a text prompt requesting a detailed description of the image. The expected output is a long-form, detailed textual description of the given image, typically containing multiple sentences. The prompts used in this task are detailed in Appendix A.1. By using the same COCO images as POPE, we maintain consistency in the visual domain while exploring LCD's effectiveness in a more challenging setting where the model is required to generate longer, more descriptive outputs.

Baselines and Metrics

For POPE, we use sampling as the baseline and report F1 scores (complete POPE results are in the appendix, Table 4). For the detailed-descriptions task, we use the popular nucleus sampling algorithm as a baseline (we find that nucleus sampling gives better results than vanilla sampling; see Table 3 in the appendix for ablations) and report CHAIR metrics (Rohrbach et al., 2018). To assess description quality, we use captioning metrics against COCO's gold captions, which serve as an approximation given the length differences. Additionally, following Yin et al. (2023), we use GPT4-V to evaluate the descriptions for Detailedness and Accuracy (see details in Appendix A.1).
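For reference, the CHAIR metrics can be sketched as follows, given per-caption sets of mentioned and ground-truth objects (the object extraction and synonym mapping of Rohrbach et al. (2018) are assumed to have been applied already; counting unique objects per caption is a simplification):

def chair_scores(predicted_objects, gold_objects):
    # predicted_objects, gold_objects: lists of sets of object names, one pair per caption.
    hallucinated, mentioned, captions_with_hallucination = 0, 0, 0
    for pred, gold in zip(predicted_objects, gold_objects):
        halluc = pred - gold                      # objects mentioned but not in the image
        hallucinated += len(halluc)
        mentioned += len(pred)
        captions_with_hallucination += int(bool(halluc))
    chair_i = hallucinated / max(mentioned, 1)                              # instance-level
    chair_s = captions_with_hallucination / max(len(predicted_objects), 1)  # sentence-level
    return chair_s, chair_i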

Models

We conduct our experiments with leading LVLMs: two versions of the InstructBLIP model (with Flan-T5 and Vicuna LLMs), LLAVA 1.5 and mPLUG-Owl2. The complete experimental details, such as exact model variants and generation hyper-parameters, are given in the Appendix.

5 Results and Discussion

For the POPE task, which evaluates object hallucinations using binary questions, LCD improves F1 scores across 11 out of 12 configurations compared to the baseline (Table 2). This suggests that LCD is effective in reducing object hallucinations in the POPE setting. It is worth noting that the POPE setting is highly constrained for decoding algorithms, as it consists of binary yes/no questions, and typically involves only a single decoding step. This limits the potential impact of decoding strategies on the model’s performance in this specific task.

In the detailed-description task, which involves generating detailed descriptions of images, LCD significantly reduces hallucinations at both the sentence and instance levels across all four models tested (Table 1). However, it is important to note that despite the improvements, the CHAIR scores, which measure hallucination rates (lower is better), remain relatively high. This indicates that object hallucinations are still prevalent in long-form LVLM outputs, even with the application of LCD (examples of generated descriptions are found in Appendix A.2).

We observe that LCD is particularly effective in improving the performance of the InstructBLIP models (InstructBLIP_F and InstructBLIP_V). We hypothesize that this may be because the LLMs in these models are frozen during training, which results in a stronger language bias that LCD can effectively mitigate. When evaluating the overall generation quality using captioning metrics (METEOR, WMD, and ROUGE-L), LCD outperforms the baseline in all cases except one (WMD for LLAVA 1.5, where the reduction is approximately 1%). This indicates that LCD not only reduces hallucinations but also maintains or improves the overall quality of the generated descriptions.

Furthermore, in the GPT4-V assisted evaluation, which assesses the accuracy and detailedness of the generated descriptions, LCD improves the accuracy scores across all models. Interestingly, the detailedness scores remain similar to the baseline, suggesting that LCD reduces hallucinations without increasing the granularity of the descriptions.

Table 2: POPE F1 scores by question type (Baseline vs. LCD).

| POPE | Model | Baseline F1 | LCD F1 |
| Random | InstructBLIP_V | 83.95 | 87.55 |
| Popular | InstructBLIP_V | 82.80 | 84.34 |
| Adversarial | InstructBLIP_V | 80.25 | 81.64 |
| Random | InstructBLIP_F | 84.05 | 84.27 |
| Popular | InstructBLIP_F | 80.74 | 82.81 |
| Adversarial | InstructBLIP_F | 78.87 | 80.69 |
| Random | LLAVA 1.5 | 84.17 | 83.76 |
| Popular | LLAVA 1.5 | 83.10 | 83.47 |
| Adversarial | LLAVA 1.5 | 81.34 | 81.62 |
| Random | mPLUG-Owl2 | 86.96 | 87.51 |
| Popular | mPLUG-Owl2 | 82.88 | 84.93 |
| Adversarial | mPLUG-Owl2 | 82.93 | 83.91 |

6 Conclusion

In this paper we present Language Contrastive Decoding, a novel method to reduce hallucinations in LVLMs. By dynamically adjusting output probabilities using the LVLM's internal LLM, LCD significantly improves hallucination metrics across different LVLM architectures, enhancing the quality and reliability of generated content without requiring retraining, auxiliary models, or post-processing. This work highlights the potential of specialized decoding strategies in enhancing multimodal AI models and lays the groundwork for further exploration of more sophisticated LVLM decoding methods.

7 Limitations

Firstly, while LCD shows promise in reducing hallucinations, it targets only hallucinations caused by language biases; hallucinations can also arise from other sources. For instance, previous work has shown that some hallucinations are caused by poor visual understanding (Guan et al., 2023). We believe LCD can serve as a platform for crafting LVLM-specific decoding algorithms that mitigate hallucinations stemming from different factors, and we leave this pursuit for future work.

Secondly, our evaluation method primarily addresses object hallucinations, which are only one form of hallucination that LVLMs may exhibit. Preliminary results signal that LCD mitigates more complex manifestations of language-induced hallucinations as assessed by recent benchmarks such as FaithScore (Jing et al., 2023) and HallusionBench (Guan et al., 2023), but further work is required to establish this.

Moreover, LCD relies on current LVLM architectures that combine an LLM and a text-vision encoder, and requires access to an LLM that emits output probabilities on the same set of tokens as the LVLM. It is possible that the future generation of multimodal AI systems will have a different architecture that will make LCD obsolete. Additionally, LCD requires an LLM forward pass for each LVLM decoding step. The added latency could be mitigated with efficient inference techniques, and also by using a smaller LLM as the contrasting model. The effectiveness of LCD in this scenario is left for future work.

Finally, there are ethical considerations related to the mitigation of hallucinations in LVLMs. As these models become more reliable, it is crucial to continue evaluating the potential impacts of their use, ensuring they do not perpetuate or exacerbate biases present in their training data. LCD indeed mitigates some biases, but it is important to keep in mind that it might amplify other biases, unknown to us. Responsible deployment of these models requires ongoing vigilance and a commitment to transparency and fairness.

Acknowledgements

We thank Yoav Goldberg, Ido Dagan, and the participants of the NLP seminar at Bar-Ilan University for their valuable feedback. This research has been funded by a grant from the European Research Council, ERC-StG grant number 677352, and by a grant from the Israeli Science Foundation (ISF), grant number 670/23, for which we are grateful.

References

  • Banerjee and Lavie (2005)Satanjeev Banerjee and Alon Lavie. 2005.METEOR: An automatic metric for MT evaluation with improved correlation with human judgments.In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Chuang etal. (2023)Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2023.Dola: Decoding by contrasting layers improves factuality in large language models.
  • Chung etal. (2022)HyungWon Chung, LeHou, Shayne Longpre, Barret Zoph, YiTay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, ShixiangShane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, EdH. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, QuocV. Le, and Jason Wei. 2022.Scaling instruction-finetuned language models.
  • Dai etal. (2023)Wenliang Dai, Junnan Li, Dongxu Li, Anthony MengHuat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023.Instructblip: Towards general-purpose vision-language models with instruction tuning.
  • Favero etal. (2024)Alessandro Favero, Luca Zancato, Matthew Trager, Siddharth Choudhary, Pramuditha Perera, Alessandro Achille, Ashwin Swaminathan, and Stefano Soatto. 2024.Multi-modal hallucination control by visual information grounding.
  • Guan et al. (2023) Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2023. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models.
  • Gunjal etal. (2023)Anisha Gunjal, Jihan Yin, and Erhan Bas. 2023.Detecting and preventing hallucinations in large vision language models.
  • Holtzman etal. (2020)Ari Holtzman, Jan Buys, LiDu, Maxwell Forbes, and Yejin Choi. 2020.The curious case of neural text degeneration.
  • Jing etal. (2023)Liqiang Jing, Ruosen Li, Yunmo Chen, Mengzhao Jia, and Xinya Du. 2023.Faithscore: Evaluating hallucinations in large vision-language models.
  • Kusner etal. (2015)M.J. Kusner, Y.Sun, N.I. Kolkin, and K.Q. Weinberger. 2015.From word embeddings to document distances.In ICML.
  • Leng etal. (2023)Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2023.Mitigating object hallucinations in large vision-language models through visual contrastive decoding.
  • Li etal. (2022)Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022.Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.
  • Li etal. (2023a)XiangLisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023a.Contrastive decoding: Open-ended text generation as optimization.
  • Li etal. (2023b)Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, WayneXin Zhao, and Ji-Rong Wen. 2023b.Evaluating object hallucination in large vision-language models.In The 2023 Conference on Empirical Methods in Natural Language Processing.
  • Li etal. (2023c)Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, WayneXin Zhao, and Ji-Rong Wen. 2023c.Evaluating object hallucination in large vision-language models.
  • Lin (2004)Chin-Yew Lin. 2004.ROUGE: A package for automatic evaluation of summaries.In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  • Lin etal. (2015)Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C.Lawrence Zitnick, and Piotr Dollár. 2015.Microsoft coco: Common objects in context.
  • Liu etal. (2023)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee. 2023.Improved baselines with visual instruction tuning.
  • Lovenia etal. (2023)Holy Lovenia, Wenliang Dai, Samuel Cahyawijaya, Ziwei Ji, and Pascale Fung. 2023.Negative object presence evaluation (nope) to measure object hallucination in vision-language models.
  • Meister etal. (2023)Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023.Locally typical sampling.
  • O’Brien and Lewis (2023)Sean O’Brien and Mike Lewis. 2023.Contrastive decoding improves reasoning in large language models.
  • OpenAI et al. (2023) OpenAI. 2023. GPT-4 technical report.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021.Learning transferable visual models from natural language supervision.
  • Rohrbach etal. (2018)Anna Rohrbach, LisaAnne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018.Object hallucination in image captioning.In Empirical Methods in Natural Language Processing (EMNLP).
  • Sennrich etal. (2024)Rico Sennrich, Jannis Vamvas, and Alireza Mohammadshahi. 2024.Mitigating hallucinations and off-target machine translation with source-contrastive and language-contrastive decoding.
  • Team et al. (2023) Gemini Team, Google. 2023. Gemini: A family of highly capable multimodal models.
  • Touvron etal. (2023)Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023.Llama: Open and efficient foundation language models.
  • Wang etal. (2022)Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, CeLiu, and Lijuan Wang. 2022.Git: A generative image-to-text transformer for vision and language.
  • Wang etal. (2023)Lei Wang, Jiabang He, Shenshen Li, Ning Liu, and Ee-Peng Lim. 2023.Mitigating fine-grained hallucination by fine-tuning large vision-language models with caption rewrites.
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2023. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration.
  • Yin etal. (2023)Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, KeLi, Xing Sun, and Enhong Chen. 2023.Woodpecker: Hallucination correction for multimodal large language models.
  • Zhang etal. (2021)Hugh Zhang, Daniel Duckworth, Daphne Ippolito, and Arvind Neelakantan. 2021.Trading off diversity and quality in natural language generation.In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 25–33, Online. Association for Computational Linguistics.
  • Zhou etal. (2023)Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang, Zhun Deng, Chelsea Finn, Mohit Bansal, and Huaxiu Yao. 2023.Analyzing and mitigating object hallucination in large vision-language models.

Appendix A Appendix

A.1 Detailed Experimental Setup

For POPE and the descriptions experiment, we use the following LCD parameters: β = 3.0 and α = 0.1. We set the temperature to 0.5 in POPE and 1.0 in the descriptions experiment. We limit the description length to 250 tokens in all models we tested. We do not tune any of these parameters. The prompt we use for the descriptions experiment is "Describe this image in detail:". The models we use have the following Huggingface identifiers (a loading sketch follows the list):

  • Salesforce/instructblip-vicuna-7b

  • Salesforce/instructblip-flan-t5-xl

  • llava-hf/llava-1.5-7b-hf

  • MAGAer13/mplug-owl2-llama2-7b
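The sketch below shows one way (assumed, not the authors' released code) to obtain the two next-token distributions that LCD contrasts, using the InstructBLIP-Vicuna checkpoint listed above with the Hugging Face transformers library; the image path is a placeholder:

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-7b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(model_id)

prompt = "Describe this image in detail:"
image = Image.open("example.jpg")  # placeholder path

# LVLM pass: image + prompt -> next-token logits at the last position.
inputs = processor(images=image, text=prompt, return_tensors="pt")
lvlm_logits = model(**inputs).logits[0, -1]

# Text-only pass through the internal LLM: same prompt, no image.
text_ids = processor.tokenizer(prompt, return_tensors="pt").input_ids
llm_logits = model.language_model(input_ids=text_ids).logits[0, -1]

# The two logit vectors can then be combined with the lcd_step sketch from Section 3.2.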

GPT4-V Assisted Evaluation

We follow the evaluation protocol given in Yin et al. (2023), where an image and two descriptions are given to the model, formatted with the prompt in Figure 2. The model outputs scores along two dimensions: Accuracy and Detailedness. We used the gpt-4-vision-preview model in February 2024.

[Figure 2: GPT4-V evaluation prompt]

A.2 COCO Detailed Descriptions Examples

The descriptions in this section were generated by the LLAVA 1.5 model.

[Image]

Nucleus Sampling: The image features a white counter area with various appliances and a sign. On top of the counter, there is a microwave positioned alongside a toaster oven. Below the toaster oven is a sign with suggestions to keep people safe while using the kitchen. Several appliances are present, including two cups, one being closer to the foreground, and the other near the edge of the counter. A sink is located near the lower left corner, and a spoon can be seen towards the right side of the counter. There are two ovens, with one situated towards the right side, and the other near the center of the counter. Lastly, a bottle is visible in the background, at the far left edge.

LCD: The image showcases a kitchen setting featuring a microwave sitting on top of a shelf. Below the microwave, a toaster oven is placed in close proximity. There are also a couple of toaster ovens situated further back in the scene. A note is posted on the shelf, offering suggestions to keep everyone safe by covering one’s experience while using the appliances, cleaning up after themselves, and not turning on the microwaves with no one around. The scene reflects a cooperative environment for kitchen users.

[Image]

Nucleus Sampling: The image features an outdoor market with a variety of fruits and vegetables for sale. Two people can be seen in the scene, likely browsing the options and making their selections. The produce consists of several bunches of ripe red tomatoes, and smaller, freshly picked green peppers. There is also a larger selection of yellow bell peppers available in various sizes. In addition to the vegetables, there is a bottle visible near the top left corner. Overall, the market offers an array of fresh, colorful, and healthy products for customers to choose from.

LCD: The image showcases an outdoor produce market filled with an abundant selection of fruits and vegetables. Numerous tomatoes can be found throughout the market, in various locations and sizes. Peppers are another prominent feature at this market, with multiple bunches spread across different parts of the display. A person is visible near the top left of the scene, likely browsing through the market offerings or engaging with the sellers. Overall, it’s a lively marketplace filled with fresh and healthy options.

[Image]

Nucleus Sampling: The image displays a room filled with numerous decorative items. A prominent green ceramic vase stands out, with a colorful art deco sculpture depicting a woman laying on it. There are also two smaller vases in the scene – one in the upper right corner and the other closer to the center of the room. The room is adorned with a variety of objects, including a clock on the wall, a collection of books scattered in the background, and a chair positioned in the middle. This visually rich space showcases a diverse assortment of vases, sculptures, and other home decor items.

LCD: The image showcases an assortment of vases on display, featuring a variety of colors and sizes. Among these, there is a large green vase as the focal point of the scene. Inside this large green vase, a statue of a woman can be seen lying on the side of the vase. Another green vase is located towards the right side of the scene. Additionally, a smaller green vase is found close to the center. Two other smaller vases can also be spotted within the collection. The diverse display of vases creates an engaging and artistic atmosphere for visitors to enjoy.

[Image]

Nucleus Sampling: The image features a tray of cake covered with foil and a plastic container with coconut powder, all placed on a counter. There are several sliced chocolate cakes, some of which are stacked. Additionally, a pair of tongs is resting beside the cakes, ready to be used for serving. A bowl with a spoon in it is also situated on the table, accompanied by a banana and a spoon nearby. The overall arrangement looks like a setting for a delicious dessert.

LCD: The image features a dining table with various items arranged on it. A banana can be seen prominently, occupying a large portion of the left side of the table. Along with the banana, there’s a can of powdered cocoa placed towards the top-middle area of the table. Additionally, two bowls are present on the table – one occupying the middle-right part and another covering a smaller section on the right side. There is also a spoon situated on the right side of the table, ready for use in enjoying the snack or dessert.

[Image]

Nucleus Sampling: The image displays a delicious meal consisting of chicken, broccoli, and a banana. The plate is large and fills most of the frame, with various items on the plate. The chicken is located at the bottom-right side of the plate, while the broccoli pieces are scattered throughout the plate. Additionally, there is a banana on the plate that is placed towards the top left corner. In the background, a bottle can be seen, likely containing a drink that complements the meal.

LCD: The image features a white plate topped with various foods. On the plate, there’s a banana positioned towards the left, accompanied by some meat, likely fish sticks. Alongside the main items, a glass of skim milk is present, adding to the nutritious meal. There is also a sandwich, possibly a piece of bread, contributing to the diverse meal displayed on the plate.

[Image]

Nucleus Sampling: The image features a bright red fire hydrant prominently placed on the side of the street. The fire hydrant is situated close to a bus stop and appears to be unused. A large white bus is positioned behind the fire hydrant on the other side of the street. A car is also visible in the background, slightly to the right. In addition, there are two people in the scene, both located at the far right side. One of them is holding a cell phone, possibly capturing the moment or checking their phone.

LCD: In the scene, a red fire hydrant is situated next to the street, capturing attention due to its striking color. A bus is driving down the road nearby, positioned closer to the background of the image. There are several people dispersed along the street. Some of them can be seen closer to the middle and right side of the scene, while others are scattered across the entire street, possibly going about their day or waiting to board the bus.

A.3 Detailed Experimental Results

Table 3: Ablation results on the detailed-description task. Baseline is vanilla sampling, Baseline_N is nucleus sampling, and LCD_-dw is LCD without the entropy-based dynamic weighting.

| Model | Method | METEOR↑ | WMD↑ | ROUGE-L↑ | CHAIRs↓ | CHAIRi↓ |
| InstructBLIP_F | Baseline | 0.151 | 0.361 | 0.156 | 0.666 | 0.174 |
| InstructBLIP_F | Baseline_N | 0.157 | 0.367 | 0.161 | 0.662 | 0.146 |
| InstructBLIP_F | LCD_-dw | 0.159 | 0.364 | 0.163 | 0.594 | 0.133 |
| InstructBLIP_F | LCD | 0.163 | 0.370 | 0.168 | 0.566 | 0.131 |
| InstructBLIP_V | Baseline | 0.171 | 0.408 | 0.274 | 0.308 | 0.138 |
| InstructBLIP_V | Baseline_N | 0.178 | 0.423 | 0.291 | 0.274 | 0.126 |
| InstructBLIP_V | LCD_-dw | 0.202 | 0.474 | 0.366 | 0.230 | 0.116 |
| InstructBLIP_V | LCD | 0.199 | 0.480 | 0.380 | 0.174 | 0.107 |
| LLAVA 1.5 | Baseline | 0.160 | 0.353 | 0.167 | 0.632 | 0.183 |
| LLAVA 1.5 | Baseline_N | 0.163 | 0.357 | 0.169 | 0.672 | 0.182 |
| LLAVA 1.5 | LCD_-dw | 0.169 | 0.352 | 0.179 | 0.624 | 0.157 |
| LLAVA 1.5 | LCD | 0.171 | 0.352 | 0.181 | 0.610 | 0.161 |

Table 4: Complete POPE results.

| POPE | Method | Model | Accuracy | Precision | Recall | F1 | Yes ratio |
| Random | Baseline | InstructBLIP Vicuna | 84.90% | 89.57% | 79.00% | 83.95% | 44.10% |
| Random | LCD | InstructBLIP Vicuna | 87.53% | 87.43% | 87.67% | 87.55% | 50.13% |
| Popular | Baseline | InstructBLIP Vicuna | 83.30% | 85.35% | 80.40% | 82.80% | 47.10% |
| Popular | LCD | InstructBLIP Vicuna | 83.73% | 81.31% | 87.60% | 84.34% | 53.87% |
| Adversarial | Baseline | InstructBLIP Vicuna | 80.23% | 80.17% | 80.33% | 80.25% | 50.10% |
| Adversarial | LCD | InstructBLIP Vicuna | 80.27% | 76.33% | 87.73% | 81.64% | 57.47% |
| Random | Baseline | InstructBLIP FlanT5 | 85.63% | 94.43% | 75.73% | 84.05% | 40.10% |
| Random | LCD | InstructBLIP FlanT5 | 86.03% | 96.47% | 74.80% | 84.27% | 38.77% |
| Popular | Baseline | InstructBLIP FlanT5 | 82.07% | 87.17% | 75.20% | 80.74% | 43.13% |
| Popular | LCD | InstructBLIP FlanT5 | 84.43% | 92.44% | 75.00% | 82.81% | 40.57% |
| Adversarial | Baseline | InstructBLIP FlanT5 | 79.83% | 82.83% | 75.27% | 78.87% | 45.43% |
| Adversarial | LCD | InstructBLIP FlanT5 | 82.03% | 87.22% | 75.07% | 80.69% | 43.03% |
| Random | Baseline | LLAVA 1.5 | 85.87% | 95.67% | 75.13% | 84.17% | 39.27% |
| Random | LCD | LLAVA 1.5 | 85.73% | 97.18% | 73.60% | 83.76% | 37.87% |
| Popular | Baseline | LLAVA 1.5 | 84.80% | 93.57% | 74.73% | 83.10% | 39.93% |
| Popular | LCD | LLAVA 1.5 | 85.40% | 96.17% | 73.73% | 83.47% | 38.33% |
| Adversarial | Baseline | LLAVA 1.5 | 82.77% | 88.67% | 75.13% | 81.34% | 42.37% |
| Adversarial | LCD | LLAVA 1.5 | 83.33% | 90.98% | 74.00% | 81.62% | 40.67% |
