https://arxiv.org/pdf/2312.10240
Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions.
However, many generated images still suffer from issues such as artifacts/implausibility, misalignment with text descriptions, and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models,
prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation.
In this paper, we enrich the feedback signal by
(i) marking image regions that are implausible or misaligned with the text, and
(ii) annotating which words in the text prompt are misrepresented or missing on the image.
We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically.
We show that the predicted rich human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions.
Notably, the improvements generalize to models (Muse) beyond those used to generate the images on which the human feedback data were collected (Stable Diffusion variants).
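To illustrate the inpainting use case mentioned above, the following minimal sketch binarizes a predicted artifact/implausibility heatmap into an inpainting mask. It assumes the heatmap is a normalized H×W array in [0, 1]; the function name, threshold, and dilation size are illustrative choices, not settings reported in this paper.

```python
# Minimal sketch: turn a predicted artifact/implausibility heatmap into a
# binary inpainting mask. Threshold and dilation values are illustrative.
import numpy as np

def heatmap_to_mask(heatmap: np.ndarray, threshold: float = 0.5,
                    dilation_px: int = 0) -> np.ndarray:
    """Binarize a [0, 1] heatmap; optionally grow the mask by a few pixels."""
    mask = (heatmap >= threshold).astype(np.uint8)
    if dilation_px > 0:
        # Simple dilation via a max filter over a square neighborhood.
        from scipy.ndimage import maximum_filter
        mask = maximum_filter(mask, size=2 * dilation_px + 1)
    return mask
```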
3. Collecting rich human feedback
3.1. Data collection process
In this section, we discuss our procedure to collect the RichHF-18K dataset, which includes two heatmaps (artifact/implausibility and misalignment), four fine-grained scores (plausibility, alignment, aesthetics, and overall score), and one text sequence (misaligned keywords).
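For concreteness, one annotated example can be represented by a record like the sketch below; the field names and types are our own illustration, not the schema of the released data.

```python
# Hypothetical container for one RichHF-18K example, mirroring the feedback
# types described above; field names are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class RichFeedback:
    artifact_heatmap: np.ndarray      # H x W, implausibility/artifact regions
    misalignment_heatmap: np.ndarray  # H x W, text-misaligned regions
    plausibility: float               # 1-5 Likert, averaged over annotators
    alignment: float
    aesthetics: float
    overall: float
    misaligned_keywords: list         # prompt words marked as misrepresented/missing
```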
For each generated image, the annotators are first asked to examine the image and read the text prompt used to generate it.
Then, they mark points on the image to indicate the location of any implausibility/artifact or misalignment w.r.t. the text prompt.
The annotators are told that each marked point has an “effective radius” (1/20 of the image height), which forms an imaginary disk centered at the marked point.
In this way, a relatively small number of points suffices to cover the flawed image regions.
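A minimal sketch of this rasterization is given below; it assumes points are given in pixel coordinates and uses the 1/20-of-image-height radius described above (the function and variable names are ours).

```python
# Sketch: rasterize one annotator's clicks into a binary heatmap, using the
# "effective radius" of 1/20 of the image height described above.
import numpy as np

def points_to_heatmap(points, height, width, radius_frac=1.0 / 20.0):
    """points: iterable of (row, col) pixel coordinates marked by one annotator."""
    radius = radius_frac * height
    rows, cols = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for r, c in points:
        disk = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        heatmap[disk] = 1.0
    return heatmap
```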
Lastly, annotators label the misaligned keywords and rate plausibility, image-text alignment, aesthetics, and overall quality, each on a 5-point Likert scale.
Detailed definitions of image implausibility/artifact and misalignment can be found in the supplementary materials.
We designed a web UI, as shown in Fig. 1, to facilitate data collection.
More details about the data collection process can be found in the supplementary materials.
3.2. Human feedback consolidation
To improve the reliability of the collected human feedback on generated images, each image-text pair is annotated by three annotators.
We therefore need to consolidate the multiple annotations for each sample.
For the scores, we simply average the scores from the multiple annotators for an image to obtain the final score.
For the misaligned keyword annotations, we perform majority voting: each keyword receives the aligned/misaligned label chosen most frequently across annotators.
For the point annotations, we first convert each annotator's points to a heatmap, where each point becomes a disk region (as discussed in the previous subsection), and then average the heatmaps across annotators.
The regions with clear implausibility are likely to be annotated by all annotators and have a high value on the final average heatmap.
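The consolidation rules above can be sketched as follows, assuming three annotators per sample and reusing the `points_to_heatmap` helper sketched in Sec. 3.1; the data layout and names are illustrative.

```python
# Sketch of the consolidation rules: mean scores, per-keyword majority vote,
# and pixel-wise averaging of per-annotator heatmaps.
import numpy as np

def consolidate(score_lists, keyword_labels, point_sets, height, width):
    """score_lists: {score_type: [per-annotator 1-5 scores]};
    keyword_labels: annotators x keywords array of 0/1 (1 = misaligned);
    point_sets: one list of (row, col) points per annotator."""
    # Scores: mean over annotators for each of the four score types.
    scores = {name: float(np.mean(vals)) for name, vals in score_lists.items()}

    # Misaligned keywords: per-keyword majority vote across annotators.
    labels = np.asarray(keyword_labels)
    misaligned = labels.sum(axis=0) * 2 > labels.shape[0]

    # Heatmaps: rasterize each annotator's points (points_to_heatmap from the
    # Sec. 3.1 sketch), then average pixel-wise.
    heatmap = np.mean(
        [points_to_heatmap(p, height, width) for p in point_sets], axis=0
    )
    return scores, misaligned, heatmap
```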
3.3. RichHF-18K: a dataset of rich human feedback
We select a subset of image-text pairs from the Pick-a-Pic dataset for data annotation.
Although our method is general and applicable to any generated images, we choose the majority of our dataset to be photo-realistic images, due to their importance and wider range of applications. Moreover, we also want balanced categories across the images.
To ensure balance, we utilized the PaLI visual question answering (VQA) model [7] to extract some basic features from the Pick-a-Pic data samples.
Specifically, we asked the following questions for each image-text pair in Pick-a-Pic:
1) Is the image photorealistic?
2) Which category best describes the image?
Choose one in ‘human’, ‘animal’, ‘object’, ‘indoor scene’, ‘outdoor scene’.
PaLI’s answers to these two questions are generally reliable under our manual inspection. We used the answers to sample a diverse subset from Pick-a-Pic, resulting in 17K image-text pairs.
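The sampling step can be sketched as below, with a hypothetical `vqa(image, question)` wrapper around the PaLI model and an illustrative per-bucket quota; the actual quotas and weighting used to build RichHF-18K are not specified here.

```python
# Sketch: bucket Pick-a-Pic samples by VQA-derived attributes, then sample a
# roughly balanced subset. `vqa` is a hypothetical wrapper around PaLI.
import random
from collections import defaultdict

# The two questions posed to the VQA model, as listed above.
QUESTIONS = (
    "Is the image photorealistic?",
    "Which category best describes the image? "
    "Choose one in 'human', 'animal', 'object', 'indoor scene', 'outdoor scene'.",
)

def sample_balanced(pairs, vqa, per_bucket=2000, seed=0):
    """pairs: iterable of (image, prompt) from Pick-a-Pic; vqa(image, q) -> str."""
    buckets = defaultdict(list)
    for image, prompt in pairs:
        photoreal = vqa(image, QUESTIONS[0]).strip().lower()   # "yes" / "no"
        category = vqa(image, QUESTIONS[1]).strip().lower()
        buckets[(photoreal, category)].append((image, prompt))
    rng = random.Random(seed)
    subset = []
    for items in buckets.values():
        rng.shuffle(items)
        subset.extend(items[:per_bucket])   # illustrative quota per bucket
    return subset
```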
We randomly split the 17K samples into two subsets, a training set with 16K samples and a validation set with 1K samples.
The distribution of the attributes of the 16K training samples is shown in the supplementary materials.
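The random split itself is straightforward; a minimal sketch is shown below (the seed is arbitrary).

```python
# Sketch: random train/validation split of the 17K annotated samples.
import random

def split_train_val(samples, n_val=1000, seed=0):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    return samples[n_val:], samples[:n_val]   # 16K train, 1K validation
```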
Additionally, we collected rich human feedback on the unique prompts, and their corresponding images, from the Pick-a-Pic test set to form our test set.
In total, we collected rich human feedback on 18K image-text pairs from Pick-a-Pic.
Our RichHF-18K dataset consists of 16K training, 1K validation, and 1K test samples.