Gecko Benchmark Rates How Well AI Image Generators Follow Instructions
Researchers at Google DeepMind have developed a benchmark called Gecko to assess how well text-to-image models, the AI systems that create images from text descriptions, follow instructions.
Previously, it’s been difficult to compare these models because they’re good at different things. For instance, one model might be better at rendering text clearly, while another excels at depicting objects interacting with each other.
Gecko evaluates AI image generators like a human would
Gecko works by first establishing a set of skills that are important for text-to-image models, such as understanding spatial relationships, recognizing actions, and rendering text. These skills are then broken down into even more specific sub-skills.
Next, the system uses another AI model to create prompts (instructions) that specifically test a text-to-image model’s ability in a particular skill or sub-skill. This helps the creators of the text-to-image model identify areas where their model struggles.
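As a rough illustration of the skill breakdown and per-skill diagnosis described above, the sketch below builds one prompt-writing request per sub-skill and then averages scores per skill to surface the weakest area. The three skill names come from the article. The sub-skill names, the instruction wording, the function names, and the example scores are all invented for illustration and are not taken from Gecko itself.

```python
from collections import defaultdict
from statistics import mean

# Skill names are from the article; the sub-skill names are hypothetical.
SKILLS = {
    "spatial relationships": ["left/right", "above/below"],
    "actions": ["single actor", "two actors interacting"],
    "text rendering": ["single word", "short phrase"],
}


def build_prompt_requests(skills):
    """Return one request per (skill, sub-skill) for a prompt-writing model."""
    requests = []
    for skill, sub_skills in skills.items():
        for sub_skill in sub_skills:
            requests.append({
                "skill": skill,
                "sub_skill": sub_skill,
                # Hypothetical wording; the real instructions are not in the article.
                "instruction": f"Write an image prompt that tests '{sub_skill}' ({skill}).",
            })
    return requests


def weakest_skills(scores, skills):
    """Average sub-skill scores into per-skill scores, sorted weakest first."""
    per_skill = defaultdict(list)
    for skill, sub_skills in skills.items():
        for sub_skill in sub_skills:
            if (skill, sub_skill) in scores:
                per_skill[skill].append(scores[(skill, sub_skill)])
    averaged = {skill: mean(values) for skill, values in per_skill.items()}
    return sorted(averaged.items(), key=lambda item: item[1])


requests = build_prompt_requests(SKILLS)

# Made-up scores for an imaginary model, only to exercise the code.
example_scores = {
    ("spatial relationships", "left/right"): 0.5,
    ("spatial relationships", "above/below"): 0.75,
    ("actions", "single actor"): 1.0,
    ("actions", "two actors interacting"): 0.5,
    ("text rendering", "single word"): 0.5,
    ("text rendering", "short phrase"): 0.0,
}
ranking = weakest_skills(example_scores, SKILLS)
print(ranking[0])  # the skill with the lowest average score
```

Keeping the skill label attached to every generated prompt is what makes the later diagnosis possible: a low score can be traced back to a specific skill rather than to the model as a whole.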
Gecko checks the AI’s work with multiple-choice questions
Gecko also assesses how well a text-to-image model follows all the instructions in a prompt. It does this by using another AI model to identify the key details in a prompt and then generate a set of multiple-choice questions about those details. Some questions concern things that are directly visible in the image, while others are more complex and test whether the generated scene as a whole matches the prompt.
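The question-based check can be pictured with a small sketch. The article does not say how Gecko combines the individual answers into a score, so the plain fraction of correct answers used here is an assumption, as are the `Question` class, the `alignment_score` function, and the example prompt and answers. In practice the answers would come from a model that looks at the generated image; here they are typed in by hand.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    choices: list[str]
    expected: str  # the answer implied by the original prompt


def alignment_score(questions, answers):
    """Fraction of questions whose answer matches the expected one.

    Assumption: a simple average. The article does not specify the
    aggregation Gecko actually uses.
    """
    if not questions:
        raise ValueError("need at least one question")
    if len(questions) != len(answers):
        raise ValueError("need exactly one answer per question")
    correct = sum(q.expected == a for q, a in zip(questions, answers))
    return correct / len(questions)


# Hypothetical prompt: "a red cube to the left of a blue ball"
questions = [
    Question("What colour is the cube?", ["red", "blue", "green"], "red"),
    Question("What colour is the ball?", ["red", "blue", "green"], "blue"),
    Question("Is there a ball in the image?", ["yes", "no"], "yes"),
    Question("Where is the cube relative to the ball?", ["left", "right"], "left"),
]

# Hand-written answers standing in for a question-answering model's output.
answers = ["red", "blue", "yes", "right"]

score = alignment_score(questions, answers)
print(score)
```

In this made-up case the image gets the objects and colours right but the layout wrong, so three of the four questions are answered as expected.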
Finally, Gecko compares the answers to these questions with human ratings of the generated images. This helps ensure that Gecko’s evaluation method aligns with how people judge the quality of these images.
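One common way to check whether an automatic score agrees with human judgment is a rank correlation: if people and the metric order the images the same way, the correlation is close to 1. The article does not name the statistic the researchers used, so the Spearman correlation below is an assumption, and the scores and ratings are invented. The sketch uses only the Python standard library.

```python
import math


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented numbers: automatic scores and 1-5 human ratings for four images.
metric_scores = [0.9, 0.4, 0.75, 0.1]
human_ratings = [5, 2, 4, 1]

rho = spearman(metric_scores, human_ratings)
print(rho)
```

A rank correlation is a natural fit here because the automatic score and the human rating use different scales; only the ordering of the images needs to agree.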
Gecko identifies Google’s Muse model as superior
The researchers used Gecko to compare several text-to-image models, including Google’s Muse model, Stable Diffusion 1.5, and SDXL. According to Gecko, Muse performed the best.
Overall, Gecko provides a more objective way to assess text-to-image models by evaluating their ability to follow instructions and generate images that align with those instructions.