Photorealistic images of nonsensical scenarios are an astounding technical achievement, according to Google. But it says its AI system Imagen is not ready for public use, thanks in part to cultural bias in the imagery.
Instructions to produce images such as “teddy bears swimming at the Olympics 400m Butterfly event” (pictured below) or “A giant cobra snake on a farm. The snake is made out of corn.” sound like the work of human artist Jim’ll Paint It. However, they’re actually the work of Google’s Imagen, a challenger to several other systems, most notably OpenAI’s DALL·E 2.
While computers can produce such images much more quickly than a human, the task poses a definite challenge for the AI. It needs to work out exactly what the instructions mean, including identifying the visual components and the way they need to fit together. It then needs to create the components of the image from its reference library and arrange them in the correct manner. Finally, it needs to make sure the parts of the image fit together in a “realistic manner” with appropriate shadows and other visual effects.
According to a research paper from Google, scaling the text encoder, which figures out exactly what the image should be, had a bigger effect on results than scaling the diffusion model, which creates the imagery. As you’d expect, it says its studies showed human raters tended to find Imagen’s images more realistic and closer to the text input than those from other systems.
However, Google won’t be releasing the code or running a public demonstration at the moment, for two reasons. First, it wants to explore how potential misuse could affect society. Second, it fears the dataset it used means the system may still have:
an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes… even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects.