Hold my Shiba Inu

The AI world is still figuring out how to handle the amazing display of ability that is DALL-E 2's capacity to draw/paint/imagine just about anything… but OpenAI isn't alone in working on something like this. Google Research has rushed to reveal a similar model it's been working on, which it claims is even better.

Imagen (get it?) is a text-to-image diffusion-based generator built on large transformer language models that… okay, let's slow down and unpack that real quick.

Text-to-image models take text input like “a dog on a bicycle” and produce a corresponding image, something that has been done for years, but has recently made huge strides in quality and accessibility.

Part of that involves using diffusion techniques, which basically start with an image of pure noise and slowly refine it bit by bit until the model decides it can't make it look any more like a dog on a bike than it already does. This was an improvement over top-to-bottom generators that could get it hilariously wrong on the first guess, and others that could easily be led astray.
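The refinement loop described above can be sketched in a few lines. This is a toy illustration of the shape of reverse diffusion, not Imagen's actual sampler; `denoise_step` stands in for a trained denoising model, and the step size and loop structure are simplified assumptions.

```python
import numpy as np

def toy_reverse_diffusion(denoise_step, shape=(64, 64), steps=50, seed=0):
    """Toy sketch of the reverse diffusion loop: start from pure noise
    and repeatedly nudge the image toward what a denoiser predicts.
    `denoise_step` is a stand-in for a trained model."""
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(steps)):
        predicted = denoise_step(image, t)         # model's guess at a cleaner image
        image = image + 0.1 * (predicted - image)  # small refinement step
    return image

# A dummy "denoiser" that just pulls pixel values toward zero.
result = toy_reverse_diffusion(lambda img, t: img * 0.5, steps=10)
print(result.shape)  # (64, 64)
```

A real model's `denoise_step` is a large neural network conditioned on the text prompt, which is what steers the noise toward "a dog on a bicycle" rather than toward gray mush.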

The other part is improved language understanding through large language models using the transformer approach, the technicalities of which I can't (and won't) get into here, but it and a few other recent advances have led to convincing language models like GPT-3 and others.

Examples of Imagen generated art.

Image Credits: Google Research

Imagen starts by generating a small image (64×64 pixels) and then does two "super-resolution" passes on it to bring it up to 1024×1024. This isn't like normal upscaling, though: AI super-resolution creates new details in harmony with the smaller image, using the original as a basis.

For example, say you have a dog on a bicycle and the dog's eye is 3 pixels across in the first image. Not much room for expression! But in the second image, it's 12 pixels across. Where do the details needed for this come from? Well, the AI knows what a dog's eye looks like, so it generates more detail as it draws. Then this happens again when the eye is done again, but at 48 pixels across. But at no point did the AI have to just produce 48 pixels' worth of dog eye from its… let's say magic bag. Like many artists, it started with the equivalent of a rough sketch, filled it out in a study, then really went to town on the final canvas.
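The shape of that 64→256→1024 cascade can be sketched as follows. This is only a structural illustration under loud assumptions: `fake_super_res` just repeats pixels, whereas Imagen's super-resolution stages are learned diffusion models that synthesize genuinely new detail at each step.

```python
import numpy as np

def fake_super_res(image, factor):
    """Stand-in for a learned super-resolution model. A real model would
    hallucinate new detail consistent with the input; here we just repeat
    pixels (nearest-neighbor) to show the shape of the pipeline."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def imagen_style_cascade(base):
    """Sketch of the cascade: a 64x64 base image is upscaled
    64 -> 256 -> 1024 by two super-resolution stages."""
    stage1 = fake_super_res(base, 4)    # 64x64   -> 256x256
    stage2 = fake_super_res(stage1, 4)  # 256x256 -> 1024x1024
    return stage2

base = np.zeros((64, 64))
print(imagen_style_cascade(base).shape)  # (1024, 1024)
```

Generating small and upscaling in stages is what makes the approach tractable: each stage works on a manageable image size instead of denoising a million pixels at once.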

This is not unprecedented; in fact, artists working with AI models already use this technique to create pieces that are much larger than what the AI can handle in one sitting. If you split a canvas into several pieces and super-resolve them all individually, you end up with something much larger and more intricately detailed; you can even do it repeatedly. (An artist I know has produced some interesting examples this way.)
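The split-and-upscale trick can be sketched like this. Again a toy: `upscale` here is a hypothetical stand-in for a real per-tile super-resolution model, and the tiling ignores the seam-blending a real workflow would need.

```python
import numpy as np

def upscale_in_tiles(canvas, tile=64, factor=4):
    """Split a canvas into tiles, 'super-resolve' each one independently,
    and stitch the results back together into one larger image."""
    def upscale(t):
        # Stand-in for a learned super-resolution model.
        return np.repeat(np.repeat(t, factor, axis=0), factor, axis=1)
    rows = []
    for y in range(0, canvas.shape[0], tile):
        row = [upscale(canvas[y:y + tile, x:x + tile])
               for x in range(0, canvas.shape[1], tile)]
        rows.append(np.hstack(row))
    return np.vstack(rows)

canvas = np.zeros((128, 128))
print(upscale_in_tiles(canvas).shape)  # (512, 512)
```

Because each tile is processed on its own, the technique can be applied repeatedly to grow an image far beyond what the model could generate in one pass.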

The claims Google's researchers make for Imagen are several. They say that existing text models can be used for the text-encoding part, and that their quality matters more than simply increasing visual fidelity. That makes sense intuitively, since a detailed picture of nonsense is definitely worse than a slightly less detailed picture of exactly what you asked for.

For example, in the paper describing Imagen, they compare its results with DALL-E 2's for the prompt "a panda making latte art." In all of the latter's images, the latte art is of a panda; in most of Imagen's, it's a panda making the art. (Neither was able to depict a horse riding an astronaut, showing the opposite in all attempts. It's a work in progress.)

Computer generated images of pandas making or being latte art.

Image Credits: Google Research

In Google's tests, Imagen came out ahead in human evaluations, both for accuracy and for fidelity. This is obviously quite subjective, but to even match the perceived quality of DALL-E 2, which until now has been regarded as a giant leap over everything else, is pretty impressive. I'll only add that while it's pretty good, none of these images (from any generator) will withstand more than cursory scrutiny before people notice they've been generated, or have serious suspicions.

However, OpenAI is a step or two ahead of Google in a couple of ways. DALL-E 2 is more than a research paper; it's a private beta with people using it, just as they used its predecessor and GPT-2 and 3. Ironically, the company with "open" in its name has focused on productizing its text-to-image research, while the fabulously profitable internet giant has yet to attempt it.

That's more than apparent from the choice DALL-E 2's researchers made to curate the training dataset ahead of time and remove any content that might violate their own guidelines. The model couldn't make anything NSFW if it tried. Google's team, however, used some large datasets known to include inappropriate material. In an insightful section on the Imagen site describing "Limitations and Societal Impact," the researchers write:

Downstream applications of text-to-image models are varied and can affect society in complex ways. The potential risks of abuse raise concerns about responsible open sourcing of code and demos. At this point, we have decided not to release any code or a public demo.

The data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, data sets of this nature often reflect social stereotypes, oppressive views, and derogatory or otherwise harmful associations with marginalized identity groups. While a subset of our training data was filtered to remove noise and unwanted content, such as pornographic images and toxic language, we also used the LAION-400M dataset which is known to contain a wide variety of inappropriate content, including pornographic images, racist statements and harmful social stereotypes. Imagen relies on text encoders trained on web-scale uncurated data, thus inheriting the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, leading us to our decision not to release Imagen for public use without further safeguards.

While some may make hay of this, saying Google fears its AI may not be sufficiently politically correct, that's an uncharitable and short-sighted view. An AI model is only as good as the data it's trained on, and not every team can spend the time and effort it takes to weed out the really awful stuff these scrapers pick up as they assemble datasets of millions of images or billions of words.

Biases like these are meant to emerge during the research process, exposing how the systems work and providing an unobstructed testing ground for identifying these and other limitations. How else would we know that an AI can't draw hairstyles common among Black people, hairstyles any kid could draw? Or that, when asked to write stories about work environments, the AI invariably makes the boss a man? In these cases an AI model is working perfectly and exactly as it was designed: it has successfully learned the biases that pervade the media it was trained on. Not unlike people!

But while unlearning systemic bias is a lifelong project for many humans, an AI has it easier, and its creators can remove the content that caused it to behave badly in the first place. Perhaps one day there will be a need for an AI that writes in the style of a racist, sexist pundit of the 1950s, but for now the benefits of including that data are small and the risks large.

In any case, Imagen, like the others, is clearly still in the experimental phase: not ready to be deployed in anything other than a strictly human-supervised fashion. As Google makes its capabilities more accessible, we're sure to learn more about how and why it works.