Forum: AI Generated Art


Subject: Turn words to art - AI

3D-Mobster opened this issue on Jul 09, 2022 · 199 posts


parkdalegardener posted Sat, 22 October 2022 at 8:37 AM

randym77 posted at 3:35 PM Thu, 20 October 2022 - #4447150

I have to assume Google is involved somehow. I suppose different AI art programs have different methods.

People have taken to posting images on social media of AI-generated posters for movies. And the text is often similar, but not the same. For example, "The Handmaid's Tale" from Dall-e came out "The Mandal" or "The Thandal."  "The Two Towers" came out "Trill Rows."

The images were a decent match. Women in hoods or veils for the former, two medieval towers for the latter.

These AI programs are pretty tied into pop culture. They will produce celebrities, fictional characters, etc., pretty accurately. In a toon or hand-drawn looking style if you want. Putting in "Star Trek" can generate reasonable likenesses of Kirk and Spock.

OK. I'm going to dumb this down a bit because it is a weird concept for some to grasp. All these programs use a machine learning technique to identify objects in an image by being taught that this image is a particular thing.

Traditional computer vision and image recognition models work the same way. They are taught truth and lie using 100% truth or 100% lie. This picture is an apple. So is this one and this one as well. Three pictures of an apple and all are labelled apple. 100% true. Then you give a picture of an orange labelled as an apple and tell the ai you are training that this is a lie. You have no confidence that your picture of an orange is actually an apple. 

Diffusion models are trained differently. You start with a picture of an apple and add a set amount of noise to your apple image and tell the ai that the apple image and the noisy apple image are the same thing. An apple. Then you add even more noise to the already noisy image and say this is an apple. Rinse and repeat a specific number of times and you are left with an image of just noise that the ai knows as apple.
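For anyone who wants to poke at the "keep adding noise" idea, here is a toy sketch in plain Python. It only covers the noising half described above; the part where a network learns to undo the noise is not shown. The names (add_noise, noising_chain, correlation), the 256-value stand-in "apple", the 50 steps, and the 0.1 noise amount are all made up for illustration and do not come from any real diffusion code.

```python
import math
import random

def add_noise(pixels, beta, rng):
    """One noising step: shrink the signal a little and mix in Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    mix = math.sqrt(beta)
    return [keep * p + mix * rng.gauss(0.0, 1.0) for p in pixels]

def noising_chain(pixels, steps=50, beta=0.1, seed=0):
    """Return [original, slightly noisy, ..., almost pure noise]."""
    rng = random.Random(seed)
    chain = [list(pixels)]
    for _ in range(steps):
        chain.append(add_noise(chain[-1], beta, rng))
    return chain

def correlation(a, b):
    """Pearson correlation: near 1.0 means b still looks like a."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Stand-in "apple": 256 pixel values in a simple checker pattern (an assumption).
apple = [1.0 if i % 2 == 0 else -1.0 for i in range(256)]
chain = noising_chain(apple)
print("after 1 step:  ", round(correlation(apple, chain[1]), 2))
print("after 50 steps:", round(correlation(apple, chain[-1]), 2))
```

The first number should stay close to 1 (still recognizably the apple) and the second should drop toward 0 (basically static).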

Still with me? The ai just knows items as noise at this point in the process. Apple, banana, grape, orange. We teach these items to the ai along with containers such as cups and bowls. To the ai it is just a pile of noise like the static you used to see on an old tv getting a signal with a pair of rabbit ears. Now the fun starts.

You type a text prompt that says you want to see a still life painting of a bowl of fruit. The ai then parses the prompt with CLIP (language training) and attempts to denoise images of apples, bowls, grapes, and the rest of your prompt items from a patch of noise the size of the image. That patch is usually square, because the first step in training any type of ai is consistency of input, and most current diffusion models are trained on 512x512 images. Now the ai just denoises whatever it has been trained upon so as to fit into the space. The result is as many different bowls of fruit as you want.

The noise in the square is a random pattern based upon the seed value used to generate the image. That is why you get different bowls with different fruit. The ai is looking through the noise pattern to see if it has a bowl, banana, apple, or whatever that it can denoise to fit the noise in the space. It has learned many different versions of bowl, apple, or whatever in the training, so it has many noise patterns to choose from. The "steps" used in the generation of the image control how much denoising happens, and thus the juxtaposition of the final elements of your image.
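The seed part is easy to show with a few lines of plain Python. This is only a sketch of the idea that the same seed gives the same starting static; starting_noise is a made-up helper, not a function from any diffusion package, and real tools use their own random number generators.

```python
import random

def starting_noise(seed, width=512, height=512):
    """Build a height x width grid of Gaussian static from a seed (toy helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]

# Small grids here so it runs instantly; 512x512 works the same way.
same_a = starting_noise(seed=1234, width=8, height=8)
same_b = starting_noise(seed=1234, width=8, height=8)
other = starting_noise(seed=1235, width=8, height=8)

print("same seed, same static:     ", same_a == same_b)
print("different seed, same static:", same_a == other)
```

Same seed plus same prompt and settings means the ai starts from the same static, which is why reusing a seed lets you reproduce an image, and changing it gets you a different bowl of fruit.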


There are other rules involved in the placing of the objects in the scene. Aesthetics are part of the learning. That is a big part of the training image classification system and has led to some training bias in the diffusion models. Some things are also specifically sketchy in outputs, such as text. Most diffusion setups are not designed to render legible text on purpose. Thus the fictional movie posters that almost seem readable.


The image scrapes done for training are not specific to Google. Not by a long shot. The LAION-5B data set has been released to the public. This is the training set for "Stable Diffusion". That's almost 6 billion image and text pairs and a search base of 1.6 trillion data sets. Google's is considerably smaller.


The training is not tied to pop culture but to popularity (sure are a lot of apple images here to train on) and aesthetic biases (I like green apples better than red apples).