Exploring the Limitations of AI Image Generators: Can They Count and Write?
We have been amazed by the impressive speed at which generative AI tools like Midjourney, Stable Diffusion, and DALL-E 2 can create stunning images.
Despite these achievements, there is still a puzzling gap between what AI image generators can produce and what humans find trivially easy.
For example, these tools often do not produce satisfactory results for seemingly simple tasks such as counting objects and producing accurate text.
If generative AI has reached unprecedented heights in creative expression, why is it struggling with tasks that even a primary school student could complete?
Examining the underlying reasons sheds light on how these systems work under the hood and on the nuances of their capabilities.
Limitations of AI for writing
Humans can easily recognize text symbols (such as letters, numbers and signs) whether they are printed in different fonts or written by hand. We can also produce text in different contexts and understand how context can change its meaning.
Current AI image generators lack this inherent understanding. They have no real idea of what the symbols in the text mean.
These generators are built on artificial neural networks trained on huge amounts of image data from which they “learn” associations and make predictions.
Combinations of shapes in the training images become associated with different entities. For example, two lines slanting inward to meet at a point can represent the tip of a pencil or the roof of a house.
But when it comes to text and numbers, the associations have to be incredibly precise, because even the smallest flaw is noticeable. Our brains can overlook a slight deviation in a pencil tip or a roofline, but not in the spelling of a word or the number of fingers on a hand.
To a text-to-image model, text characters are just more combinations of lines and shapes. Because text comes in so many different styles, and because letters and numbers are used in seemingly endless arrangements, the model often fails to learn how to reproduce text reliably.
The main reason for this is insufficient training data: AI image generators need far more training examples to represent text and numbers accurately than they do for most other tasks.
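This is easy to observe first-hand. The sketch below, which assumes the open-source Hugging Face diffusers library, a CUDA GPU and an illustrative Stable Diffusion checkpoint (the model ID and prompt are our own examples, not taken from any of the platforms above), asks the model to render a short piece of text inside an image. The lettering in the result is typically distorted or misspelled.

```python
# Minimal sketch: probing a text-to-image model with a prompt that asks for
# rendered text. Assumes the Hugging Face `diffusers` library, a CUDA GPU
# and an illustrative model checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Ask the model to draw legible text; the letters usually come out garbled,
# because the model has only learned text as shapes, not as symbols.
image = pipe('a shop window with a sign that reads "OPEN"').images[0]
image.save("sign.png")
```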
The Tragedy of AI Hands
Problems also arise with smaller objects that require intricate detail, such as hands.
In training photos, hands are often small, holding objects, or partially hidden by other elements. This makes it hard for the AI to connect the concept of a “hand” with an accurate representation of a human hand with five fingers.
As a result, AI-generated hands often look misshapen, have too many or too few fingers, or are partly concealed by objects such as sleeves or handbags.
We see a similar problem with quantities. AI models have no clear understanding of amounts, such as the abstract concept of “four”.
So an image generator responding to the prompt “four apples” draws on the countless training images it has seen of various numbers of apples, and often returns an image with the wrong count.
In other words, the enormous variety of associations in the training data affects the accuracy of the results.
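As a rough illustration (under the same assumptions as the earlier sketch: the diffusers library, a CUDA GPU and an illustrative model ID), generating the same “four apples” prompt with several random seeds usually produces images whose apple counts differ from one another, and often from four, because the model is sampling from learned associations rather than counting.

```python
# Sketch: sampling the same counting prompt with different random seeds.
# Assumes the Hugging Face `diffusers` library and a CUDA GPU; the model ID
# and prompt are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for seed in range(4):
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe("four apples on a wooden table", generator=generator).images[0]
    # Count the apples by eye: the number often differs from four
    # and varies from seed to seed.
    image.save(f"apples_seed{seed}.png")
```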
Will AI ever be able to write and count?
It’s important to remember that text-to-image and text-to-video generation are relatively new concepts in AI. Today’s generative platforms are “low-resolution” versions of what we can expect in the future.
As training methods and AI technology advance, future image generators are likely to become far better at producing accurate visuals.
It is also worth noting that most publicly available AI platforms do not offer the most capable versions of these models. Generating accurate text and quantities requires heavily optimized, purpose-built networks, so paid subscriptions to more advanced platforms are likely to yield better results.