In my previous post on this subject, I gave a brief outline of how ChatGPT is based on a neural network that has been trained on a huge sample of digital documents to recognize human-written text, and how it generates responses to user prompts and questions by using that recognition capability to decode which token to add to the end of the conversation, thus building a response one token at a time. Now I’d like to talk about some of the reasons things can go wrong.
(Note: OpenAI has released new versions of some of its AI products, so some of the details in this post and the previous one may have been overtaken by events. I believe the general ideas remain just as sound.)
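Before getting to the problems, here is a minimal sketch in Python of the token-by-token loop described above. The tiny lookup-table “model,” its vocabulary, and its probabilities are all invented for illustration; a real large language model scores candidate tokens with a neural network that looks at the entire conversation so far. Only the shape of the loop — score the candidates, pick one token, append it, repeat — is meant to match what I described.

```python
import random

# A toy stand-in for the real model: for the most recent token, it gives a
# made-up probability for each possible next token. A real large language
# model computes these probabilities with a neural network that looks at
# the entire conversation so far, not just the last token.
TOY_MODEL = {
    "<start>": {"The": 0.9, "A": 0.1},
    "The":     {"sky": 0.6, "cat": 0.4},
    "A":       {"sky": 0.5, "cat": 0.5},
    "sky":     {"is": 1.0},
    "cat":     {"is": 1.0},
    "is":      {"blue": 0.7, "soft": 0.3},
    "blue":    {".": 1.0},
    "soft":    {".": 1.0},
    ".":       {"<end>": 1.0},
}

def generate(max_tokens=20):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        # Score the candidates, pick one token, append it, repeat.
        probs = TOY_MODEL[tokens[-1]]
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

print(generate())  # e.g. "The sky is blue ."
```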
What can go wrong?
Does any of this even remotely resemble the way humans answer questions? ChatGPT works only with raw text. It doesn’t gather facts or construct arguments or develop an outline of its response. It doesn’t even plan the words in its sentences. It just generates text one word at a time. Perhaps a better question is: how can this possibly work?
How can this possibly work?
The basic answer is that ChatGPT is leveraging human intelligence. It is digesting billions of words of text constructed by millions of human beings and regurgitating those words as its answers.
Language has meaning. That meaning is encoded in the words and sentences and paragraphs and documents we produce. We know this is true because we are able to communicate with each other through writing: You have an idea in your head, you write about it, I read your writing, and now I have the same idea in my head. ChatGPT works by responding to and generating language, and because language encodes meaning, ChatGPT responses can be meaningful.
So while ChatGPT uses a very non-human word-by-word text generation process, that process is controlled by a neural network trained on writing by actual thinking humans. ChatGPT may be generating responses one word at a time, but it thinks very hard about each word and tries to come up with something a human would choose. And as is often the case with neural networks, the result really does feel like something a human would produce.
I’m not an expert on neural networks, so I can’t explain in much greater detail why ChatGPT works, but I do know that even experts don’t understand everything about how neural networks produce outputs. Neural networks are famously opaque. It’s not that we can’t see what the network is doing (all the data is right there in the computer) but that these networks are very large, and they operate in ways that are hard for humans to understand. We are used to simple discrete decisions — yes or no, pick one of these five answers, choose your destination city — whereas neural networks like ChatGPT do a lot of weighing and balancing of alternatives, to the tune of billions of arithmetic computations per word.
Neural net experts have tools to help, but in most cases we don’t really understand why neural nets give the answers that they do. It’s my understanding that there’s not a lot of strong theory about how neural networks behave. Most of what experts know is the history of what kinds of networks have worked, without necessarily understanding why.
Consequently, it’s not unusual for neural networks to do something unexpected. Large language models like ChatGPT, for example, turn out to be better at generalizing their knowledge than anyone expected. And it was a surprise that deep-learning image generators could ape the style of known artists just by asking, e.g. “Draw Abraham Lincoln as a Jack Kirby comic character.” In fact, it had even been a surprise that generation of text and images worked as well as it did.
So what can go wrong?
The most traditional problem facing ChatGPT (and any similar machine learning AI) is the ancient enemy of data processing everywhere: Garbage In, Garbage Out (GIGO). If the ChatGPT model was trained on documents that had incorrect information, it could regurgitate that incorrect information in response to a question.
One subcategory of GIGO that has received a lot of attention is bias, probably because it is easy to understand without knowing much about ChatGPT, which makes it easy pickings for people who need a hot take for their ideological agenda. In the past few months I’ve seen ChatGPT accused of being racist, antisemitic, sexist, Zionist, and woke. I’m not saying that it isn’t, but I think a better explanation is that ChatGPT doesn’t know what it doesn’t know.
As an experiment, I tried to get ChatGPT to recite “Mary Had a Little Lamb” with the word “Mary” changed to “Edward.” I wanted to see how ChatGPT would handle the pronoun in the second verse. As I expected, it changed it from “her” to “him.” I spent some time giving ChatGPT increasingly explicit instructions about what I wanted, but it was seemingly unable to break the gender connection between the proper names and the pronouns. At one point I thought I had it, when it started the second verse with “It followed her to school one day,” as I wanted, but then it changed “Edward” back to “Mary” in the remaining verses.
It might seem that this is an example of some kind of bias regarding gender, but I think it is better described as a problem of ignorance. It’s quite likely that none of the documents ChatGPT ingested during training ever used the pronoun “him” to refer to someone named “Mary” or the pronoun “her” for someone named “Edward.” Consequently, any attempt to use those words in this way is scored very low by the neural network and the tokens are rejected by the text generation algorithm.
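To illustrate what “scored very low” means in practice, here is a toy calculation. The scores (logits) below are made up, as is the three-word vocabulary; the point is only that once the network has put nearly all of the probability on one candidate, the usual decoding settings will almost never emit the alternative, no matter how explicit the instructions were.

```python
import math

# Invented scores (logits) for the next token after "...Edward. It followed",
# meant to mimic a model whose training data never pairs "Edward" with "her".
logits = {"him": 9.0, "it": 4.0, "her": 1.0}

def softmax(scores):
    # Convert raw scores into probabilities that sum to 1.
    exps = {tok: math.exp(s) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

for tok, p in sorted(softmax(logits).items(), key=lambda kv: -kv[1]):
    print(f"{tok!r}: {p:.4f}")
# 'him' ends up with roughly 99% of the probability and 'her' with a few
# hundredths of a percent. Greedy decoding always takes the top token, and
# top-k/top-p sampling rarely reaches the bottom one, so "her" effectively
# never appears, however firmly the user asks for it.
```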
NPR reports on a similar issue affecting the Midjourney image generation AI:
A researcher typed sentences like “Black African doctors providing care for white suffering children” into an artificial intelligence program designed to generate photo-like images. The goal was to flip the stereotype of the “white savior” aiding African children. Despite the specifications, the AI program always depicted the children as Black. And in 22 of over 350 images, the doctors were white.
AI experts are pretty sure this happens because stock imagery inventories and journalist photo archives are filled with images of white western doctors helping black African children. But they have few images of black African doctors helping white children, at least not that have associated keywords that Midjourney needs to help identify the content.
As an experiment, I asked Midjourney for a “Photorealistic image of a SWAT team member holding a rifle” and these are what it gave me:
Note that the rifles are very detailed. They may not exist in the real world, but the renders are easily recognizable as AR-15 pattern rifles with recognizable parts like buttstocks, buffer tubes, ejection ports, forward assists, magazines, pic rails, optics, iron sights, and barrel shrouds.
Then I changed a few words, asking for a “Photorealistic image of a drag queen holding a rifle” and Midjourney produced this mess:
Most of the “rifles” are vague collections of gun-like parts. Half the drag queens aren’t holding them in remotely realistic ways, and one of them appears to have some kind of gun part attached to her wrist.
I think Midjourney is providing such wildly different renderings of the rifles because its training set had plenty of pictures of SWAT team guys holding AR-15 rifles, so it had no trouble combining them in an image. On the other hand, the training set might not have had any pictures of drag queens and rifles in the same image, so Midjourney is a little lost finding rifles to go with the ladies.
(I have no idea why Midjourney has drawn the drag queens the way it has. I think it might be confusing drag queens with plain old queens. Or maybe the only images of women with rifles in its training set came from the American southern states.)
In any case, I think ChatGPT has similar problems stitching together words about two or more things that people rarely write about together.
Perhaps the most startling problem with ChatGPT can be characterized as Nothing In, Garbage Out. That’s where ChatGPT hasn’t learned anything directly relevant to the prompt it’s been given, but instead of saying it knows nothing, it finds something that sounds similar enough to fool it into attempting an answer, which turns out to be nonsense. In AI terminology, this is known as “hallucination.”
You can’t make ChatGPT hallucinate with a prompt that is complete nonsense, like a random list of English words:
Prompt: Zinnia encouragement boon, shrilling cloverleaf straggler firestorm, telemarketing premeditate congregational choose, insupportable, diastolic phoenix pest.
ChatGPT: It seems like you’ve provided a series of random words or phrases that don’t form a coherent sentence or question. How can I assist you today? If you have any specific questions or need information on a particular topic, please feel free to ask, and I’ll be happy to help.
ChatGPT spotted that for what it was. But here’s an example of a prompt asking about a nonsense technology that I just made up.
Prompt: Under what circumstances is it necessary to recalibrate reverberation stabilizers when updating hydroscopic thruster assemblies?
Neither “reverberation stabilizers” nor “hydroscopic thruster assemblies” are real things, but they sound close enough that the neural network thinks it recognizes them, and in now-familiar ChatGPT fashion, it confidently gives a detailed lecture on the subject:
ChatGPT:
The need to recalibrate reverberation stabilizers when updating hydroscopic thruster assemblies is highly specific to the design and engineering requirements of the particular system in question. In general, the recalibration of reverberation stabilizers may be necessary when there are significant changes made to the hydroscopic thruster assemblies or the surrounding systems that could impact their performance. Here are some circumstances in which recalibration might be necessary:
1. Thruster Assembly Modifications: If you make substantial modifications or upgrades to the hydroscopic thruster assemblies, such as changes in the thruster design, size, or output, it can affect the overall system dynamics. In such cases, the reverberation stabilizers may need to be recalibrated to accommodate these changes.
2. System Integration: When integrating hydroscopic thruster assemblies into a larger system, such as a marine vessel or underwater vehicle, changes in the system’s overall architecture may necessitate recalibration of the stabilizers to ensure optimal performance and stability.
ChatGPT continued its answer for another six paragraphs, ending with the rather amazing warning that
It’s essential to consult the documentation and experts in the field to ensure that the recalibration process is carried out correctly and safely, as these systems often involve complex physics and engineering principles.
Of course, in this case I knew the answer would be nonsense because I was deliberately asking about something I made up. The problem is when you ask a legitimate question about real things, but ChatGPT makes up an answer anyway. It can be very convincing, even while hallucinating. By now we’ve all seen news stories about the lawyers who got in trouble using ChatGPT because it hallucinated the legal citations they filed in court.
As I outlined in my previous post, at its heart ChatGPT is only a recognition machine for human-written language. It recognizes text that looks like human language, and when answering a question it uses that recognition capability to guide its response to build up a combined text — question and answer — that strongly resembles human-written language. And that’s really all it can guarantee: That the result will look a lot like a human wrote it. ChatGPT is very much about things that look like correct answers, whether they are correct or not.
This becomes clear when you ask ChatGPT to solve simple math problems such as multiplying two numbers:
Prompt: What’s the product of 123 and 456?
ChatGPT: The product of 123 and 456 is 56,088.
That’s correct. I wouldn’t be surprised if the product of 123 and 456 appears somewhere in ChatGPT’s vast training set. But add one more digit…
Prompt: What’s the product of 1234 and 5678?
ChatGPT: The product of 1234 and 5678 is 7,019,652.
Nope, the actual product of 1234 and 5678 is 7,006,652. Note that ChatGPT got the first digit and the last three digits correct, but guessed wrong in the middle. This result shows just how much ChatGPT thinks of things as text — it doesn’t even understand simple arithmetic.
In fact, look at how ChatGPT tokenizes the numbers:
Prompt: What • ‘s • the • product • of • 12 • 34 • and • 5 • 678 • ?
ChatGPT: The • product • of • 12 • 34 • and • 5 • 678 • is • 7 • , • 019 • , • 652 • .
It broke the numbers in the prompt into two tokens each, and it assembled the resulting product from three tokens (five if we count punctuation). ChatGPT doesn’t even understand numbers. They’re just another sequence of tokens as far as ChatGPT is concerned, like everything else.
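If you want to see this splitting for yourself, OpenAI’s tiktoken library exposes the tokenizers their models use. This is my own check rather than part of the experiment above, and the exact splits depend on which tokenizer your model version uses:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by the GPT-3.5/GPT-4-era chat models;
# other model versions split text differently.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["What's the product of 1234 and 5678?", "7,019,652"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(repr(text), "->", pieces)
# The digits come back as short multi-digit chunks rather than whole numbers.
# Those chunks are all the model ever "sees".
```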
It’s a little clearer what’s going on if we use larger numbers:
Prompt: What’s the product of 2602201353 and 5183289370?
ChatGPT: The product of 2,602,201,353 and 5,183,289,370 is 13,495,242,206,931,301,010.
Again, ChatGPT got the beginning and end right, but the middle digits are incorrect. (The correct answer is 13,487,962,611,604,517,610.)
This makes some sense because when you multiply two numbers, the first or last few digits of the result can be estimated by multiplying the first or last few digits of the numbers being multiplied. In this case, multiplying the first three digits of both numbers, 260 x 518, results in an answer that begins with “134”, just like ChatGPT’s answer, and multiplying the last two digits of both numbers, 53 x 70, results in an answer that ends in “10”, also just like ChatGPT’s answer. All these numbers are small enough for ChatGPT to have learned something about them in its vast collection of training documents.
In addition, the result of multiplying two numbers is often a number as long as both numbers combined — a pattern simple enough for ChatGPT to have learned it — so multiplying two 10-digit numbers resulted in a 20-digit number, just like ChatGPT’s answer. Basically, ChatGPT used these patterns to get three things right: The first three digits, the last two digits, and the total number of digits. Everything else about ChatGPT’s answer is nonsense.[1]
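Here is a quick check of those patterns using Python’s exact integer arithmetic (my own verification, not anything ChatGPT produced):

```python
a, b = 2602201353, 5183289370

print(a * b)            # 13487962611604517610  <- the correct product
print(len(str(a * b)))  # 20 digits, the length ChatGPT also got right

# The leading digits follow from the leading digits of the factors...
print(260 * 518)        # 134680 -> the product starts with "134"
# ...and the trailing digits depend only on the trailing digits of the factors.
print((53 * 70) % 100)  # 10 -> the product ends in "10"

# The smaller example from above behaves the same way:
print(1234 * 5678)      # 7006652, not ChatGPT's 7019652
```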
I ran into a similar problem while writing my previous post. I wanted to show a diagram of a neural network, but I didn’t want to just steal one from somebody else, so I decided to try using Mathematica to create one. Since I didn’t know enough Mathematica to create such a diagram off the top of my head, I asked ChatGPT to write the code for me. The result looked great — it invoked a bunch of Mathematica’s graph functions and the algorithm looked like a plausible solution — but when I ran it, the output was gibberish. I ended up having to read through the Mathematica documentation to generate the diagram.
I think that’s characteristic of a lot of ChatGPT failures: When you give it a prompt, its neural net recognizes bits and pieces of the problem domain and generates the appropriate bits and pieces of a solution, but then it fills in the paths between the accurate parts with whatever looks good.
That makes sense. ChatGPT is just a recognizer of next tokens. It has no explicit mechanism for dealing with concepts or facts or reason. The underlying large language model manages to capture the inherent concepts, facts, and reasoning that underlie the vast training set, and ChatGPT can therefore write text that seems to have concepts, facts, and reasoning. But as we’ve seen, it only ever generates answers one token at a time, so it’s not actually doing any reasoning.
Furthermore, because of the 4096-token window size through which it analyzes the world, ChatGPT is incapable of understanding or expressing ideas that require more than about 3000 words. If a paragraph around word 5000 refers back to something that was mentioned in the first thousand words, ChatGPT can’t understand the connection.
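As a rough sketch of what that limit means, imagine the conversation as a list of tokens and the model only ever receiving the most recent slice of it. (The 4096 figure was the window of the model version I was using; real systems count tokens with the model’s tokenizer and may truncate or summarize the history in more sophisticated ways.)

```python
# Treat each list element as one token; a real system would count tokens
# with the model's tokenizer rather than using a plain Python list.
CONTEXT_WINDOW = 4096

def visible_context(conversation_tokens):
    # Only the most recent CONTEXT_WINDOW tokens are fed to the network when
    # it predicts the next token; anything earlier simply isn't in its input.
    return conversation_tokens[-CONTEXT_WINDOW:]

tokens = [f"tok{i}" for i in range(6000)]
print(len(visible_context(tokens)))  # 4096
print(visible_context(tokens)[0])    # 'tok1904' -- tokens 0-1903 are invisible
```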
Finally, ChatGPT is only capable of forward motion. Ask ChatGPT a question, and it will start generating the answer, token by token, without ever looking back. No, that’s not quite right: ChatGPT does look back at its answer because the token history is needed to make the neural net predict the next token. But ChatGPT never looks back with an eye to changing anything.
For example, when I asked ChatGPT to do the simple math above, at no point did it do the AI equivalent of thinking, “Hey, math is tricky. I should double-check that number to make sure I got the multiplication right.”
That’s not how humans perform similar intellectual tasks. Given the opportunity, we like to think about our answers before acting on them. We give them a sanity check, and think about ways they could be wrong. Sometimes we come up with several possible answers and explore the consequences of each one. We talk to other people about our problems and look for historical examples of how others have handled similar questions. If our answers turn out to be wrong, we get feedback and correct them.
ChatGPT is capable of doing something superficially similar to that last step: If you tell it an answer was wrong, it will try again, but it’s not really an iterative process. ChatGPT is just grinding out more tokens. ChatGPT never reviews what it wrote to check the facts or verify its reasoning. It never searches the web to add details, or asks someone for help. It never does rewrites.
To paraphrase Omar Khayyam, ChatGPT writes, and having writ, moves on: Neither logic nor reason shall lure it back to cancel half a line, nor all thy facts wash out a word of it.
Footnotes
[1] I’m pretty sure the correct “2” in the middle is random chance.