
9.11 or 9.9 — which one is greater? | by Armin Catovic | Jul, 2024

The ChatGPT prompt "9.11 or 9.9 — which one is greater?", together with its (incorrect) response, was recently shared and re-posted on LinkedIn numerous times. It was given as solid proof that AGI is just not there yet. Further re-posts also pointed out that re-arranging the prompt to "Which one is greater: 9.11 or 9.9?" ensures a correct answer, further emphasizing the brittleness of LLMs.

After evaluating both prompts against a random group of ChatGPT users, we found that in both cases the answer is wrong about 50% of the time. As some users have correctly pointed out, there is a subtle ambiguity in the question: are we referring to the mathematical inequality of two real numbers, to two dates (e.g. September 11 vs September 9), or to two sub-sections in a document (e.g. chapter 9.11 vs chapter 9.9)?

We decided to perform a more controlled experiment using the OpenAI APIs. This way we have full control over both the system prompt and the user prompt, and we can take sampling uncertainty out of the equation as far as possible, e.g. by setting the temperature low.

The final results are very interesting!

Our hypotheses can be stated as follows:

  • Given the same prompt, without any additional context, and with the temperature kept close to zero, we should nearly always obtain the same output, with stable log probabilities. While people refer to LLMs as "stochastic", for a given input an LLM should always generate the same output; the "hallucinations", or variance, come from the sampling mechanism outside of the LLM, which we can dampen significantly by setting a very low temperature value (see the sketch after this list).
  • Based on our random user tests with ChatGPT, we would expect both the original prompt and the re-worded version to produce an incorrect answer about 50% of the time; in other words, without further disambiguation or context, we would not expect one prompt to perform better than the other.
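
To make the first hypothesis concrete, here is a minimal sketch of temperature-scaled sampling. This is an illustration, not OpenAI's actual (non-public) implementation: the model emits logits, the sampler divides them by the temperature and applies softmax, so as the temperature approaches zero the distribution collapses onto the highest-logit token.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Temperature-scaled softmax sampling; as temperature -> 0, this approaches argmax."""
    scaled = logits / max(temperature, 1e-8)  # guard against division by zero
    probs = np.exp(scaled - scaled.max())     # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(42)
logits = np.array([2.0, 1.0, 0.5])
print(sample_token(logits, temperature=0.1, rng=rng))  # almost always returns index 0
```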

For our experiment design, we perform the following:

  • We conduct a number of experiments, starting with the original prompt, followed by a series of "interventions"
  • For each experiment/intervention, we execute 1,000 trials
  • We use OpenAI's most advanced GPT-4o model
  • We set the temperature to 0.1 to essentially eliminate the randomness due to sampling; we experiment with both a random seed and a fixed seed
  • To gauge the "confidence" of the answer, we obtain the log probability in each trial and convert it to a linear probability; we then plot the Kernel Density Estimate (KDE) of the linear probabilities across the 1,000 trials for each experiment (a sketch of a single trial follows this list)
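
The sketch below shows how a single trial could look, assuming the official openai Python SDK (v1+); the helper name run_trial is our own, and isolating the answer's probability from the rest of the completion is simplified here to summing the token log probabilities.

```python
import math
from openai import OpenAI  # official OpenAI Python SDK, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_trial(system_prompt: str, user_prompt: str, seed: int | None = None) -> tuple[str, float]:
    """Run one trial; return the answer text and its linear probability ("confidence")."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,  # dampen sampling randomness
        seed=seed,        # None for a random seed; an int for a (best-effort) fixed seed
        logprobs=True,    # request per-token log probabilities
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    choice = response.choices[0]
    # Sum the token log probabilities over the completion, then exponentiate
    # to obtain a linear probability in [0, 1].
    total_logprob = sum(t.logprob for t in choice.logprobs.content)
    return choice.message.content, math.exp(total_logprob)
```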

The full code for our experimental design is available here.

The user prompt is set to: "9.11 or 9.9 — which one is greater?".
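
A minimal driver for the 1,000 trials, using the run_trial sketch from above, could look as follows; the substring check for correctness and the generic system prompt are simplifications of our own.

```python
import seaborn as sns
import matplotlib.pyplot as plt

confidences, n_correct = [], 0
for _ in range(1_000):
    answer, confidence = run_trial(
        system_prompt="You are a helpful assistant.",
        user_prompt="9.11 or 9.9 — which one is greater?",
    )
    confidences.append(confidence * 100)  # express as 0-100%
    n_correct += "9.9" in answer          # crude correctness check

print(f"accuracy: {n_correct / 1_000:.0%}")
sns.kdeplot(confidences)  # smoothed histogram of the confidence values
plt.xlabel("confidence (%)")
plt.show()
```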

Consistent with what social media users have reported, GPT-4o gives the correct answer only 55% of the time ☹️. The model is also not very certain: over a large number of trials, its "confidence" in the answer is ~80%.

Figure 1 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is used; image by the author

For the re-worded user prompt, no additional context/disambiguation is provided; the wording is merely changed to: "Which one is greater, 9.11 or 9.9?"

Amazingly, and contrary to our ChatGPT user tests, the correct answer is reached 100% of the time across the 1,000 trials. Furthermore, the model exhibits very high confidence in its answer 🤔.

Figure 2 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is slightly re-worded; image by the author

There has been significant work recently on inducing improved "reasoning" capabilities in LLMs, with chain-of-thought (CoT) prompting being the most popular approach. Huang et al. have published a very comprehensive survey of LLM reasoning capabilities.

We therefore modify the original user prompt by also telling the LLM to explain its reasoning. Interestingly enough, the probability of a correct answer improves to 62%; however, the answers come with even greater uncertainty.

Figure 3 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is modified to also "explain its reasoning"; image by the author

The final experiment is the same as the previous one ("explain your reasoning"), except that we instead bootstrap the system prompt with that instruction. Incredibly, we now see the correct answer 100% of the time, with very high confidence. We see similar results if we use the re-worded user prompt as well. The sketch below contrasts the two placements of the instruction.
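
In terms of the run_trial sketch from above, the two interventions differ only in where the instruction is placed; the exact wording is an assumption on our part.

```python
# Intervention in the user prompt (62% accuracy in our runs):
answer_c, confidence_c = run_trial(
    system_prompt="You are a helpful assistant.",
    user_prompt="9.11 or 9.9 — which one is greater? Explain your reasoning.",
)

# The same instruction moved into the system prompt (100% accuracy in our runs):
answer_d, confidence_d = run_trial(
    system_prompt="You are a helpful assistant. Always explain your reasoning.",
    user_prompt="9.11 or 9.9 — which one is greater?",
)
```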

Figure 4 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, with the original user prompt and the system prompt amended with instructions to "explain its reasoning"; image by the author

What started off as a simple experiment to validate some of the statements seen on social media ended up producing some very interesting findings. Let's summarize the key takeaways:

  • For an identical prompt, with both the temperature set very low (essentially eliminating sampling uncertainty) and a fixed seed value, we still see very large variance in log probabilities. Slight variance can be explained by hardware precision, but variance this large is very difficult to explain. It indicates that either (1) the sampling mechanism is a LOT more complicated than advertised, or (2) there are additional layers/models upstream, beyond our control.
  • Consistent with previous literature, simply instructing the LLM to "explain its reasoning" improves its performance.
  • There is clearly distinct handling of the system prompt versus the user prompt: bootstrapping a role or instruction in the system prompt, rather than the user prompt, appears to result in significantly better performance.
  • We can clearly see how brittle prompts can be. The key takeaway is that we should always aim to provide disambiguation and clear context in our prompts.