
9.11 or 9.9 — which one is greater? | by Armin Catovic | Jul, 2024

The ChatGPT prompt "9.11 or 9.9 — which one is greater?", together with its (incorrect) response, was recently shared and re-posted on LinkedIn numerous times. It was given as solid proof that AGI is just not there yet. Further re-posts also pointed out that re-arranging the prompt to "Which one is greater: 9.11 or 9.9?" ensures a correct answer, further emphasizing the brittleness of LLMs.

After evaluating both prompts against a random group of ChatGPT users, we found that in both cases the answer is wrong about 50% of the time. As some users have correctly pointed out, there is a subtle ambiguity in the question: are we referring to the mathematical inequality of two real numbers, to two dates (e.g. September 11 vs September 9), or to two sub-sections in a document (e.g. chapter 9.11 vs chapter 9.9)?

We decided to perform a more controlled experiment using the OpenAI APIs. This way we have full control over both the system prompt and the user prompt, and we can take sampling uncertainty out of the equation as far as possible, e.g. by setting the temperature low.

The final results are very interesting!

Our hypotheses can be stated as follows:

  • Given the same prompt, without any additional context, and with the temperature kept close to zero, we should nearly always obtain the same output, with stable log probabilities. While people refer to LLMs as "stochastic", for a given input an LLM should always generate the same output; the "hallucinations", or variance, come from the sampling mechanism outside of the LLM, which we can dampen significantly by setting a very low temperature value (see the sketch after this list).
  • Based on our random user tests with ChatGPT, we would expect both the original prompt and the re-worded version to produce an incorrect answer about 50% of the time; in other words, without further disambiguation or context, we would not expect one prompt to perform better than the other.
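
To make the first hypothesis concrete, here is a minimal sketch of temperature-scaled sampling. This is an illustration, not OpenAI's actual (non-public) implementation: the model emits logits, the sampler divides them by the temperature and applies softmax, so as the temperature approaches zero the distribution collapses onto the highest-logit token.

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Temperature-scaled softmax sampling; as temperature -> 0, this approaches argmax."""
    scaled = logits / max(temperature, 1e-8)  # guard against division by zero
    probs = np.exp(scaled - scaled.max())     # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(42)
logits = np.array([2.0, 1.0, 0.5])
print(sample_token(logits, temperature=0.1, rng=rng))  # almost always returns index 0
```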

For our experiment design, we perform the following:

  • We conduct a number of experiments, starting with the original prompt, followed by a series of "interventions"
  • For each experiment/intervention, we execute 1,000 trials
  • We use OpenAI's most advanced GPT-4o model
  • We set the temperature to 0.1 to essentially eliminate the randomness due to sampling; we experiment with both a random seed and a fixed seed
  • To gauge the "confidence" of the answer, we obtain the log probability in each trial and convert it to a linear probability; we then plot the Kernel Density Estimate (KDE) of the linear probabilities across the 1,000 trials for each experiment (a sketch of a single trial follows this list)
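
The sketch below shows how a single trial could look, assuming the official openai Python SDK (v1+); the helper name run_trial is our own, and isolating the answer's probability from the rest of the completion is simplified here to summing the token log probabilities.

```python
import math
from openai import OpenAI  # official OpenAI Python SDK, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_trial(system_prompt: str, user_prompt: str, seed: int | None = None) -> tuple[str, float]:
    """Run one trial; return the answer text and its linear probability ("confidence")."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,  # dampen sampling randomness
        seed=seed,        # None for a random seed; an int for a (best-effort) fixed seed
        logprobs=True,    # request per-token log probabilities
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    choice = response.choices[0]
    # Sum the token log probabilities over the completion, then exponentiate
    # to obtain a linear probability in [0, 1].
    total_logprob = sum(t.logprob for t in choice.logprobs.content)
    return choice.message.content, math.exp(total_logprob)
```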

The full code for our experimental design is available here.

The user prompt is set to: "9.11 or 9.9 — which one is greater?".
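
A minimal driver for the 1,000 trials, using the run_trial sketch from above, could look as follows; the substring check for correctness and the generic system prompt are simplifications of our own.

```python
import seaborn as sns
import matplotlib.pyplot as plt

confidences, n_correct = [], 0
for _ in range(1_000):
    answer, confidence = run_trial(
        system_prompt="You are a helpful assistant.",
        user_prompt="9.11 or 9.9 — which one is greater?",
    )
    confidences.append(confidence * 100)  # express as 0-100%
    n_correct += "9.9" in answer          # crude correctness check

print(f"accuracy: {n_correct / 1_000:.0%}")
sns.kdeplot(confidences)  # smoothed histogram of the confidence values
plt.xlabel("confidence (%)")
plt.show()
```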

Consistent with what social media users have reported, GPT-4o gives the correct answer only 55% of the time ☹️. The model is also not very certain: over a large number of trials, its "confidence" in the answer is ~80%.

Figure 1 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is used; image by the author

For the re-worded user prompt, no additional context/disambiguation is provided; the wording is merely changed to: "Which one is greater, 9.11 or 9.9?"

Amazingly, and contrary to our ChatGPT user tests, the correct answer is reached 100% of the time across the 1,000 trials. Furthermore, the model exhibits very high confidence in its answer 🤔.

Figure 2 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is slightly re-worded; image by the author

There has been significant work recently on inducing improved "reasoning" capabilities in LLMs, with chain-of-thought (CoT) prompting being the most popular approach. Huang et al. have published a very comprehensive survey of LLM reasoning capabilities.

We therefore modify the original user prompt by also telling the LLM to explain its reasoning. Interestingly enough, the probability of a correct answer improves to 62%; however, the answers come with even greater uncertainty.

Figure 3 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, when the original user prompt is modified to also "explain its reasoning"; image by the author

The final experiment is the same as the previous one ("explain your reasoning"), except that we instead bootstrap the system prompt with that instruction. Incredibly, we now see the correct answer 100% of the time, with very high confidence. We see similar results if we use the re-worded user prompt as well. The sketch below contrasts the two placements of the instruction.
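
In terms of the run_trial sketch from above, the two interventions differ only in where the instruction is placed; the exact wording is an assumption on our part.

```python
# Intervention in the user prompt (62% accuracy in our runs):
answer_c, confidence_c = run_trial(
    system_prompt="You are a helpful assistant.",
    user_prompt="9.11 or 9.9 — which one is greater? Explain your reasoning.",
)

# The same instruction moved into the system prompt (100% accuracy in our runs):
answer_d, confidence_d = run_trial(
    system_prompt="You are a helpful assistant. Always explain your reasoning.",
    user_prompt="9.11 or 9.9 — which one is greater?",
)
```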

Figure 4 — Smoothed histogram (KDE) of confidence values (0–100%) across 1,000 trials, with the original user prompt and the system prompt amended with instructions to "explain its reasoning"; image by the author

What started off as a simple experiment to validate some of the statements seen on social media ended up producing some very interesting findings. Let's summarize the key takeaways:

  • For an identical prompt, with both the temperature set very low (essentially eliminating sampling uncertainty) and a fixed seed value, we still see very large variance in log probabilities. Slight variance can be explained by hardware precision, but variance this large is very difficult to explain. It indicates that either (1) the sampling mechanism is a LOT more complicated than advertised, or (2) there are additional layers/models upstream, beyond our control.
  • Consistent with previous literature, simply instructing the LLM to "explain its reasoning" improves its performance.
  • There is clearly distinct handling of the system prompt versus the user prompt: bootstrapping a role or instruction in the system prompt, rather than the user prompt, appears to result in significantly better performance.
  • We can clearly see how brittle prompts can be. The key takeaway is that we should always aim to provide disambiguation and clear context in our prompts.