GPT-3 AI Model Comparison With Actual Results

Ben Hultin
7 min read · Mar 13, 2023
Photo by DeepMind on Unsplash

OpenAI provides a number of models for developers to choose from when building their own projects. Here we will compare these models on cost, speed, and quality of results. While the OpenAI documentation compares the models and gives a reasonable breakdown of cost, we will be looking at actual examples to see which model is better suited to which scenarios.

OpenAI keeps expanding the set of models it provides for developer use. At the moment OpenAI has released GPT-3.5 with a total of 12 natural language models. In this article we will not be discussing the models designed for images, speech recognition, or moderation; we will focus on comparing the natural language models. To be more specific, I will be focusing on 5 language models and comparing each of their responses to the same prompts.

The first step is to put together a basic script that runs example prompts against each model and records the outcomes.

Basic Python script to test the different models
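
For reference, a minimal sketch of such a script might look like the following, written against the legacy (pre-1.0) openai Python library and its Completion endpoint. The model IDs, prompts, and parameters shown here are illustrative assumptions rather than the exact script used.

```python
# Minimal sketch: run the same prompts against several GPT-3 models and
# print each response along with its total token usage.
# Written for the legacy (pre-1.0) openai library's Completion endpoint;
# model IDs, prompts, and parameters are assumptions for illustration.
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

MODELS = [
    "text-davinci-003",  # Davinci 3
    "text-davinci-002",  # Davinci 2
    "text-curie-001",    # Curie
    "text-babbage-001",  # Babbage
    "text-ada-001",      # Ada
]

PROMPTS = [
    "How big is the moon",
    "What is the diameter of the moon",
    "What is the moon made of",
    "Which cheese is the moon made of",
    "How short is the moon",
]

for prompt in PROMPTS:
    print(f"\n=== Prompt: {prompt} ===")
    for model in MODELS:
        response = openai.Completion.create(
            model=model,
            prompt=prompt,
            max_tokens=64,
            temperature=0.7,
        )
        text = response.choices[0].text.strip()
        tokens = response.usage.total_tokens
        print(f"[{model}] ({tokens} tokens): {text}")
```

Running this once per prompt produces one completion per model, which is then scored by hand and tallied in the spreadsheet below.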

A breakdown of all the results with token cost, responses, score, and my thoughts on each response can be found in the following spreadsheet. To explain what the spreadsheet covers, I will go into a little detail on cost and score; the other columns are self-explanatory.

Cost: When it comes to using an AI model's API, there is a charge for each use, measured in tokens. What does a token equate to, you may ask? As a rough estimate, 1 token is about 4 characters, or roughly 0.75 words. Each model has its own pricing, which you can read about here:
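
As a quick back-of-the-envelope sketch (my own, not part of the test script), that rule of thumb can be turned into a rough estimate; for exact counts, OpenAI's tiktoken library does the real tokenization.

```python
def estimate_tokens(text: str) -> float:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) / 4

# "What is the diameter of the moon" is 32 characters, so roughly 8 tokens.
print(estimate_tokens("What is the diameter of the moon"))  # -> 8.0
```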

Score: This scoring system is of my own design, and points are awarded at my discretion, which may be influenced by time of day and caffeine intake. This helps ensure the score remains scientific and devoid of any bias. The scale goes from 0 to 2, with (0) being a complete fail, (1) partially correct, and (2) answering the question with over 90% accuracy (another scientific measure, I might add).

Below is a summary of each prompt and the overall responses, with total scores and token costs. While some of the prompts may seem odd or silly, they each serve a purpose in testing different aspects of the AI.

Prompts

“How big is the moon”

The first example prompt I will be comparing is "How big is the moon". This is a fairly open-ended question, as it does not ask for a specific type of measurement in return. So the first step for the AI model is to decide what counts as a good way to determine "big". Testing across the various models showed some considerable differences as the models went from Davinci (most intelligent, slow, and expensive) to Ada (fastest, cheapest, least intelligent).

Davinci provided the most intelligent and accurate responses, while the other models struggled to produce anything even remotely resembling a decent answer. Babbage at one point said the Moon is the size of a tennis ball (not joking), while Ada simply left the measurement out of its response.

Totals:

Token cost — 280

Score — 12/30 = 40%

“What is the diameter of the moon”

The second prompt I asked the different models was more specific: "What is the diameter of the moon", which gives the AI a specific measurement to respond with. I did this to remove the vagueness of the previous question and see if the cheaper models would perform better.

This sadly did not help the overall outcome. Davinci continued to give fairly accurate responses, while the others gave poor ones.

Totals:

Token cost — 323

Score — 12/30 = 40%

“What is the moon made of”

This prompt surprisingly yielded the best responses overall; even the cheapest models were able to provide good answers. Hands down, this prompt was the easiest for all the models to handle. Some responses were fairly vague while others gave specifics.

Totals:

Token cost — 304

Score — 24/30 = 80%

“Which cheese is the moon made of”

This one may seem really silly to ask, but part of an AI's response comes from the prompt provided. My hypothesis was that the AI would be suggestible and give poor answers when asked a trick question. I was more lenient in scoring this prompt, since I was mainly testing whether the model would challenge the notion that the Moon is made of cheese. So if the response was along the lines of "this can't be answered", I considered it partially correct.

The responses on this one were mixed across the board; at times the cheaper models did better than the expensive ones. I should note that the cheaper models, across all my prompts, were more likely to respond with "this can't be answered", which would normally earn a 0 on more definitive, clear-cut questions. Their tendency to give these sorts of responses helped them out in this scenario.

As a little experiment, I also asked "Is the moon made of cheese?". This true/false question gave better results in the few attempts I made on different models. I believe this is because it clearly leaves open the possibility that the moon is not cheese at all, whereas the prompt about which kind of cheese states as fact that the moon is made of cheese. As I said before, this was to determine the AI's likelihood of either answering as if the moon really were made of cheese or correcting my assertion.

Totals:

Token cost — 294

Score — 14/30 = 46%

“How short is the moon”

This prompt was designed to confuse the AI with the word "short", as I am sure you also found this to be an odd question. I was curious whether the AI would understand "short" as a kind of measurement and return, for example, the Moon's diameter or radius. Some of the models did provide responses along these lines, but I also received a lot of responses about the Moon's orbit or its distance from Earth.

Somewhat to my surprise, this prompt yielded the worst responses. Part of me figured the prompt about which cheese the moon is made of would be worse, since it is known that AI output can be led into inaccuracies by what you ask or how you ask it.

I felt responses about the orbit were not in line with the context of the question, since an orbit is better described by the time it takes to complete or by its speed, which conflicts with "short" implying height or possibly width.

Totals:

Token cost — 291

Score — 9/30 = 30%

Model Comparison

The following is a summary of how the models compared to each other across the different prompts. Overall, token cost did not correlate strongly with quality: while there was a slight downward trend in cost, it was not predictable. What's more, below the Davinci models, accuracy drops off dramatically. You will see this laid out in more detail as we cover the outcomes for each model.

Davinci 3

The most intelligent of the bunch: its responses were almost always correct, with a 90% accuracy rate. The one prompt it performed worst on was "How short is the moon"; the others it nearly aced with full 100% accuracy.

Davinci 2

This one turned out to be the value proposition, with 86% accuracy and one of the lowest token totals of all the models compared. On the prompt "How short is the moon" it actually performed better than Davinci 3 while also costing fewer tokens.

Curie

Curie turned out to be the most expensive of the low-performing models (Curie, Babbage, and Ada). It did have the clearly better score among those three, at 33% accuracy, but it also cost about as much as Davinci 3. Unless speed matters more than accuracy and budget is not a consideration, this is a model to avoid.

Babbage

This model fared the worst accuracy-wise at 10%, and still cost a little more than Davinci 2. So far I am not seeing a good reason to use this model; a speed test and other prompts will be needed to determine what Babbage is for.

Ada

While Ada at times gave some really goofy answers, it still managed to outperform Babbage with 16% accuracy. But while it was better than the worst, its cost was not reasonable, coming out more expensive than Davinci 2.

Conclusion

In conclusion, the Davinci series gave the best results without incurring much additional cost, and it was frequently cheaper than the lower-performing models. I will do a follow-up test involving speed and other prompts to give the less accurate models a chance to prove themselves.

Reflecting on this investigation into AI output, where I validated the responses with a simple Google search: I would say stick with Google. Even the most accurate Davinci model still struggled at times, and Google provided more in-depth information without incurring a cost.
