In Templated Text Summaries From Data Using ChatGPT, I had a quick play seeing if ChatGPT could interpret a CSV data set as data and then generate some simple templated text reports around it.
Grabbing a coffee for another play, this surprised me. In a new thread, I opened with:
Treat the following as a tab separated dataset

Bahrain 20 Mar 2022 Charles Leclerc FERRARI 57 1:37:33.584
Saudi Arabia 27 Mar 2022 Max Verstappen RED BULL RACING RBPT 50 1:24:19.293
Australia 10 Apr 2022 Charles Leclerc FERRARI 58 1:27:46.548
Emilia Romagna 24 Apr 2022 Max Verstappen RED BULL RACING RBPT 63 1:32:07.986
Miami 08 May 2022 Max Verstappen RED BULL RACING RBPT 57 1:34:24.258
Spain 22 May 2022 Max Verstappen RED BULL RACING RBPT 66 1:37:20.475
Monaco 29 May 2022 Sergio Perez RED BULL RACING RBPT 64 1:56:30.265
Azerbaijan 12 Jun 2022 Max Verstappen RED BULL RACING RBPT 51 1:34:05.941
Canada 19 Jun 2022 Max Verstappen RED BULL RACING RBPT 70 1:36:21.757
Great Britain 03 Jul 2022 Carlos Sainz FERRARI 52 2:17:50.311
Austria 10 Jul 2022 Charles Leclerc FERRARI 71 1:24:24.312
France 24 Jul 2022 Max Verstappen RED BULL RACING RBPT 53 1:30:02.112
Hungary 31 Jul 2022 Max Verstappen RED BULL RACING RBPT 70 1:39:35.912
Belgium 28 Aug 2022 Max Verstappen RED BULL RACING RBPT 44 1:25:52.894
Netherlands 04 Sep 2022 Max Verstappen RED BULL RACING RBPT 72 1:36:42.773
Italy 11 Sep 2022 Max Verstappen RED BULL RACING RBPT 53 1:20:27.511
Singapore 02 Oct 2022 Sergio Perez RED BULL RACING RBPT 59 2:02:20.238
Japan 09 Oct 2022 Max Verstappen RED BULL RACING RBPT 28 3:01:44.004
United States 23 Oct 2022 Max Verstappen RED BULL RACING RBPT 56 1:42:11.687
Mexico 30 Oct 2022 Max Verstappen RED BULL RACING RBPT 71 1:38:36.729
Brazil 13 Nov 2022 George Russell MERCEDES 71 1:38:34.044
Abu Dhabi 20 Nov 2022 Max Verstappen RED BULL RACING RBPT 58 1:27:45.914
And got this quite surprising reply…
Without an explicit prompt, it seems to have interpreted the data, told me what it relates to, and provided a brief summary of some key features in the data. Recall also that the model does not have access to data from 2022, other than what I provided in the prompt.
(At this point, I wonder if I should have prompted ChatGPT to display the data as a tabular data set? Might that have helped its analysis?)
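For reference, a deterministic parser makes short work of the same text. A minimal sketch using Python's csv module, with only the first four rows of the prompt data shown for brevity (the column names are my own guesses from context; the prompt supplied none):

```python
import csv
from io import StringIO

# First four rows of the prompt data, tab separated (full set omitted for brevity).
raw = (
    "Bahrain\t20 Mar 2022\tCharles Leclerc\tFERRARI\t57\t1:37:33.584\n"
    "Saudi Arabia\t27 Mar 2022\tMax Verstappen\tRED BULL RACING RBPT\t50\t1:24:19.293\n"
    "Australia\t10 Apr 2022\tCharles Leclerc\tFERRARI\t58\t1:27:46.548\n"
    "Emilia Romagna\t24 Apr 2022\tMax Verstappen\tRED BULL RACING RBPT\t63\t1:32:07.986\n"
)

# Guessed column names, not part of the original prompt.
fields = ["grand_prix", "date", "winner", "car", "laps", "time"]

rows = [dict(zip(fields, r)) for r in csv.reader(StringIO(raw), delimiter="\t")]
print(rows[0]["winner"])  # Charles Leclerc
```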
I then asked a very poor question:
Why are those other drivers notable?
Me to ChatGPT
(What I should have prompted was something more like: “explain why you said that Sergio Perez, Carlos Sainz and George Russell were notable”.)
I tried to recover the initiative:
You said the drivers were notable. Why did you say that?
Me to ChatGPT
So how good’s the counting…?
Which team was third in terms of numbers of race wins and how many wins did they get?
Me to ChatGPT
Not very good… it went downhill from there…
And then got worse…
And worse…
And worse…
And then it got to lunch time and ChatGPT lunched out…
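For the record, a deterministic count over the constructor column of the data above gives Red Bull 17 wins, Ferrari 4 and Mercedes 1, so the answer I was fishing for was Mercedes, with a single win. A quick sanity check using Python's `Counter`:

```python
from collections import Counter

# Winning constructor for each of the 22 races in the prompt, in race order.
teams = [
    "FERRARI", "RED BULL RACING RBPT", "FERRARI", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT".replace("RBPT", "RBPT"),
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "FERRARI", "FERRARI",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "MERCEDES", "RED BULL RACING RBPT",
]
teams[6] = "RED BULL RACING RBPT"  # normalise the Monaco entry

wins = Counter(teams).most_common()
print(wins)  # [('RED BULL RACING RBPT', 17), ('FERRARI', 4), ('MERCEDES', 1)]
```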
PS An example of rendering the data as a tabular data set…
My next prompt would have been something like “Each row in that data table corresponds to a race win. According to that data, how many race wins did Ferrari have?” but it just keeps timing out again…
PS In another session, I asked it to display the first, third and fourth columns as a tabular dataset in the style of a CSV file:
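Doing that column selection deterministically is trivial; a sketch with the csv module, again showing only the first few rows of the prompt data:

```python
import csv
from io import StringIO

raw = (
    "Bahrain\t20 Mar 2022\tCharles Leclerc\tFERRARI\t57\t1:37:33.584\n"
    "Saudi Arabia\t27 Mar 2022\tMax Verstappen\tRED BULL RACING RBPT\t50\t1:24:19.293\n"
    "Australia\t10 Apr 2022\tCharles Leclerc\tFERRARI\t58\t1:27:46.548\n"
)

out = StringIO()
writer = csv.writer(out)
for row in csv.reader(StringIO(raw), delimiter="\t"):
    # Keep the first, third and fourth columns: race, winning driver, team.
    writer.writerow([row[0], row[2], row[3]])

print(out.getvalue())
```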
It struggles with telling me how many times Ferrari appears in the dataset, so I try to nudge it along the way of understanding…
Hmmm.. let’s see if we can help it a bit more…
Does that help?
What has it got against Ferrari having won in round 11 (Austria)?
As it stands, I don’t think we can trust it to interpret a dataset we have provided it with. Hmmm.. I wonder…
It was actually 17, but can we get ChatGPT to count the wins out a line at a time…
And when applied to the whole dataset?
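Expressed in code, the same line-at-a-time tallying strategy over the whole dataset confirms the counts: Red Bull appear 17 times in the team column and Ferrari 4. A sketch over the team column from the prompt data:

```python
# Emulate the "count it out a line at a time" strategy: walk the rows,
# announcing a running tally for the team of interest.
teams = [
    "FERRARI", "RED BULL RACING RBPT", "FERRARI", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "FERRARI", "FERRARI",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "RED BULL RACING RBPT", "RED BULL RACING RBPT", "RED BULL RACING RBPT",
    "MERCEDES", "RED BULL RACING RBPT",
]

def running_count(rows, target):
    """Count occurrences of target, printing the tally row by row."""
    tally = 0
    for i, team in enumerate(rows, start=1):
        if team == target:
            tally += 1
        print(f"Row {i}: running count for {target} = {tally}")
    return tally

red_bull = running_count(teams, "RED BULL RACING RBPT")  # 17
ferrari = running_count(teams, "FERRARI")                # 4
```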
So is that handy as a prompt in its own right? Maybe not; ChatGPT appears to prefer the original CSV data set that it struggles to understand.
So what does it think is in the thirteenth row?
How does it count that?
Let’s try again…
Would it be more reliable if we addressed each row explicitly by a unique key value?
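Keying each row explicitly is straightforward in code, which suggests what the prompt might ask for. A minimal sketch, keying rows by a one-based round number (only the first three races shown):

```python
# First three races from the prompt data, as (race, winner, team) tuples.
rows = [
    ("Bahrain", "Charles Leclerc", "FERRARI"),
    ("Saudi Arabia", "Max Verstappen", "RED BULL RACING RBPT"),
    ("Australia", "Charles Leclerc", "FERRARI"),
]

# Key each row by its (1-based) round number so it can be addressed unambiguously.
keyed = {round_: row for round_, row in enumerate(rows, start=1)}
print(keyed[3])  # ('Australia', 'Charles Leclerc', 'FERRARI')
```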
I wonder if we can also improve reliability by generating derived datasets (like the templated output dataset) and then working with those derived datasets. This would be akin to setting up a data cleaning pipeline and then working with the cleaned data, though we would have to be careful to check that the dataset had been cleaned correctly, and to be unambiguous about which dataset we wanted ChatGPT to work with at any particular step.
PS To try to improve matters, I wondered: Can We Get ChatGPT to Act Like a Relational Database And Respond to SQL Queries on Provided Datasets and pandas dataframes?
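By way of contrast, an actual relational database has no trouble with this sort of query. A sketch loading the (race, winning team) pairs from the prompt data into an in-memory SQLite database and asking the wins-per-team question in SQL:

```python
import sqlite3

# (race, winning team) for the 22 races in the prompt, in race order.
rows = [
    ("Bahrain", "FERRARI"), ("Saudi Arabia", "RED BULL RACING RBPT"),
    ("Australia", "FERRARI"), ("Emilia Romagna", "RED BULL RACING RBPT"),
    ("Miami", "RED BULL RACING RBPT"), ("Spain", "RED BULL RACING RBPT"),
    ("Monaco", "RED BULL RACING RBPT"), ("Azerbaijan", "RED BULL RACING RBPT"),
    ("Canada", "RED BULL RACING RBPT"), ("Great Britain", "FERRARI"),
    ("Austria", "FERRARI"), ("France", "RED BULL RACING RBPT"),
    ("Hungary", "RED BULL RACING RBPT"), ("Belgium", "RED BULL RACING RBPT"),
    ("Netherlands", "RED BULL RACING RBPT"), ("Italy", "RED BULL RACING RBPT"),
    ("Singapore", "RED BULL RACING RBPT"), ("Japan", "RED BULL RACING RBPT"),
    ("United States", "RED BULL RACING RBPT"), ("Mexico", "RED BULL RACING RBPT"),
    ("Brazil", "MERCEDES"), ("Abu Dhabi", "RED BULL RACING RBPT"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE wins (race TEXT, team TEXT)")
con.executemany("INSERT INTO wins VALUES (?, ?)", rows)

result = con.execute(
    "SELECT team, COUNT(*) AS n FROM wins GROUP BY team ORDER BY n DESC"
).fetchall()
print(result)  # [('RED BULL RACING RBPT', 17), ('FERRARI', 4), ('MERCEDES', 1)]
```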