Analysing data using ChatGPT
Author: Etienne Oosthuysen
In my previous blog, How will ChatGPT change data analytics, I discuss the potential disruption generative AI, such as ChatGPT, will bring to data analytics. In summary, these impacts are:
- Generative AI models such as ChatGPT may move data analytics even further away from traditional reporting as users may increasingly rely on simply asking the model to interpret large and complex data and the patterns within.
- Data engineers and data scientists will increasingly rely on Generative AI to expedite the code and models required within data analytics workloads. It may also put these fields in the hands of people without formal training as data engineers, scientists or statisticians.
- Hypothesis and sandbox environments will be easier to spin up due to the ease with which synthetic data can be produced.
In this short article, I show how ChatGPT can be used to access, analyse and interpret data.
Update 16 March 2023 – Microsoft announced Azure OpenAI, which gives users access to GPT-3 and DALL-E 2 inside their tenancies. More on those soon in a dedicated article!
Update 16 March 2023 – it seems that some functionality mentioned here, notably the ability to access data via a public URL, has been disabled.
Let’s analyse data with ChatGPT
I will start with some simple examples and then ask ChatGPT to do something more complicated, discuss what this means for corporate and organisational data, and then conclude with some key observations.
Simple example using a global video game sales dataset
For this example, I used a video games sales dataset from Kaggle – Global Video Game Sales | Kaggle.
I initially tried to paste the whole dataset into ChatGPT for analysis but found a limitation. I unsuccessfully tried 16,000, 1,000, and 200 records and eventually got going with 100 records. Clearly, 100 records are not a representative sample of the original 16,000.
So, I instead added the URL to the dataset containing all the records, and, voila!
Here is a snippet of the 16.6K dataset
Now let’s put ChatGPT through its paces with some simple questions.
Given that ChatGPT made some suggestions of questions I could ask of the data, here goes –
Let’s try visualising the data.
Here a small issue occurred whereby ChatGPT stated that a chart was created, but there was an error.
I have not yet been able to figure out the issue; it might be that imgur.com has blocked ChatGPT. Still, some explanatory analysis was provided:
Asking the same question as a table gave a perfect result set:
Something a little bit more complicated
For this example, I used an Olympic Games dataset from Kaggle – Olympics 124 years Dataset(till 2020) | Kaggle
Firstly, a simple question to find out more about the dataset and its metadata.
Now let’s do something more complicated:
What does this mean for corporate and organisational data and how would one go about operationalising ChatGPT in an organisational data analytics environment?
ChatGPT is clearly not going to reach behind the firewall and access data that should not be accessed. And it is not clear what OpenAI (and Microsoft) or Google, with its version of ChatGPT, Bard AI, have in store for these generative AI engines. Still, one would assume that integration into a corporate environment is at least being considered. Until such time, ChatGPT can be used right now to expedite the code that delivers the analysis above.
Let’s use the final question, “Can you compare the performance of male and female athletes, looking at the number of medals won by each gender, as well as any trends or differences in performance over time?” as an example.
Upon considering the usefulness of the analysis, I simply requested to see the code used, and ChatGPT obliged 🙂
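As a rough illustration only (this is not the code ChatGPT returned), a comparison of medals by gender over time could be sketched in Python along the following lines. The column names (Sex, Year, Medal), the "NA" marker for "no medal" and the handful of sample rows are all assumptions made for the example; the real Kaggle file would be read from disk instead.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for the Olympic dataset.
SAMPLE_CSV = """Name,Sex,Year,Medal
Athlete A,M,1996,Gold
Athlete B,F,1996,Silver
Athlete C,F,1996,NA
Athlete D,M,2000,Bronze
Athlete E,F,2000,Gold
Athlete F,F,2000,Gold
"""


def medals_by_year_and_sex(rows):
    """Count medal-winning rows per (year, sex) pair."""
    counts = Counter()
    for row in rows:
        # Assumption: rows without a medal carry "NA" or an empty string.
        if row["Medal"] not in ("", "NA"):
            counts[(int(row["Year"]), row["Sex"])] += 1
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = medals_by_year_and_sex(rows)

# Print one line per Games year so a trend over time is easy to read.
for year in sorted({y for y, _ in counts}):
    print(year, "F:", counts[(year, "F")], "M:", counts[(year, "M")])
```

Swapping the inline sample for the full file should only require replacing the io.StringIO source with an opened CSV file, provided the column names match.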
Operational steps could therefore be:
1) Create an obfuscated version of the dataset you wish to analyse.
2) Let ChatGPT loose on it to explore and test hypotheses.
3) Identify an analysis worth operationalising.
4) Obtain and use the code within more formal data engineering workloads.
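Step 1 above could be approached in many ways. One minimal sketch, assuming the sensitive values sit in known columns, is to replace them with salted hash tokens before sharing the data. The function names, column names and salt below are hypothetical, and this is pseudonymisation rather than full anonymisation, so it should not be treated as a complete privacy control.

```python
import hashlib


def pseudonymise(value, salt):
    """Replace a value with a stable token derived from a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "id_" + digest[:12]


def obfuscate_rows(rows, sensitive_columns, salt):
    """Return copies of the rows with the sensitive columns tokenised."""
    return [
        {
            key: pseudonymise(val, salt) if key in sensitive_columns else val
            for key, val in row.items()
        }
        for row in rows
    ]


# Hypothetical records; real data would come from the source system.
original = [
    {"customer": "Jane Smith", "region": "SA", "spend": "120"},
    {"customer": "Jane Smith", "region": "SA", "spend": "80"},
    {"customer": "Ravi Patel", "region": "VIC", "spend": "200"},
]

# The salt must be kept secret and outside the shared dataset.
obfuscated = obfuscate_rows(original, {"customer"}, salt="keep-this-secret")
for row in obfuscated:
    print(row)
```

Because the same input always maps to the same token, grouping and counting by customer still works on the obfuscated copy, while the names themselves are not exposed.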
ChatGPT is impressive in terms of how easily and accurately it analysed the datasets I used. But it is still unclear how it would deal with complex datasets containing multiple entities in a relationship with one another. So, for now, a final flat data structure would have to be engineered upstream.
Plus, operationalising the logic will require a multi-step process, although, as stated, bringing these generative AI models into the corporate environment seems like a logical future step.
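To make the "flat data structure" point concrete, here is a small sketch of joining two related entities into one flat table before handing it over for analysis. The entities (athletes and results), their fields and the flatten function are hypothetical; in practice this step would more likely be done in SQL or a data engineering tool upstream.

```python
# Hypothetical related entities: one row per athlete, many results per athlete.
athletes = [
    {"athlete_id": 1, "name": "Athlete A", "sex": "F"},
    {"athlete_id": 2, "name": "Athlete B", "sex": "M"},
]
results = [
    {"athlete_id": 1, "year": 2000, "medal": "Gold"},
    {"athlete_id": 1, "year": 2004, "medal": "Silver"},
    {"athlete_id": 2, "year": 2004, "medal": "Bronze"},
]


def flatten(results, athletes):
    """Join each result to its athlete, producing one flat row per result."""
    by_id = {a["athlete_id"]: a for a in athletes}
    flat = []
    for result in results:
        # A result with no matching athlete keeps only its own fields.
        athlete = by_id.get(result["athlete_id"], {})
        flat.append({**athlete, **result})
    return flat


flat_rows = flatten(results, athletes)
for row in flat_rows:
    print(row)
```

Each output row carries both the athlete attributes and the result attributes, which is the single-table shape the examples in this article relied on.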
It must be emphasised that I have not yet delved into the security aspects of generative AI, including ChatGPT.
I will next delve into some additional functionality and lift the lid further on ChatGPT and generative AI; stay tuned!
This article was originally published here: https://www.makingmeaning.info/post/analyzing-data-using-chatgpt