Analysing data using ChatGPT
Author: Etienne Oosthuysen
In my previous blog, How will ChatGPT change data analytics, I discussed the potential disruption that generative AI, such as ChatGPT, will bring to data analytics. In summary, these impacts are:
- Generative AI models such as ChatGPT may move data analytics even further away from traditional reporting as users may increasingly rely on simply asking the model to interpret large and complex data and the patterns within.
- Data engineers and data scientists will increasingly rely on generative AI to expedite the code and models required within data analytics workloads. It may also put these fields in the hands of less formally trained data engineers, scientists and statisticians.
- Hypothesis and sandbox environments will be easier to spin up due to the ease with which synthetic data can be produced.
In this short article, I show how ChatGPT can be used to access, analyse and interpret data.
Update 16 March 2023 – Microsoft announced Azure OpenAI, which gives users access to GPT-3 and DALL-E 2 inside their tenancies. More on those soon in a dedicated article!
Update 16 March 2023 – it seems that some functionality mentioned here, notably the ability to access data via a public URL, has been disabled.
Let’s analyse data with ChatGPT
I will start with some simple examples and then ask ChatGPT to do something more complicated, discuss what this means for corporate and organisational data, and then conclude with some key observations.
Simple example using a global video game sales dataset
For this example, I used a video games sales dataset from Kaggle – Global Video Game Sales | Kaggle.
I initially tried to paste the whole dataset into ChatGPT for analysis but hit a size limitation. Attempts with 16,000, 1,000 and 200 records all failed, and I eventually got going with 100 records, which is clearly not a representative sample of the original 16,000.
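If pasting rows is the only option, a random sample is likely to be less skewed than the first 100 rows of a file, which may be ordered by sales rank. The following is a minimal sketch only: the inline rows, the column names and the helper name `sample_rows` are placeholders of mine, not part of the original experiment.

```python
import csv
import io
import random

# Stand-in for the downloaded CSV (hypothetical rows; column names assumed).
RAW = """Name,Genre,Global_Sales
Game A,Sports,82.7
Game B,Platform,40.2
Game C,Racing,35.8
Game D,Action,21.4
Game E,Puzzle,30.3
Game F,Shooter,14.6
"""


def sample_rows(csv_text, n, seed=42):
    """Return the header plus n randomly chosen data rows as CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    picked = rng.sample(data, min(n, len(data)))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(picked)
    return out.getvalue()


snippet = sample_rows(RAW, 3)
print(snippet)  # this text is what would be pasted into the chat window
```

In practice `n` would be set to whatever the chat window accepts, which was about 100 records in my case.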
Instead, I added the URL of the dataset containing all the records and, voilà!
Here is a snippet of the 16.6K dataset
Now let’s put ChatGPT through its paces with some simple questions.
ChatGPT Answer: Yes, I can help you analyze the dataset you mentioned. The dataset you referred to contains data on global video game sales from 1980 to 2016. It includes information on the game’s name, platform, year of release, genre, publisher, and sales in various regions of the world.
Here are some initial observations and insights from the dataset:
- The dataset contains 16,598 entries (video games) and 11 columns (variables).
- The platform variable contains information about the console or platform on which the game was released. There are 31 unique platforms in the dataset, with the most common ones being the PlayStation 2, Xbox 360, and PlayStation 3.
- The genre variable contains information about the game’s genre, with the most common ones being Action, Sports, and Misc (Miscellaneous).
- The publisher variable contains information about the company that published the game. The most common publishers in the dataset are Electronic Arts, Activision, and Namco Bandai Games.
- The dataset includes information on video game sales in various regions of the world, including North America, Europe, Japan, and other regions. The total sales column represents the total number of units sold worldwide.
- The dataset includes video games released from 1980 to 2016, with the majority of games released after the year 2000.
- The dataset has missing values in the year, genre, and publisher variables.
Here are some potential questions that could be explored with this dataset:
- What are the top-selling video games of all time, and how much revenue did they generate?
- How have video game sales changed over time, and what factors may have contributed to these changes?
- Which platforms and genres are most popular among gamers, and how do these preferences vary by region?
- Which publishers have the most successful video games, and are there any trends or patterns in their success?
- How well do video games sell in different regions of the world, and are there any notable differences in sales patterns across regions?
Overall, this dataset provides a valuable resource for analyzing trends and patterns in global video game sales, as well as understanding the preferences and behaviors of gamers worldwide.
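Figures such as the row count, the number of distinct platforms and the missing values can be cross-checked against a local copy of the file. Here is a small sketch using only the Python standard library. The inline rows are made up, and the column names and the "N/A" marker for missing values are my assumptions about the file layout.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for the Kaggle file (hypothetical rows, assumed column names).
RAW = """Name,Platform,Year,Genre,Publisher,Global_Sales
Game A,PS2,2004,Action,Publisher X,20.8
Game B,X360,2010,Shooter,Publisher Y,14.6
Game C,PS2,N/A,Sports,Publisher X,5.2
Game D,PS3,2013,Action,,21.4
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

MISSING = {"", "N/A"}  # assumed markers for a missing value

profile = {
    "rows": len(rows),
    "columns": len(rows[0]),
    "unique_platforms": len({r["Platform"] for r in rows}),
    # Most frequent platform and how many games it has
    "top_platform": Counter(r["Platform"] for r in rows).most_common(1)[0],
    # Count of missing values per column
    "missing": {col: sum(r[col] in MISSING for r in rows) for col in rows[0]},
}

for key, value in profile.items():
    print(key, value)
```

With the real file, `RAW` would be replaced by the contents of the downloaded CSV.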
Given that ChatGPT suggested some questions I could ask of the data, here goes:
ChatGPT Question: Which publishers have the most successful video games, and are there any trends or patterns in their success?
ChatGPT Answer: To explore which publishers have the most successful video games and any trends or patterns in their success, we can analyze the dataset by publisher and look at the number of games published and total sales for each publisher. Here are some observations based on the dataset:
- Electronic Arts (EA) is the publisher with the most games in the dataset, with 1351 games, followed by Activision with 975 games and Namco Bandai Games with 932 games.
- EA is also the publisher with the highest total sales in the dataset, with 1.12 billion units sold, followed by Nintendo with 932 million units and Activision with 721 million units.
- There are some publishers that have consistently performed well over time, such as EA and Nintendo, while others have seen more variability in their success. For example, SquareSoft (now Square Enix) had a high number of sales in the 1990s but has since declined in popularity.
- Some publishers have had success with certain franchises or series. For example, Nintendo has had tremendous success with its Mario and Pokemon franchises, while Activision has had success with its Call of Duty franchise.
- There is some variation in the types of games that different publishers produce, which may contribute to their success. For example, EA has had success with sports games such as FIFA and Madden, while Ubisoft has had success with action-adventure games such as Assassin’s Creed and Far Cry.
Overall, these observations suggest that there are differences in the success of different video game publishers, with some publishers consistently performing well over time and others experiencing more variability in their success. The type of games produced and the success of certain franchises or series may contribute to a publisher’s success.
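The two rankings in this answer, number of games and total units sold per publisher, come down to a simple group-and-aggregate. Here is a sketch with made-up publishers and sales figures (in millions of units), which also shows that the publisher with the most titles need not be the one with the highest sales.

```python
from collections import defaultdict

# Hypothetical (publisher, global sales in millions of units) pairs.
games = [
    ("Publisher X", 20.8),
    ("Publisher X", 5.2),
    ("Publisher Y", 14.6),
    ("Publisher Z", 30.0),
    ("Publisher Y", 1.4),
    ("Publisher X", 0.5),
]

# Accumulate the number of games and the total sales for each publisher.
stats = defaultdict(lambda: {"games": 0, "sales": 0.0})
for publisher, sales in games:
    stats[publisher]["games"] += 1
    stats[publisher]["sales"] += sales

# Rank publishers by each measure, highest first.
by_count = sorted(stats, key=lambda p: stats[p]["games"], reverse=True)
by_sales = sorted(stats, key=lambda p: stats[p]["sales"], reverse=True)

print("Most games:", by_count)
print("Highest sales:", by_sales)
```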
Let’s try visualising the data.
ChatGPT Question: Can you create a chart of Global_Sales by Genre?
Here a small issue occurred: ChatGPT stated that a chart had been created, but an error occurred.
I have not yet been able to work out the cause; it might be that imgur.com has blocked ChatGPT. Still, some explanatory analysis was provided:
ChatGPT Answer: As we can see from the chart, Action games have the highest global sales, followed by Sports and Shooter games. Role-Playing games and Platform games also have relatively high global sales, while Strategy, Adventure, and Puzzle games have lower global sales. It is important to note that this chart only shows the global sales for each genre and does not take into account any regional differences in sales patterns.
Asking the same question but requesting a table gave a perfect result set:
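When the chart does not render, the same aggregation can be run locally and printed as a table with a rough text chart beside it. This is a sketch with made-up rows. The column names `Genre` and `Global_Sales` are taken from the question above, and the scale of one `#` per two million units is an arbitrary choice of mine.

```python
from collections import defaultdict

# Hypothetical rows; only the two columns used in the question are included.
rows = [
    {"Genre": "Action", "Global_Sales": 21.4},
    {"Genre": "Sports", "Global_Sales": 5.2},
    {"Genre": "Shooter", "Global_Sales": 14.6},
    {"Genre": "Action", "Global_Sales": 20.8},
    {"Genre": "Sports", "Global_Sales": 3.0},
]

# Sum Global_Sales per genre.
totals = defaultdict(float)
for row in rows:
    totals[row["Genre"]] += row["Global_Sales"]

# Sort genres by total sales, highest first.
table = sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(f"{'Genre':<10}{'Global_Sales':>14}")
for genre, total in table:
    bar = "#" * round(total / 2)  # one '#' per two million units
    print(f"{genre:<10}{total:>14.2f}  {bar}")
```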
Something a little bit more complicated
For this example, I used an Olympic Games dataset from Kaggle – Olympics 124 years Dataset(till 2020) | Kaggle
Firstly, a simple question to find out more about the dataset and its metadata.
ChatGPT Question: Tell me about the dataset can you analyse dataset https://www.kaggle.com/datasets/nitishsharma01/olympics-124-years-datasettill-2020
ChatGPT Answer: Sure! The dataset is called “Olympics 124 years dataset till 2020” and it contains information about all Olympic Games from 1896 to 2020. The dataset has 17 columns and 31169 rows, with each row representing a medal-winning athlete in a specific Olympic event.
The columns in the dataset include:
- Athlete: The name of the athlete who won the medal
- Age: The age of the athlete at the time of the Olympic event
- Country: The country of the athlete who won the medal
- Year: The year in which the Olympic event took place
- Closing Ceremony Date: The date of the closing ceremony for the Olympic event
- Sport: The sport in which the athlete won the medal
- Gold Medals: The number of gold medals won by the athlete
- Silver Medals: The number of silver medals won by the athlete
- Bronze Medals: The number of bronze medals won by the athlete
- Total Medals: The total number of medals won by the athlete
- Summer/Winter: Whether the Olympic event was held in the summer or winter
- Gender: The gender of the athlete who won the medal
- Event: The specific event in which the athlete won the medal
- Medal: The type of medal won by the athlete (gold, silver, or bronze)
- Venue: The name of the venue where the Olympic event took place
- City: The city where the Olympic event took place
- Region: The region where the Olympic event took place
With this dataset, it is possible to analyze various trends and patterns in Olympic history, including the performance of individual countries, athletes, and sports over time.
Now let’s do something more complicated:
ChatGPT Question: Can you compare the performance of male and female athletes, looking at the number of medals won by each gender, as well as any trends or differences in performance over time?
ChatGPT Answer: Yes, I can analyze the performance of male and female athletes in the Olympics using the dataset you provided. Here are some insights and trends related to gender differences in Olympic medal-winning:
- Total medals won: According to the dataset, male athletes have won significantly more medals than female athletes. From 1896 to 2020, male athletes won a total of 24,367 medals, while female athletes won 9,606 medals. This means that male athletes have won almost two and a half times more medals than female athletes.
- Distribution of medals: When looking at the distribution of medals won by gender, we can see that male athletes tend to win more gold medals, while female athletes tend to win more silver and bronze medals. This trend holds true across most sports and events.
- Trends over time: When looking at the trends in Olympic medal-winning over time, we can see that the gap between male and female athletes has been narrowing. In the early years of the Olympics, very few events were open to women, which meant that they had fewer opportunities to win medals. However, as more events were added and more countries started sending female athletes, the number of medals won by women has increased steadily over time.
- Recent trends: In recent years, female athletes have been closing the gap with male athletes in terms of overall medal-winning. In the 2016 Rio Olympics, for example, female athletes won almost 45% of the total medals, which is the highest percentage in Olympic history. This trend is expected to continue in the future, as more events become open to women and more countries invest in women’s sports.
Overall, while male athletes have historically won more medals than female athletes, the gap between the two genders has been narrowing over time, and female athletes are becoming increasingly competitive in the Olympics.
What does this mean for corporate and organisational data and how would one go about operationalising ChatGPT in an organisational data analytics environment?
ChatGPT is clearly not going to reach behind the firewall and access data that should not be accessed. And it is not clear what OpenAI (and Microsoft) or Google, with its version of ChatGPT, Bard, have in store for these generative AI engines. Still, one would assume that integration into a corporate environment is at least being considered. Until then, ChatGPT can be used right now to expedite the code that delivers the analysis above.
Let’s use the final question, “Can you compare the performance of male and female athletes, looking at the number of medals won by each gender, as well as any trends or differences in performance over time?” as an example.
Upon considering the usefulness of the analysis, I simply requested to see the code used, and ChatGPT obliged 🙂
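To give a feel for the kind of logic involved, here is a minimal sketch of a medals-by-gender comparison over time. It is not the code ChatGPT returned. The records are made up, and the fields (year, gender, medal) simply follow the column names ChatGPT listed for the dataset.

```python
from collections import Counter

# Hypothetical medal records: (year, gender, medal).
medals = [
    (1900, "Male", "Gold"),
    (1900, "Male", "Silver"),
    (1900, "Male", "Bronze"),
    (1900, "Female", "Gold"),
    (2016, "Male", "Gold"),
    (2016, "Male", "Silver"),
    (2016, "Female", "Gold"),
    (2016, "Female", "Bronze"),
]

# Total medals per gender across all years.
totals = Counter(gender for _, gender, _ in medals)

# Medals per (year, gender), then the female share of medals for each year.
per_year = Counter((year, gender) for year, gender, _ in medals)
years = sorted({year for year, _, _ in medals})
female_share = {
    year: per_year[(year, "Female")]
    / (per_year[(year, "Female")] + per_year[(year, "Male")])
    for year in years
}

print(totals)
print(female_share)
```

A rising female share across the years is the narrowing gap described in the answer.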
Operational steps could therefore be:
1) Create an obfuscated version of the dataset you wish to analyse.
2) Let ChatGPT loose on it to explore and test hypotheses.
3) Identify an analysis worth operationalising.
4) Obtain the code from ChatGPT and use it within more formal data engineering workloads.
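Step 1 could be approached in many ways. One simple option is to replace identifying values with keyed hashes, so that the same value always maps to the same token while the original text is not shared. The sketch below assumes that approach. The key, the column names and the helper names `pseudonymise` and `obfuscate` are hypothetical, and whether this level of obfuscation is sufficient depends on your data and your organisation's policies.

```python
import hashlib
import hmac

# Hypothetical secret key; it must stay inside the organisation.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise(value, key=SECRET_KEY):
    """Map a value to a short, stable token using HMAC-SHA256."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def obfuscate(rows, sensitive):
    """Return copies of rows with the sensitive columns replaced by tokens."""
    return [
        {k: pseudonymise(str(v)) if k in sensitive else v for k, v in row.items()}
        for row in rows
    ]


# Made-up records standing in for organisational data.
rows = [
    {"customer": "Alice Smith", "region": "North", "spend": 120.0},
    {"customer": "Bob Jones", "region": "South", "spend": 75.5},
    {"customer": "Alice Smith", "region": "North", "spend": 30.0},
]

safe = obfuscate(rows, {"customer"})
for row in safe:
    print(row)
```

Because identical inputs produce identical tokens, grouping and counting by customer still works on the obfuscated copy.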
Conclusions
ChatGPT is impressive in terms of how easily and accurately it analysed the datasets I used. But it is still unclear how it would deal with complex datasets containing multiple entities in a relationship with one another. So, for now, a final flat data structure would have to be engineered upstream.
Plus, operationalising the logic will require a multi-step process, although, as stated, bringing these generative AI models into the corporate environment seems like a logical future step.
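As an illustration of what engineering a flat structure upstream could look like, the sketch below joins two related entities into a single table before any analysis. The entities (athletes and results) and their fields are made up for the example.

```python
# Hypothetical parent entity, keyed by athlete id.
athletes = {
    1: {"name": "Athlete A", "country": "AUS"},
    2: {"name": "Athlete B", "country": "NZL"},
}

# Hypothetical child entity that references athletes by id.
results = [
    {"athlete_id": 1, "event": "100m", "medal": "Gold"},
    {"athlete_id": 2, "event": "100m", "medal": "Silver"},
    {"athlete_id": 1, "event": "200m", "medal": "Bronze"},
]

# Flatten: one row per result, carrying the athlete attributes with it.
flat = [
    {**athletes[r["athlete_id"]], "event": r["event"], "medal": r["medal"]}
    for r in results
]

for row in flat:
    print(row)
```

The resulting rows have no foreign keys left to resolve, which is the shape that worked in the examples above.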
It must be emphasised that I have not yet delved into the security aspects of generative AI, including ChatGPT.
I will next delve into some additional functionality and lift the lid further on ChatGPT and generative AI; stay tuned!
This article was originally published here: https://www.makingmeaning.info/post/analyzing-data-using-chatgpt