🎓 Learn Data Analysis with Python and ChatGPT - #10
In the past few weeks, I've noticed a surge in AI-powered tools, apps, and extensions. However, there's not much information on how to use AI for data analysis, specifically with ChatGPT and the new GPT-4 model.
That's why I decided to learn Data Analysis with Python, using only Google Colab and ChatGPT. If you're interested in learning data analysis in just a few minutes without much programming experience, this article is for you.
What Do You Need to Get Started?
A Google Colab account (I tried Jupyter Notebook, but I prefer Colab).
A ChatGPT account (ideally with Plus to use GPT-4, although GPT-3.5 should yield similar results, perhaps slightly less accurate).
A dataset to analyze, preferably with a clear understanding of each column's meaning. The dataset should be in CSV format (comma-separated values). If you use macOS, I recommend uploading your .xlsx file to Google Sheets and then downloading it as a CSV. I had issues converting it to CSV from Excel on macOS.
And that's it! You're ready to begin.
For this example, I used a dataset from a dog subscription box business. The data has been randomized for privacy, but the column structure remains the same.
Starting with a Prompt
Begin by sending ChatGPT a prompt with instructions to upload/load the dataset and explain each column's meaning. It should look something like this:
```Prompt
"Can you help me upload a dataset to Google Colab and explain what each column represents? Here's the dataset information:
1. timestamp: Date and time of purchase
2. customer_id: Unique customer identifier
3. product: Product purchased
4. quantity: Number of products purchased
5. price: Price of the product
6. discount_code: Discount code used (if any)"
```
ChatGPT will provide you with a response like this:
To load the dataset into Google Colab, you can use the following code snippet:
```python
import pandas as pd

# Load the CSV dataset
df = pd.read_csv('dataset.csv')

# Display the first few rows to ensure it's loaded correctly
df.head()
```
Copy this code and paste it into a new cell in Google Colab. Click the "play" button to import the libraries and read the dataset.
Now, you need to specify the correct path for your dataset in Google Colab. If you've uploaded it to a specific folder, you may need to update the code snippet to point to that folder. Change `'dataset.csv'` to your dataset's filename.
After running the code, confirm that your dataset has been loaded correctly by checking the first few rows. If everything looks good, you're ready to proceed with your analysis.
## Data Analysis with ChatGPT and Google Colab
Here are a few common data analysis tasks you can perform with Python and ChatGPT:
### 1. Data Cleaning and Formatting
In my example, the timestamp (date of purchase) wasn't being read as a date. Additionally, I had made the mistake of uploading dates in different formats. Here's how you can ask ChatGPT to fix this:
```Prompt
"ChatGPT, I need help fixing the date format in my dataset. The 'timestamp' column has dates in various formats. Can you provide code to standardize the date format?"
```
ChatGPT will likely respond with something like this:
To standardize the date format, you can use the following code snippet:
```python
# Convert the 'timestamp' column to datetime, inferring the format
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')

# Display the first few rows to ensure the conversion worked
df.head()
```
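Because `errors='coerce'` turns any value pandas cannot parse into `NaT` without raising an error, it can be worth counting how many rows were affected. Recent pandas versions (2.0 and later) also infer one format from the first value, so a column with genuinely mixed formats may lose rows that an element-by-element parse would keep. Here is a minimal sketch of such a check. The sample timestamps and the `parsed` column name are made up for illustration and are not from the real dataset:

```python
import pandas as pd

# Hypothetical sample with mixed date formats and one unparseable value
df = pd.DataFrame({
    "timestamp": ["2023-03-15 10:00:00", "03/16/2023", "March 17, 2023", "not a date"]
})

# Parse each value on its own so the first row's format does not dictate the rest;
# anything unparseable becomes NaT instead of raising an error
df["parsed"] = df["timestamp"].apply(lambda value: pd.to_datetime(value, errors="coerce"))

# Count the rows that failed to parse and show their raw values
failed = df[df["parsed"].isna()]
n_failed = len(failed)
print(f"{n_failed} of {len(df)} timestamps could not be parsed")
print(failed["timestamp"].tolist())
```

Parsing row by row is slower than a single vectorized call, so on a large file it may be better used only to diagnose the problem rows.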
Running the `pd.to_datetime` snippet should convert your dates to a standard format. Make sure to check for any errors or issues after the conversion.
### 2. Analyzing Customer Behavior
Here are a few questions you can ask ChatGPT to analyze customer behavior:
- What is the relationship between the use of discount codes and the amount paid?
- Who are the top 20 customers by number of purchases and total spending? Is there any variable that could explain why they buy or spend more?
- What is the average purchase frequency per year for customers who buy at least once? Which plan corresponds to different frequencies?
ChatGPT will generate code snippets to answer these questions, allowing you to gain insights quickly.
For example, let's analyze the top 20 customers by the number of purchases and total spending:
```Prompt
"ChatGPT, can you help me identify the top 20 customers by the number of purchases and total spending? Can you provide code to analyze whether any variable might explain why they buy or spend more?"
```
ChatGPT's response might look like this:
To find the top 20 customers by the number of purchases, use the following code snippet:
```python
# Group by customer_id and count the number of purchases
top_20_customers = df.groupby('customer_id').size().nlargest(20)

# Display the top 20 customers by number of purchases
top_20_customers
```
To find the top 20 customers by total spending, use this code snippet:
```python
# Group by customer_id and sum the total spending
top_20_spenders = df.groupby('customer_id')['price'].sum().nlargest(20)

# Display the top 20 spenders by total spending
top_20_spenders
```
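The spending snippet sums the `price` column only. If `price` holds the unit price and `quantity` the number of units, total spending would need both. That is an assumption, since the column descriptions do not say whether `price` is per unit or per order line. Here is a sketch under that assumption, with made-up rows and a hypothetical `line_total` column:

```python
import pandas as pd

# Made-up rows using the article's column names (not real data)
df = pd.DataFrame({
    "customer_id": ["A", "A", "B", "C", "C", "C"],
    "quantity":    [1, 2, 1, 3, 1, 1],
    "price":       [20.0, 15.0, 40.0, 10.0, 10.0, 25.0],
})

# Assumption: price is per unit, so the amount paid per row is quantity * price
df["line_total"] = df["quantity"] * df["price"]

# One row per customer: number of purchase rows and total amount spent
summary = df.groupby("customer_id").agg(
    purchases=("line_total", "count"),
    total_spent=("line_total", "sum"),
)

# Top customers by each measure (use 20 instead of 2 on a real dataset)
top_by_purchases = summary.nlargest(2, "purchases")
top_by_spending = summary.nlargest(2, "total_spent")
print(top_by_spending)
```

Keeping both measures in one table makes it easy to spot customers who buy often but spend little, or the reverse.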
After running these code snippets, you can identify the top customers by the number of purchases and total spending.
### 3. Exploring Data Insights
Let's explore the relationship between customer behavior and other variables:
```Prompt
"ChatGPT, I want to explore whether there's a relationship between customer behavior and other variables. Can you provide code to visualize the data and identify potential patterns?"
```
ChatGPT's response might suggest using visualization libraries like Matplotlib or Seaborn to create graphs and plots that reveal trends or patterns.
For example, to visualize the relationship between discount codes and the amount paid, you could use this code snippet:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Create a scatter plot to visualize the relationship
sns.scatterplot(x='discount_code', y='price', data=df)
plt.title('Relationship Between Discount Codes and Price')
plt.show()
```
This scatter plot helps visualize whether there's a correlation between discount code use and the amount paid.
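Since `discount_code` is a label rather than a number, a grouped summary can complement the plot. The sketch below assumes that purchases without a discount have a missing `discount_code`. The sample values and the `discount_status` column are made up for illustration:

```python
import pandas as pd

# Made-up rows using the article's column names (not real data)
df = pd.DataFrame({
    "discount_code": ["WELCOME10", None, "WELCOME10", None, "SPRING5", None],
    "price":         [18.0, 20.0, 27.0, 30.0, 19.0, 25.0],
})

# Assumption: a missing discount_code means no discount was applied
df["discount_status"] = df["discount_code"].notna().map(
    {True: "discount", False: "no discount"}
)

# Compare price statistics for purchases with and without a discount code
by_status = df.groupby("discount_status")["price"].agg(["count", "mean", "median"])
print(by_status)

# Per-code breakdown; dropna=False keeps the "no code" group visible
by_code = df.groupby("discount_code", dropna=False)["price"].mean()
print(by_code)
```

A difference in means here would only describe the sample. It would not show that the discount code caused customers to pay more or less.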
Build a Predictive Model
To build a predictive model, you'd need a more complex approach. If you're interested in learning how to create a predictive model to forecast customer behavior, let me know, and I'll cover it in a future edition of this newsletter.
Here is a prompt for ChatGPT to build a simple predictive model:
```Prompt
"ChatGPT, I'd like to build a simple predictive model to estimate the probability of a customer purchasing in the future. Can you provide code to help me get started?"
```
ChatGPT might respond with a code snippet to create a basic logistic regression model to predict the probability of future purchases.
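As a rough idea of what such a response could contain, here is a minimal logistic regression sketch using scikit-learn. Everything in it is an assumption made for illustration: the per-customer features (`n_purchases`, `days_since_last`), the `bought_again` label, and the randomly generated data. None of it comes from the article's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic per-customer table (hypothetical features, random values)
rng = np.random.default_rng(seed=0)
n = 200
customers = pd.DataFrame({
    "n_purchases": rng.integers(1, 15, size=n),
    "days_since_last": rng.integers(1, 365, size=n),
})

# Synthetic label: frequent, recent buyers are made more likely to buy again
score = 0.3 * customers["n_purchases"] - 0.01 * customers["days_since_last"]
prob = 1 / (1 + np.exp(-score))
customers["bought_again"] = (rng.random(n) < prob).astype(int)

# Hold out a quarter of the customers to check the model on unseen rows
X = customers[["n_purchases", "days_since_last"]]
y = customers["bought_again"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Estimated probability of a future purchase for each held-out customer
purchase_prob = model.predict_proba(X_test)[:, 1]
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

On real data, the label would need to come from purchases observed after a cutoff date, with features computed only from activity before it, so the model does not see the outcome it is meant to predict.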
Wrap-Up
This approach allows you to explore and analyze data using Python and ChatGPT quickly. Whether you're interested in customer behavior, exploring data insights, or building predictive models, these prompts and code snippets can get you started.
What Would You Do?
Now that you know who your top customers are, what would you do with this information? Alternatively, what would you do with customers who have a higher probability of churn?
🦉 Quote of the Week
"We go to work five days per week but spend more than one of those days on average just looking for the information we need to do our work. Half the time, we don’t even succeed in doing that."
Tiago Forte, Building a Second Brain
And that's it for today! If you enjoyed this, let me know in the feedback below. If not, tell me too.
Don't forget to subscribe if you haven't already.
See you next week!