[HAI5016] Week 6: GitHub Copilot
Let’s complete our Data Scientist Toolkit with the star of the show: GitHub Copilot. GitHub Copilot integrates seamlessly with Visual Studio Code, providing helpful suggestions as you write code. It helps you write faster, with fewer errors, and with less effort. During today’s class, we’ll learn how to set up and use GitHub Copilot in Visual Studio Code.
Disclaimer: This blog provides instructions and resources for the workshop part of my lectures. It is not a replacement for attending class; it may not include some critical steps and the foundational background of the techniques and methodologies used. The information may become outdated over time as I do not update the instructions after class.
1. Setting Up GitHub Copilot in VS Code
To start, we need to install the GitHub Copilot extension and its companion, GitHub Copilot Chat, in Visual Studio Code.
Steps:
Search for GitHub Copilot in the extension pane in VS Code, or click here
Check that the GitHub Copilot icon appears in the status bar at the bottom of the window, which indicates that GitHub Copilot is active.
Click the GitHub Copilot icon in the status bar to open the Copilot action palette. Here you can find settings that control Copilot's behavior, such as enabling or disabling suggestions, as well as the Copilot documentation.
You should also see the Chat icon in the Activity Bar, which you can use to chat with Copilot.
2. Working with GitHub Copilot: Titanic on steroids
Download the Titanic training dataset here and save it into your HAI5016 folder.
2.1 Loading the Dataset
- Create a new notebook named titanic-copilot.ipynb
- Add a Markdown cell titled # Predicting Titanic Survivors and render it with Ctrl + Enter
- Create a new code cell, press Ctrl + I and ask Copilot:
Load my titanic dataset *(titanic-data\titanic-train.csv)* into a dataframe
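If you want something to compare Copilot's suggestion against, it will most likely boil down to a single pandas read_csv call. The sketch below is only an illustration: it assumes the file follows the standard Kaggle Titanic column layout, and it reads a made-up three-row sample from memory so that it runs even without the real file.

```python
import io

import pandas as pd

# In the notebook you would point pandas at the downloaded file instead, e.g.
#   df = pd.read_csv("titanic-data/titanic-train.csv")
# Forward slashes also work on Windows; in a normal Python string the "\t" in
# "titanic-data\titanic-train.csv" would be read as a tab character.

# Made-up sample rows (NOT real passengers), assuming the Kaggle column layout.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,0,3,Passenger One,male,22,1,0,T-001,7.25,,S\n"
    "2,1,1,Passenger Two,female,38,1,0,T-002,71.28,C85,C\n"
    "3,1,3,Passenger Three,female,,0,0,T-003,7.92,,S\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)  # (number of rows, number of columns)
```

Empty fields, such as the missing Age in the third sample row, are read as NaN, which matters later when we fill in missing values.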
2.2 Exploring the Dataset with Copilot
Once the dataframe is loaded, create new code cells and prompt Copilot to explore the data:
Show me the first 5 records
Show me the last 5 records
Display the column data types
Generate various descriptive statistics on the DataFrame
Adjust the code to your liking before you move on to the next step. Copilot uses the code already in your notebook as context, so its later completions tend to follow your own style.
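For reference, the four prompts above usually map onto four one-line pandas calls. This is a sketch of what to expect, not Copilot's guaranteed output, and the small DataFrame is invented purely so the example runs on its own.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic training data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 14.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 30.07],
})

print(df.head(5))      # first 5 records
print(df.tail(5))      # last 5 records
print(df.dtypes)       # data type of each column
print(df.describe())   # count, mean, std, min, quartiles, max per numeric column
```

Note that describe() reports a count per column that excludes missing values, so it is a quick way to spot columns with gaps.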
Create a Markdown cell and ask Copilot:
Explain what each column in the dataset represents.
Create a new code cell and ask Copilot:
Set up a grid of plots and plot the following:
- death and survival counts
- Pclass counts
- Sex counts
- Embarked counts
- Age histogram
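A grid like the one requested above can be built with matplotlib subplots. The following is a minimal sketch under a few assumptions: a 2 x 3 layout (Copilot may choose differently), bar charts for the count plots, and an invented toy DataFrame in place of the real data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows standing in for the Titanic training data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 2],
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "S", "Q", "C", "S", None],
    "Age": [22.0, 38.0, 26.0, 35.0, None, 14.0, 54.0, 2.0],
})

# Five plots fit into a 2 x 3 grid; ravel() flattens the grid into a 1-D array.
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
axes = axes.ravel()

# One bar chart of value counts per categorical column.
for ax, column in zip(axes, ["Survived", "Pclass", "Sex", "Embarked"]):
    df[column].value_counts().sort_index().plot(kind="bar", ax=ax, title=f"{column} counts")

# Histogram of the known ages (missing ages are dropped first).
df["Age"].dropna().plot(kind="hist", bins=10, ax=axes[4], title="Age histogram")

axes[5].set_visible(False)  # the sixth panel is not needed
fig.tight_layout()
```

In a notebook the figure renders below the cell once the cell finishes running.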
2.3 Feature 1: Passenger Class
Open a new Markdown cell and write the header
## Feature: Passenger Class
- Create a new code cell and add the following comment:
# From our exploratory data analysis in the previous section, we see there are three passenger classes: First, Second, and Third class. We'll determine which proportion of passengers survived based on their passenger class
Then press Ctrl + I and ask:
Generate a cross tab of Pclass and Survived
Click to select the cell with the crosstab output. Then, open the Copilot chat and ask:
@workspace /explain please explain what the crosstab is actually telling me
- Open a new code cell and ask:
Normalize the cross tab to sum to 1, and then plot the cross tab
Can we see if passenger class has a significant impact on whether a passenger survived?
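To make the two crosstab prompts concrete, here is a small sketch using pandas crosstab. It reads "sum to 1" as "each passenger class row sums to 1", which is one possible interpretation, and the ten rows are invented, so the numbers say nothing about the real Titanic data.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use the loaded training data.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Raw counts: rows are passenger classes, columns are 0 (died) and 1 (survived).
counts = pd.crosstab(df["Pclass"], df["Survived"])

# Row-normalized version: each row now holds proportions that sum to 1.
rates = pd.crosstab(df["Pclass"], df["Survived"], normalize="index")

print(counts)
print(rates)

# In the notebook, a stacked bar chart makes the proportions easy to compare:
#   rates.plot(kind="bar", stacked=True)
```

Comparing the column for Survived = 1 across the rows of the normalized table is what answers the question about passenger class.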
2.4 Feature 2: Sex
Now let’s analyze the effect of gender on survival:
Open a new Markdown cell and write the header
## Feature: Sex
Add a code cell, press Ctrl + I and ask:
Generate a mapping of Sex from a string to a number representation in a new column called "Sex_Val". Map 0 for female and 1 for male
Create a new code cell and ask:
Plot a normalized cross tab for Sex_Val and Survived
Create a new code cell and ask:
Make two plots for each sex that shows the survival rate per Pclass
Who do you think had the highest chance of survival?
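The steps in this section can be sketched with a dictionary mapping, a normalized crosstab, and a groupby. The DataFrame below is a made-up stand-in, so its survival rates are illustrative only; the column name Sex_Val and the 0/1 coding follow the prompt above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the Titanic training data.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female", "male", "female"],
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 1],
})

# Map the strings to numbers: 0 for female, 1 for male.
df["Sex_Val"] = df["Sex"].map({"female": 0, "male": 1})

# Proportion of deaths (0) and survivals (1) within each sex.
sex_rates = pd.crosstab(df["Sex_Val"], df["Survived"], normalize="index")

# Survival rate per sex and passenger class; since Survived is 0/1, the mean is the rate.
rate_by_sex_class = df.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()

print(sex_rates)
print(rate_by_sex_class)
```

Each row of rate_by_sex_class can then be plotted as its own bar chart to get the two plots per sex.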
2.5 Feature 3: Embarked
The Embarked column might be an important feature, but it is missing a couple of data points, which might pose a problem for machine learning algorithms.
- Add a Markdown cell:
## Feature: Embarked
- Create new code cells and ask Copilot to:
Show the records where Embarked is missing
Get the unique values of Embarked and map them to numbers (no decimals) in the column Embarked_Val
Plot the histogram for Embarked_Val
Assign the most common value (‘S’) to missing entries
Plot a normalized cross tab for Embarked_Val and Survived
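Here is one way the Embarked steps could look in pandas. Note one deliberate difference from the prompt order above: this sketch fills the missing entries before mapping, so that Embarked_Val can be a whole-number column right away. The specific codes (0, 1, 2 in alphabetical order) are an assumption, and the rows are invented.

```python
import pandas as pd

# Hypothetical toy rows; None marks a missing port of embarkation.
df = pd.DataFrame({
    "Embarked": ["S", "C", None, "Q", "S", None, "S"],
    "Survived": [0, 1, 1, 0, 0, 1, 1],
})

# Records where Embarked is missing.
missing = df[df["Embarked"].isna()]

# Fill the gaps with the most common value (which is 'S' in this toy sample too).
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)

# Map each unique port to an integer code, in alphabetical order.
embarked_codes = {port: code for code, port in enumerate(sorted(df["Embarked"].unique()))}
df["Embarked_Val"] = df["Embarked"].map(embarked_codes).astype(int)

# Proportion of deaths and survivals per port code.
embarked_rates = pd.crosstab(df["Embarked_Val"], df["Survived"], normalize="index")

print(missing)
print(embarked_codes)
print(embarked_rates)
```

If you map first and fill afterwards, as the prompts suggest, the missing entries turn Embarked_Val into a decimal column until they are filled, which is why the prompt asks for numbers without decimals.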
2.6 Feature 4: Age
The Age column seems like an important feature; unfortunately, it is missing many values. We’ll need to fill in the missing values like we did with Embarked.
- Add a Markdown cell:
## Feature: Age
- Then, in code cells, prompt Copilot to:
- Show a histogram of Age
- Show the records where Age is missing
- Fill missing values with the median age in a new column AgeFill
- Plot a normalized crosstab of AgeFill and Survived
- Plot AgeFill density by Pclass
- Analyze age by passenger class
- Make a table with the median age for each Pclass
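The non-plotting parts of this list can be sketched as follows. The sketch fills every gap with the overall median age, as the prompt describes; the toy ages are invented, so the medians below are not those of the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; None marks a missing age.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Age": [40.0, None, 30.0, 28.0, 20.0, None, 24.0],
})

# Records where Age is missing.
missing_age = df[df["Age"].isna()]

# Keep the original Age column and put the filled version in AgeFill.
median_age = df["Age"].median()  # median() skips missing values
df["AgeFill"] = df["Age"].fillna(median_age)

# Table with the median known age for each passenger class.
median_by_class = df.groupby("Pclass")["Age"].median()

print(missing_age)
print(median_age)
print(median_by_class)
```

The per-class table hints at a possible refinement: if the median age differs a lot between classes, filling each gap with the median of that passenger's own class may be a better guess than one overall median.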
2.7 Final preparation
Many machine learning algorithms do not work on strings and they usually require the data to be in an array, not a DataFrame.
- Ask Copilot to:
- Show the columns in the dataframe with dtypes object and drop them
- Create a dataframe called df_train with the columns Survived, Pclass, Fare, Sex_Val, Embarked_Val, AgeFill and FamilySize
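A rough sketch of this preparation step follows, with two assumptions flagged loudly. First, FamilySize has not been created anywhere in the steps above; the sketch assumes it is SibSp + Parch, a common choice for this dataset, but that definition is a guess. Second, instead of selecting dtype object literally, it drops every non-numeric column, which has the same effect here and also covers the newer pandas string dtype. The rows are invented.

```python
import pandas as pd

# Hypothetical toy rows with the engineered columns from the previous sections.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["Passenger One", "Passenger Two", "Passenger Three", "Passenger Four"],
    "Sex": ["male", "female", "female", "male"],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
    "Fare": [7.25, 71.28, 7.92, 13.00],
    "Embarked": ["S", "C", "S", "Q"],
    "Sex_Val": [1, 0, 0, 1],
    "Embarked_Val": [2, 0, 2, 1],
    "AgeFill": [22.0, 38.0, 28.0, 35.0],
})

# ASSUMPTION: FamilySize = siblings/spouses + parents/children aboard.
df["FamilySize"] = df["SibSp"] + df["Parch"]

# Show the non-numeric (string) columns, then drop them.
non_numeric = df.select_dtypes(exclude="number").columns
print(list(non_numeric))
df = df.drop(columns=non_numeric)

# Target first, then the features, in the order listed above.
df_train = df[["Survived", "Pclass", "Fare", "Sex_Val", "Embarked_Val", "AgeFill", "FamilySize"]]

# Plain NumPy array, for algorithms that do not accept a DataFrame.
train_data = df_train.to_numpy()
print(train_data.shape)
```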
2.8 Random Forest: Training
Train a machine learning model to predict Titanic survivors. Ask Copilot to:
Write a Python script using scikit-learn to train a Random Forest classifier.
The classifier should use 100 decision trees (n_estimators=100).
Use df_train where the first column ('Survived') is the target, and the remaining columns are the features.
Fit the model to the training data and calculate the mean accuracy of the classifier.
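The prompt above typically leads to something like the sketch below, which uses scikit-learn's RandomForestClassifier. The ten training rows are invented so the example is self-contained, and random_state=0 is an addition of mine to make the run repeatable; it is not part of the prompt.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy version of df_train; the first column is the target.
df_train = pd.DataFrame({
    "Survived":     [0, 1, 1, 0, 0, 1, 0, 1, 1, 0],
    "Pclass":       [3, 1, 3, 1, 3, 2, 3, 2, 1, 3],
    "Fare":         [7.25, 71.28, 7.92, 53.10, 8.05, 30.07, 8.46, 16.70, 51.86, 21.07],
    "Sex_Val":      [1, 0, 0, 1, 1, 0, 1, 0, 0, 1],
    "Embarked_Val": [2, 0, 2, 2, 1, 0, 2, 2, 2, 2],
    "AgeFill":      [22.0, 38.0, 26.0, 35.0, 28.0, 14.0, 54.0, 4.0, 58.0, 2.0],
    "FamilySize":   [1, 1, 0, 1, 0, 1, 0, 2, 0, 4],
})

# First column is the target, the remaining columns are the features.
y = df_train.iloc[:, 0]
X = df_train.iloc[:, 1:]

# Random Forest with 100 decision trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# score() returns the mean accuracy on the data it is given.
train_accuracy = clf.score(X, y)
print(f"Mean accuracy on the training data: {train_accuracy:.3f}")
```

Keep in mind that this accuracy is measured on the same rows the model was trained on, so it tends to be optimistic; it does not tell us how well the model predicts passengers it has never seen.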
2.9 Random Forest: Prediction
Download the titanic test dataset here and save it into your HAI5016 folder. Then ask Copilot to:
- Load the test data titanic-test.csv into a DataFrame and fill in missing values
- Make predictions using the Random Forest classifier
- Write the predictions to a CSV file
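To round things off, here is a hedged sketch of the prediction step. It assumes the test data has already been given the same engineered columns as the training data (Sex_Val, Embarked_Val, AgeFill, FamilySize), fills a missing Fare with the median as one possible strategy, and writes to a hypothetical file name, titanic-predictions.csv, in the system's temporary folder. All rows are invented.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ["Pclass", "Fare", "Sex_Val", "Embarked_Val", "AgeFill", "FamilySize"]

# Hypothetical toy training data and a classifier fitted as in the previous step.
df_train = pd.DataFrame({
    "Survived":     [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":       [3, 1, 3, 1, 3, 2, 3, 2],
    "Fare":         [7.25, 71.28, 7.92, 53.10, 8.05, 30.07, 8.46, 16.70],
    "Sex_Val":      [1, 0, 0, 1, 1, 0, 1, 0],
    "Embarked_Val": [2, 0, 2, 2, 1, 0, 2, 2],
    "AgeFill":      [22.0, 38.0, 26.0, 35.0, 28.0, 14.0, 54.0, 4.0],
    "FamilySize":   [1, 1, 0, 1, 0, 1, 0, 2],
})
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(df_train[features], df_train["Survived"])

# Hypothetical toy test data; note there is no Survived column and one Fare is missing.
df_test = pd.DataFrame({
    "PassengerId":  [892, 893, 894],
    "Pclass":       [3, 1, 2],
    "Fare":         [7.83, None, 9.69],
    "Sex_Val":      [1, 0, 1],
    "Embarked_Val": [1, 2, 1],
    "AgeFill":      [34.5, 47.0, 62.0],
    "FamilySize":   [0, 1, 0],
})

# Fill missing values (here: the median of the known fares).
df_test["Fare"] = df_test["Fare"].fillna(df_test["Fare"].median())

# Predict and pair each prediction with its passenger id.
predictions = pd.DataFrame({
    "PassengerId": df_test["PassengerId"],
    "Survived": clf.predict(df_test[features]),
})

# Written to the temp folder here; in class you would save into your HAI5016 folder.
out_path = Path(tempfile.gettempdir()) / "titanic-predictions.csv"
predictions.to_csv(out_path, index=False)
print(f"Wrote {len(predictions)} predictions to {out_path}")
```

The feature columns passed to predict() must match the ones used for fit(), in name and order, which is why both calls reuse the same features list.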