[HAI5016] Week 6: GitHub Copilot

Let’s complete our Data Scientist Toolkit with the star of the show: GitHub Copilot. GitHub Copilot integrates with Visual Studio Code and suggests code as you type. It can help you write code faster, with fewer errors, and with less effort. During today’s class, we’ll learn how to set up and use GitHub Copilot in Visual Studio Code.

Disclaimer: This blog provides instructions and resources for the workshop part of my lectures. It is not a replacement for attending class; it may not include some critical steps and the foundational background of the techniques and methodologies used. The information may become outdated over time as I do not update the instructions after class.

1. Setting Up GitHub Copilot in VS Code

To start, we need to install the GitHub Copilot extension and its companion, GitHub Copilot Chat, in Visual Studio Code.

Steps:

  1. Search for GitHub Copilot in the Extensions view in VS Code and install it, or click here

  2. Check that the GitHub Copilot icon appears in the status bar at the bottom of the window, which indicates that GitHub Copilot is active.

  3. Click the GitHub Copilot icon in the status bar to open the Copilot action palette. Here you can find settings that control Copilot’s behavior, such as enabling or disabling suggestions, and links to the Copilot documentation.

  4. You should also see the Chat icon in the Activity Bar, which you can use to chat with Copilot.

2. Working with GitHub Copilot: Titanic on steroids

Download the Titanic training dataset here and save it into your HAI5016 folder.

2.1 Loading the Dataset

  1. Create a new notebook named titanic-copilot.ipynb
  2. Add a Markdown cell titled # Predicting Titanic Survivors and render it with Ctrl + Enter
  3. Create a new code cell, press Ctrl + I and ask Copilot:
    
      Load my titanic dataset *(titanic-data\titanic-train.csv)* into a dataframe
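
For comparison, the code Copilot suggests for this prompt will probably resemble the minimal sketch below. The helper name `load_titanic` is hypothetical, and the folder layout is taken from the prompt above.

```python
from pathlib import Path

import pandas as pd


def load_titanic(csv_path):
    """Read a Titanic CSV file into a pandas DataFrame."""
    return pd.read_csv(Path(csv_path))


# Usage in the notebook, assuming the CSV sits in a titanic-data folder
# next to the notebook:
# df = load_titanic("titanic-data/titanic-train.csv")
```

Using a forward slash in the path keeps the call portable between Windows, macOS and Linux.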
    

2.2 Exploring the Dataset with Copilot

Once the dataframe is loaded, create new code cells and prompt Copilot to explore the data:

  • Show me the first 5 records
  • Show me the last 5 records
  • Display the column data types
  • Generate various descriptive statistics on the DataFrame

    Adjust the code to your liking before you move on to the next step. Copilot will learn from your input and will provide completions that are more tailored to your own style.
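
The four exploration prompts above map onto standard pandas calls. The sketch below uses a tiny made-up DataFrame as a stand-in so it runs on its own; in the notebook, `df` would be the DataFrame loaded from the training CSV.

```python
import pandas as pd

# Tiny made-up stand-in so the sketch runs on its own; in the notebook,
# `df` is the DataFrame loaded from titanic-train.csv.
df = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 0, 1],
        "Pclass": [3, 1, 3, 1, 3, 2],
        "Sex": ["male", "female", "female", "male", "male", "female"],
        "Age": [22.0, 38.0, 26.0, None, 35.0, 14.0],
    }
)

first_rows = df.head(5)    # first 5 records
last_rows = df.tail(5)     # last 5 records
column_types = df.dtypes   # data type of each column
summary = df.describe()    # descriptive statistics for the numeric columns

print(first_rows, last_rows, column_types, summary, sep="\n\n")
```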

Create a Markdown cell and ask Copilot:

Explain what each column in the dataset represents.

Create a new code cell and ask Copilot:

Set up a grid of plots and plot the following:
- death and survival counts
- Pclass counts
- Sex counts
- Embarked counts
- Age histogram
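
One way the requested grid could be built is sketched below with matplotlib. The function name `plot_overview`, the 2x3 layout and the bin count are assumptions, and Copilot's suggestion may look quite different.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_overview(df):
    """Draw a 2x3 grid: four count bar charts plus an Age histogram."""
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    axes = axes.ravel()

    count_columns = ["Survived", "Pclass", "Sex", "Embarked"]
    for ax, column in zip(axes, count_columns):
        # value_counts() skips missing values by default
        df[column].value_counts().sort_index().plot(kind="bar", ax=ax)
        ax.set_title(f"{column} counts")

    df["Age"].dropna().plot(kind="hist", bins=20, ax=axes[4])
    axes[4].set_title("Age histogram")

    axes[5].set_visible(False)  # the sixth slot of the grid is unused
    fig.tight_layout()
    return fig
```

In a notebook cell, calling `plot_overview(df)` as the last statement displays the figure.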

2.3 Feature 1: Passenger Class

  1. Open a new Markdown cell and write the header ## Feature: Passenger Class

  2. Create a new code cell and add the following comment:
    
      # From our exploratory data analysis in the previous section, we see there are three passenger classes: First, Second, and Third class. We'll determine which proportion of passengers survived based on their passenger class
    
  3. Then press Ctrl + I and ask: Generate a cross tab of Pclass and Survived

  4. Click to select the cell with the crosstab output. Then, open Copilot Chat and ask: @workspace /explain please explain to me what the crosstab is actually telling me

  5. Open a new code cell and ask: Normalize the cross tab to sum to 1, and then plot the cross tab

Can we see if passenger class has a significant impact on whether a passenger survived?
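
The crosstab steps in this section can be sketched with `pandas.crosstab`. The rows below are made up for illustration, and normalizing per row (`normalize="index"`) is one interpretation of "sum to 1": each passenger class then adds up to 1, so the values read as survival and death rates per class.

```python
import pandas as pd

# Made-up rows for illustration only; use the real `df` in the notebook.
df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    }
)

# Raw counts: one row per class, one column per outcome (0 = died, 1 = survived)
counts = pd.crosstab(df["Pclass"], df["Survived"])

# Row-normalized: every class sums to 1
rates = pd.crosstab(df["Pclass"], df["Survived"], normalize="index")

# Stacked bars make the share of survivors per class easy to compare
ax = rates.plot(kind="bar", stacked=True)
ax.set_ylabel("Proportion of passengers")
```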

2.4 Feature 2: Sex

Now let’s analyze the effect of gender on survival:

  1. Open a new Markdown cell and write the header ## Feature: Sex

  2. Add a code cell, press Ctrl + I and ask: Generate a mapping of Sex from a string to a number representation in a new column called "Sex_Val". Map 0 for female and 1 for male

  3. Create a new code cell and ask: Plot a normalized cross tab for Sex_Val and Survived

  4. Create a new code cell and ask: Make two plots for each sex that shows the survival rate per Pclass

Who do you think had the best chance of survival?
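
As a rough sketch of the mapping and the per-sex plots in this section, assuming the `Sex` column only contains the strings `female` and `male` (the helper names are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd


def add_sex_val(df):
    """Return a copy with Sex mapped to integers: female -> 0, male -> 1."""
    out = df.copy()
    out["Sex_Val"] = out["Sex"].map({"female": 0, "male": 1})
    return out


def plot_survival_by_class_per_sex(df):
    """One bar chart per sex showing the survival rate in each Pclass."""
    # Survived is 0/1, so its mean per group is the survival rate
    rates = df.groupby(["Sex_Val", "Pclass"])["Survived"].mean()
    sex_values = sorted(df["Sex_Val"].unique())
    fig, axes = plt.subplots(
        1, len(sex_values), figsize=(10, 4), sharey=True, squeeze=False
    )
    for ax, sex_val in zip(axes[0], sex_values):
        rates.loc[sex_val].plot(kind="bar", ax=ax)
        ax.set_title(f"Sex_Val = {sex_val}")
        ax.set_ylabel("Survival rate")
    fig.tight_layout()
    return fig
```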

2.5 Feature 3: Embarked

The Embarked column might be an important feature, but it is missing a couple of values, which might pose a problem for machine learning algorithms.

  1. Add a Markdown cell: ## Feature: Embarked
  2. Create new code cells and ask Copilot to:
    • Show the records where Embarked is missing
    • Get the unique values of Embarked and map them to numbers (no decimals) in the column Embarked_Val
    • Plot the histogram for Embarked_Val
    • Assign the most common value (‘S’) to missing entries
    • Plot a normalized cross tab for Embarked_Val and Survived
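
The Embarked steps could be sketched as follows. Note one deliberate difference from the order of the prompts above: this sketch fills the missing entries before building the mapping, so that `Embarked_Val` comes out as whole numbers directly. It also computes the most common port rather than hard-coding it. The helper name `add_embarked_val` is hypothetical.

```python
import pandas as pd


def add_embarked_val(df):
    """Fill missing Embarked values and add an integer-coded Embarked_Val column."""
    out = df.copy()

    # Most common port of embarkation ('S' in the training data, per the text above)
    most_common = out["Embarked"].mode().iloc[0]
    out["Embarked"] = out["Embarked"].fillna(most_common)

    # Map each unique port to an integer, in alphabetical order
    ports = sorted(out["Embarked"].unique())
    mapping = {port: number for number, port in enumerate(ports)}
    out["Embarked_Val"] = out["Embarked"].map(mapping).astype(int)
    return out, mapping


# Showing the records with a missing value, before filling:
# df[df["Embarked"].isna()]
```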

2.6 Feature 4: Age

The Age column seems like an important feature, but unfortunately it is missing many values. We’ll need to fill in the missing values like we did with Embarked.

  1. Add a Markdown cell: ## Feature: Age. Then in code cells, prompt Copilot to:
    • Show a histogram of Age
    • Show the records where Age is missing
    • Fill missing values with the median age in a new column AgeFill
    • Plot a normalized crosstab of AgeFill and Survived
    • Plot AgeFill density by Pclass
    • Analyze age by passenger class
    • Make a table with the median age for each Pclass
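
A minimal sketch of the fill step and the per-class table, assuming the overall median is used as the prompt above asks (filling with a per-class median would be a possible refinement). The helper names are hypothetical.

```python
import pandas as pd


def add_age_fill(df):
    """Return a copy with AgeFill: Age with missing values replaced by the median age."""
    out = df.copy()
    out["AgeFill"] = out["Age"].fillna(out["Age"].median())
    return out


def median_age_by_class(df):
    """Median of the known ages for each passenger class."""
    return df.groupby("Pclass")["Age"].median()


# Showing the records where Age is missing:
# df[df["Age"].isna()]
```

Keeping the original `Age` column next to `AgeFill` makes it easy to check later which values were imputed.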

2.7 Final preparation

Many machine learning algorithms do not work on strings and they usually require the data to be in an array, not a DataFrame.

  1. Ask Copilot to:
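    • Show the columns in the dataframe with dtype object and drop them
    • Create a dataframe called df_train with the columns Survived, Pclass, Fare, Sex_Val, Embarked_Val, AgeFill and FamilySize (FamilySize is not created in the earlier steps, so ask Copilot to add it first, for example as SibSp + Parch)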
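
The preparation step might look like the sketch below. `FamilySize` is not defined earlier in this post, so deriving it as `SibSp + Parch` is an assumption, and the helper name `build_training_frame` is hypothetical.

```python
import pandas as pd

TRAIN_COLUMNS = [
    "Survived", "Pclass", "Fare", "Sex_Val", "Embarked_Val", "AgeFill", "FamilySize",
]


def build_training_frame(df):
    """Drop string columns and keep the target plus the engineered features."""
    out = df.copy()

    # Assumption: family size = siblings/spouses + parents/children aboard
    out["FamilySize"] = out["SibSp"] + out["Parch"]

    # Columns stored as Python objects (typically strings) cannot be fed to the model
    object_columns = out.select_dtypes(include="object").columns
    out = out.drop(columns=object_columns)

    # Survived first, so the target is the first column
    return out[TRAIN_COLUMNS]


# If an algorithm needs a plain array instead of a DataFrame:
# train_array = build_training_frame(df).to_numpy()
```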

2.8 Random Forest: Training

Train a machine learning model to predict Titanic survivors. Ask Copilot to:

Write a Python script using scikit-learn to train a Random Forest classifier.
The classifier should use 100 decision trees (n_estimators=100). 
Use df_train where the first column ('Survived') is the target, and the remaining columns are the features. 
Fit the model to the training data and calculate the mean accuracy of the classifier.

2.9 Random Forest: Prediction

Download the Titanic test dataset here and save it into your HAI5016 folder. Then ask Copilot to:

  1. Load the test data titanic-test.csv into a DataFrame and fill in missing values
  2. Make predictions using the Random Forest classifier
  3. Write the predictions to a CSV file
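
The prediction steps can be sketched as below. The sketch assumes the test DataFrame has already been given the same engineered feature columns as `df_train`, and that it has a `PassengerId` column like the Kaggle Titanic files. Filling any remaining gaps with column medians is one simple choice among several. The function name and the output path are hypothetical.

```python
import pandas as pd


def predict_and_save(clf, df_test, feature_columns, output_path):
    """Predict survival for the test passengers and write the result to a CSV file."""
    X_test = df_test[feature_columns]

    # Simple fallback: fill any remaining missing values with the column median
    X_test = X_test.fillna(X_test.median())

    predictions = clf.predict(X_test)
    result = pd.DataFrame(
        {"PassengerId": df_test["PassengerId"], "Survived": predictions}
    )
    result.to_csv(output_path, index=False)
    return result


# Usage, with a hypothetical output file name:
# features = list(df_train.columns[1:])
# predict_and_save(clf, df_test, features, "titanic-predictions.csv")
```

The feature columns must be passed in the same order that was used for training, which is why the usage example takes them from `df_train`.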

This post is licensed under CC BY 4.0 by the author.