[HAI5016] Week 3: Git and GitHub
A modern data scientist uses Git to store code, track changes and versions, and collaborate with their team. Visual Studio Code has built-in source control management (SCM) with Git support included by default. Additional source control providers can be accessed through extensions available on the VS Code Marketplace.
Disclaimer: This blog provides instructions and resources for the workshop part of my lectures. It is not a replacement for attending class; it may not include some critical steps and the foundational background of the techniques and methodologies used. The information may become outdated over time as I do not update the instructions after class.
To use Git and GitHub in VS Code, first ensure Git is installed on your computer. If Git is not installed, the Source Control view will provide instructions on how to set it up. After installing Git, be sure to restart VS Code.
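A quick way to check whether Git is already installed is to ask for its version in a terminal. This is only a small sketch; the variable name git_version is my own choice.

```shell
# Ask Git for its version. An error such as "command not found" means Git is not installed.
git_version="$(git --version)"
echo "$git_version"
```

If a version string is printed, Git is available and VS Code should detect it after a restart.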
1. Create a repository for your project
With your project folder open in VS Code, click the Source Control button in the left menu bar.
Click ‘Initialize Repository’. The repo status indicator in the bottom left should show the name of the active branch, main. A green U should also appear next to the filenames in your folder, meaning that the files (or any changes to them) are untracked and not yet added to the repository.
Right-click our data file, titanic3.csv, and click ‘Add to .gitignore’. Then open the newly created .gitignore file to see what was added.
Type first in the message field and click the blue Commit button.
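For reference, roughly the same steps can be done from a terminal. The following is a minimal sketch, not necessarily what VS Code runs internally. It works in a throwaway folder, uses a placeholder copy of titanic3.csv and a hypothetical analysis.py, and passes a placeholder name and email for the commit.

```shell
set -e
# Work in a throwaway folder so nothing in your real project is touched.
project_dir="$(mktemp -d)"
cd "$project_dir"
printf 'pclass,survived\n1,1\n' > titanic3.csv   # placeholder for the data file
printf 'print("hello")\n' > analysis.py          # hypothetical script

git init --quiet                     # like 'Initialize Repository'
echo "titanic3.csv" >> .gitignore    # like 'Add to .gitignore'
git add .                            # stage everything that is not ignored
git -c user.name="Student" -c user.email="student@example.com" \
    commit --quiet -m "first"        # placeholder identity, message as in the workshop
git branch -M main                   # make sure the branch is called main

echo "Tracked files:"
git ls-files                         # titanic3.csv should not be listed
```

Because titanic3.csv is listed in .gitignore before staging, it stays on disk but never enters the repository.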
2. Push the repository to GitHub
In the repo status indicator in the bottom left, press the Publish to GitHub icon, which is an upward arrow into a cloud.
A browser window will open. Sign in to GitHub and press ‘Open’ to return to VS Code.
The Command Palette should now show two options. Select ‘Publish to GitHub Private Repository’.
Open GitHub and find your first published private repo.
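Conceptually, publishing comes down to registering a remote named origin and pushing the current branch to it. The sketch below illustrates this with a local bare repository standing in for GitHub, so it runs without an account or network access. That stand-in, all paths, and the commit identity are assumptions made for the example; with a real repository you would use its GitHub URL instead.

```shell
set -e
workspace="$(mktemp -d)"
# A bare repository plays the role of the empty GitHub repo (assumption for this sketch).
git init --quiet --bare "$workspace/remote.git"

# A small local repository with one commit on main.
git init --quiet "$workspace/project"
cd "$workspace/project"
echo "# demo" > README.md
git add README.md
git -c user.name="Student" -c user.email="student@example.com" \
    commit --quiet -m "first"
git branch -M main

# With GitHub, this would be the URL shown on your repository page.
git remote add origin "$workspace/remote.git"
git push --quiet -u origin main    # -u records origin/main as the upstream branch

upstream="$(git rev-parse --abbrev-ref 'main@{upstream}')"
echo "main now tracks: $upstream"
```

Once an upstream is set, later pushes and pulls need no extra arguments.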
3. Explore Git source control
- Source Control Graph: shows a graphical representation of the commits that are incoming and outgoing
- Git Status Bar actions: easily pull remote changes down to your local repository and then push local commits to the upstream branch
- Gutter indicators: a red triangle indicates where lines have been deleted, a green bar indicates newly added lines and a blue bar indicates modified lines. You can click them to see what has actually changed.
- Timeline view: a unified view for visualizing time-series events (for example, Git commits) of your file
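The views above have rough command-line counterparts, which can help when interpreting what VS Code displays. Here is a minimal sketch in a throwaway repository; the file notes.txt, the helper function commit, and the commit identity are hypothetical.

```shell
set -e
export GIT_PAGER=cat               # print output directly instead of opening a pager
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
# Helper that commits with a placeholder identity.
commit() { git -c user.name="Student" -c user.email="student@example.com" commit --quiet "$@"; }

printf 'line one\nline two\n' > notes.txt
git add notes.txt
commit -m "first"
printf 'line one\nline 2\nline three\n' > notes.txt   # change one line, add one

changed="$(git status --short)"    # roughly the Source Control file list
echo "$changed"
git diff -- notes.txt              # the changes behind the gutter indicators
commit -am "second"
git log --oneline --graph --all    # a text version of the Source Control Graph
git log --oneline -- notes.txt     # commits touching one file, like the Timeline view
```

In the short status output, M marks a modified file, much like the letter shown next to filenames in the Source Control view.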
If you want
You can exclude the output cells of Jupyter notebooks from your Git commits. I found this post on GitHub Gist that recommends doing the following:
- Add a filter to the Git config by running the following command in bash inside the repo:
git config filter.strip-notebook-output.clean 'jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR'
- Create a .gitattributes file inside the directory with the notebooks.
- Add the following to that file:
*.ipynb filter=strip-notebook-output
After that, commit to git as usual. The notebook output will be stripped out in git commits, but it will remain unchanged locally.
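To confirm that the attribute is picked up, you can ask Git which filter applies to a given path. This is a small sketch in a throwaway repository; analysis.ipynb and script.py are hypothetical file names, and the files do not need to exist for this check.

```shell
set -e
check_dir="$(mktemp -d)"
cd "$check_dir"
git init --quiet
echo "*.ipynb filter=strip-notebook-output" > .gitattributes

# check-attr reports the value of an attribute for a path.
notebook_filter="$(git check-attr filter -- analysis.ipynb)"
script_filter="$(git check-attr filter -- script.py)"
echo "$notebook_filter"
echo "$script_filter"
```

Notebook paths should report the strip-notebook-output filter, while other files report the filter as unspecified.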