In this post I will share a method for version controlling code in Databricks. I will go over a simple Databricks/GitHub sync for personal projects; in the next post I will discuss a method for multi-branching. You will need:
- A Databricks account
- A GitHub account
Create a notebook in Databricks
Open a new notebook (or an existing one that you would like to version control). For the purposes of this post, I have just made a generic Python notebook called test_git.py. My code in test_git.py is the simplest Python script:
a = 1
b = 2
c = 3
print(a)
print(a + b)
Create a GitHub Repo
Create a new repo in GitHub and initialise it with a README.md. You will only be using the master branch.
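If you would rather script this step than click through the GitHub UI, the sketch below builds the equivalent request against the GitHub REST API. It assumes you already have a personal access token that is allowed to create repositories; the repo name databricks-vc-demo and the token placeholder are hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_create_repo_request(token, repo_name):
    """Build (but do not send) a request that creates a repo with an initial README."""
    payload = {
        "name": repo_name,
        "auto_init": True,  # ask GitHub to make an initial commit containing a README
    }
    return urllib.request.Request(
        url=f"{GITHUB_API}/user/repos",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


# Hypothetical values, for illustration only.
req = build_create_repo_request("<your-token>", "databricks-vc-demo")

# To actually create the repo, send the request:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["html_url"])
```

The default branch of the new repo depends on your GitHub settings, so check that it matches the branch you plan to sync with.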
Connect Databricks & GitHub
In the main Databricks UI, in the top right corner you will see a little person; hover and it will say ‘account’. Click on this and then select ‘User Settings’ and then head to the ‘Git Integration’ tab (as shown below).
Select GitHub as your ‘Git provider’. You will need to enter a Git token, which you can generate in the GitHub developer settings area. Once you have done this, your GitHub and Databricks accounts will be linked.
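If the link does not seem to work, it can help to check the token itself. The sketch below assumes a classic personal access token, for which GitHub reports the granted scopes in the X-OAuth-Scopes response header, and it assumes that the repo scope is the one you need for syncing. The helper names are my own.

```python
import urllib.request


def build_token_check_request(token):
    """Build (but do not send) a request to the authenticated-user endpoint."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={"Authorization": f"Bearer {token}"},
    )


def parse_scopes(header_value):
    """Split an X-OAuth-Scopes header value into a set of scope names."""
    if not header_value:
        return set()
    return {scope.strip() for scope in header_value.split(",") if scope.strip()}


def has_repo_scope(header_value):
    """Return True if the full 'repo' scope is among the granted scopes."""
    return "repo" in parse_scopes(header_value)


# To run the check against GitHub:
# with urllib.request.urlopen(build_token_check_request("<your-token>")) as resp:
#     print(has_repo_scope(resp.headers.get("X-OAuth-Scopes")))
```

A token that only carries a narrower scope such as public_repo is reported as lacking repo here, which is deliberate: the check is an exact match on the scope name.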
Sync Databricks Notebook with GitHub
Now open the notebook that you want to version control. In the top right, you will see some little icons; select the last one (highlighted below).
This will open up the Git Preferences box, where you can sync the notebook and git together.
By default, Databricks fills in the Git repo file path with the notebook's Databricks folder structure, but you will want to change it to match the path you want in GitHub (see below). If you are syncing to a file that already exists in the repo, make sure the file names match exactly, otherwise Databricks will just write a new file to the folder.
In this case, for the ‘Path in Git Repo’ I am going to create a folder called vc_code and put the test_git.py file inside it.
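Because a mismatched name silently creates a second file, it can be worth building the path in one place. This is a small sketch of how the value for ‘Path in Git Repo’ is put together from a folder and a notebook name; git_repo_path is a hypothetical helper, not part of Databricks.

```python
import posixpath


def git_repo_path(folder, notebook_name, extension=".py"):
    """Join a repo folder and a notebook name into a repo-relative file path."""
    # Add the extension only if the notebook name does not already end with it.
    if not notebook_name.endswith(extension):
        notebook_name += extension
    # Repo paths use forward slashes and have no leading or trailing slash.
    return posixpath.join(folder.strip("/"), notebook_name)


print(git_repo_path("vc_code", "test_git"))  # vc_code/test_git.py
```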
Press save and then head over to GitHub, where you should see your notebook 🙂 From now on, any changes you make to your code can be synced with GitHub. You can save by selecting the ‘Save Now’ option on the right hand side.
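When you open the synced file in GitHub, you will notice that it is not quite the plain script from the notebook: Databricks exports a Python notebook as a source file that starts with a "# Databricks notebook source" line and separates cells with "# COMMAND ----------" lines. The sketch below splits such a file back into its cells. The way the example is divided into two cells is an assumption made for illustration, and split_cells is a hypothetical helper.

```python
HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"


def split_cells(source):
    """Split an exported Databricks Python notebook into a list of cell bodies."""
    lines = source.splitlines()
    # Drop the marker line that identifies the file as a notebook export.
    if lines and lines[0].strip() == HEADER:
        lines = lines[1:]

    cells, current = [], []
    for line in lines:
        if line.strip() == CELL_SEPARATOR:
            cells.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    cells.append("\n".join(current).strip())

    # Ignore cells that contain only whitespace.
    return [cell for cell in cells if cell]


# Illustrative export of the notebook from this post, assumed to have two cells.
example = """# Databricks notebook source
a = 1
b = 2
c = 3

# COMMAND ----------

print(a)
print(a + b)
"""

cells = split_cells(example)
print(len(cells))  # 2
```

The marker lines are ordinary Python comments, so the exported file still runs as a normal script outside Databricks.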
If you are just doing version control for yourself, then you could stop here.
However, if you want to work with multiple users, I will discuss a simple methodology for collaboration in my next post.