
Version control with GitHub and Databricks

In this post I share a method for version controlling code in Databricks: a simple Databricks/GitHub sync for personal projects. In the next post I will discuss a method for multi-branching.

Pre-requisites:

  • A Databricks account
  • A GitHub account

Create a notebook in Databricks

Open a new notebook (or an existing one you would like to version control). For this post, I have made a generic Python notebook called test_git.py.

My code in test_git.py is the simplest Python script:

a = 1
b = 2
c = 3

print(a)
print(a + b)

Create a GitHub Repo

Create a new repo in GitHub and initialise it with a readme.md. You will only be using the default branch, which is master in this post (newer GitHub repos name it main).
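If you would rather script this step than click through the website, here is a minimal sketch using GitHub's REST API from Python. The repo name databricks-vc-demo and the placeholder token are made up for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_create_repo_request(token: str, repo_name: str) -> urllib.request.Request:
    """Build (but do not send) a request that creates a repo with a README."""
    payload = {
        "name": repo_name,
        "auto_init": True,  # ask GitHub to make an initial commit containing a README
    }
    return urllib.request.Request(
        url=f"{GITHUB_API}/user/repos",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )


# Hypothetical values: swap in your own token and repo name.
request = build_create_repo_request(token="<your-token>", repo_name="databricks-vc-demo")
print(request.get_method(), request.full_url)
```

To actually create the repo you would pass the request to urllib.request.urlopen. Creating it through the GitHub website, as described above, gives the same result.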

Connect Databricks & GitHub

In the main Databricks UI, in the top right corner you will see a little person icon; hover over it and it will say ‘Account’. Click on this, select ‘User Settings’ and then head to the ‘Git Integration’ tab.

Select GitHub as your ‘Git provider’. You will need to enter a personal access token, which you can generate in the Developer settings area of your GitHub account settings. Once you have done this, your GitHub and Databricks accounts will be linked.
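Before pasting the token into Databricks, it can be worth checking that it works. The sketch below is one way to do that from Python: it reads the token from an environment variable and builds a request to GitHub's "who am I" endpoint (GET /user), which returns your account details for a valid token and a 401 error otherwise. The variable name GITHUB_TOKEN and both helper functions are my own choices for illustration, not anything Databricks requires.

```python
import os
import urllib.request


def read_github_token(env_var: str = "GITHUB_TOKEN") -> str:
    """Read the token from an environment variable so it never sits in a notebook."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a GitHub personal access token first.")
    return token


def build_whoami_request(token: str) -> urllib.request.Request:
    """Build (but do not send) a request for the authenticated user's profile."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="GET",
    )


# Placeholder token so the example runs without credentials.
check = build_whoami_request("<your-token>")
print(check.get_method(), check.full_url)
```

With a real token you would call build_whoami_request(read_github_token()) and send it with urllib.request.urlopen.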

Sync Databricks Notebook with GitHub

Now open the notebook that you want to version control. In the top right, you will see a row of little icons; select the last one.

This will open up the Git Preferences box, where you can link the notebook to a Git repo.

By default, Databricks fills the ‘Path in Git Repo’ field with the notebook's Databricks folder structure, but you want to change it to match the path you want in GitHub. If the file already exists in the repo, ensure the names match exactly, otherwise Databricks will just write a new file to the folder.

In this case, for the ‘Path in Git Repo’ I am going to create a folder called vc_code and put the test_git.py file inside it.
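To make the path change concrete, here is a small Python sketch of the mapping: keep the notebook's file name, drop the Databricks folder structure, and place the file under the folder you want in the repo. The function name and the example workspace path (including the email address) are hypothetical.

```python
import posixpath


def path_in_git_repo(workspace_path: str, repo_folder: str) -> str:
    """Keep the notebook's file name, but place it under repo_folder."""
    file_name = posixpath.basename(workspace_path)
    if not file_name.endswith(".py"):
        file_name += ".py"  # Python notebooks are stored in the repo as .py files
    return posixpath.join(repo_folder.strip("/"), file_name)


# Hypothetical default path as Databricks might pre-fill it.
default_path = "/Users/someone@example.com/test_git.py"
print(path_in_git_repo(default_path, "vc_code"))  # vc_code/test_git.py
```

The file name is the part that has to match an existing file exactly if you want Databricks to update it rather than write a new one.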

Press save and then head over to GitHub, where you should see your notebook 🙂 Now any changes you make to your code will be synced with GitHub. You can save by selecting the ‘Save Now’ option on the right hand side.
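If you prefer to check from code rather than the browser, GitHub's REST API has a contents endpoint that returns a file's metadata for a given path and branch. The sketch below only builds the URL; the owner and repo names are placeholders, and the helper function is my own.

```python
from urllib.parse import quote


def contents_url(owner: str, repo: str, path: str, ref: str = "master") -> str:
    """Build the GitHub contents API URL for a file on a given branch."""
    return (
        f"https://api.github.com/repos/{quote(owner)}/{quote(repo)}"
        f"/contents/{quote(path)}?ref={quote(ref, safe='')}"
    )


# Placeholder owner and repo: replace with your own.
url = contents_url("your-user", "your-repo", "vc_code/test_git.py")
print(url)
```

Requesting that URL with your token should return JSON describing the file once the sync has happened, or a 404 if the path does not exist on that branch.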

If you are just doing version control for yourself, then you could stop here.

However, if you want to work with multiple users, I will discuss a simple methodology for collaboration in my next post.

3 thoughts on “Version control with GitHub and Databricks”

    1. Hi Helen,

      I’ve just published it 🙂 It was in my drafts for a while as I find the simple process a bit manual (but it works!). I will try and post a more automated solution soon!

      Josie
