databricks, github, version control

Multi-user Branching Approach using Databricks

In my previous post on version control using Databricks, we looked at how to link GitHub and Databricks. Following on from this I wanted to show a simple branching methodology that could work for a small team collaborating in the Databricks environment. This is for an environment where the users will not pull code down onto their own machine and are potentially new to git in general.

This is not a CI/CD pipeline and isn’t a large-scale solution. It’s also very manual, so I would recommend looking at something like Azure DevOps pipelines or even GitHub Actions (which I will discuss another time).

In this example, we will have two main branches: master and development. This is pretty standard practice. We are assuming that developers clone from and make pull requests to development, and then admins sync up development and master. I won’t be discussing branch policies or anything like that, but feel free to ask if you have any questions about how to set those up.

For every feature a new branch will be created by the developer. This is called feature branching.

You will need to read the previous post as we will start where we left off; with our master code folder.

Create a Development Branch

Now we are also going to add a branch for development code in Databricks. To do this, clone a copy of the master file to a development folder in Databricks. Ensure the filename is kept the same for consistency.

Sync this with a new branch called development in GitHub (in the same way you synced the master file). This time, however, under ‘Branch’ enter development and change the ‘Path in Git Repo’ to match the master one (highlighted).

Create a Feature Branch

To create a feature branch, a developer will clone a copy of the code from the development folder into their own workspace. In Databricks, you have the option to clone a notebook by selecting the tiny arrow next to the notebook. It will give you the option of where you want to put it.

Make a new folder for the feature development (e.g. jo_change_c_value) and clone the notebook into it.

Once the notebook has been cloned, we sync it with a new feature branch in the same way as shown in the image above. In this example, I create a new branch under ‘Git Preferences’ called ‘jo_change_c_value’ and sync my notebook.

I’m then going to change the code and set the c value equal to 4.

a=1
b=2
c=4

print(a)
print(a+b)

Click ‘Save’ on the right-hand side; you will be asked whether you also want to commit the changes to GitHub. Save the revision.

Make a PR to Development branch

You will see the change in GitHub.

Click on ‘Compare & pull request’ and you will see an overview of the changes. In this example, I am asking to merge the changes into the development branch. This is just good practice; if you only have a master branch, then ask to merge there instead.

Press the green button that says ‘Create pull request’.

You will then see the following page.

Click ‘Merge pull request’.

Now head to your development notebook in Databricks.

This is a bit odd, but your changes will not immediately be visible (I’m not sure why Databricks does this… if anyone knows, let me know). On the right-hand side in the revision history, you will see that the second record from the top says ‘Commit’ followed by a number.

Click on that ‘Commit’ entry and you will see your newly pushed code.

Make a PR to master branch

To sync up the development and master branches, just follow the same process as above. As mentioned, ideally you would have some policies in place so that there is some sort of peer review process on PRs. However, that’s outside the scope of this post.

Databricks Version Control summary

Hopefully I have shown you how to version control your Databricks code, both for a single-branch personal workflow and for a multi-user branching approach. In my honest opinion, it’s not the most user-friendly setup and there is room for human error (especially with the manual entry of folder paths). Further, it’s not very automated, which can be frustrating.

A better option would be to employ a CI/CD approach. I’ll create a follow-up post on this.

kubernetes

AKS (Azure Kubernetes Service) through the Azure CLI

You can deploy AKS using the azure-cli. Here is a quick tutorial on how to do it!

First off, ensure you have the azure-cli installed on your machine. If you have a Mac, it’s pretty simple; you can just do:

brew install azure-cli

If you are using Windows or Linux then follow the instructions on the Microsoft page.

Log in to your Azure account using:

az login

Now we are going to create some variables for location, resource and cluster that we will use to set up our Kubernetes cluster:

export LOCATION=uksouth
export RESOURCE_GROUP=aks-project-group
export CLUSTER_NAME=josiemundi-cluster

Now we create a resource group for our cluster:

az group create --name $RESOURCE_GROUP --location $LOCATION

Now let’s deploy our cluster:

az aks create --resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--generate-ssh-keys \
--node-vm-size Standard_DS2_v2

You can now head over to the Azure portal and, within the resource group we set up, you should see a Kubernetes service. If you click on it, you will see that it is in the process of deploying (or maybe already has).
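
If you would rather stay in the terminal, you can check the provisioning state of the cluster with something like the following (a convenience check; the --query path assumes the standard AKS resource output):

az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --query provisioningState -o tsv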

We now need to get the credentials for the cluster so that kubectl can connect to it:

az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --admin

Now we can see the nodes in our cluster with the following command:

kubectl get nodes

The Microsoft documentation has a full list of the CLI commands available for AKS.
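
When you are finished experimenting, remember that the cluster costs money while it is running. If the resource group contains nothing else you want to keep, you can tidy everything up with:

az group delete --name $RESOURCE_GROUP --yes --no-wait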

azure, data science, machine learning in production

Building a Data Science Environment in Azure: Part 1

For the last few months, I have been looking into how to create a Data Science environment within Azure. There are multiple ways to approach this and it depends on the size and needs of your team. This is just one way in a space where there are many others (e.g using Databricks).

Over the next few months, I will be running a few posts about how to get this kind of environment up and running.

First off, let’s mention some reasons why you might be looking to set up a Data Science sandbox in Azure rather than on premise.

Reasons why:

  • On-prem machines too slow.
  • Inappropriate (or no) tooling on-prem (and not fast enough deployment of relevant tools to local machines).
  • Slow IT process to request increased compute.
  • On-prem machines are underutilised.
  • Different needs per user in the team. One person may be running some heavy calculations, whilst someone else just runs some small weekly reports.
  • Lack of collaboration within company (perhaps cross department or even regional).
  • No clear process for getting models into production.

Once you have clarified the why, you can start to shape the high level requirements of your environment. Key requirements of our Data Science Sandbox could be:

  • Flexibility – Enable both IT to have control but also the data scientists to have choice.
  • Freedom – Enable data scientists by giving them the freedom to work with the tools they feel most confident with.
  • Collaboration – Encourage collaboration, sharing of methods and the ability to re-use and improve models across the business.

You will want to think about who your users are, what tools they are currently using and also what they want to use going forward.

At this point, you might do a little scribble on a piece of paper to define what this might look like in principle. Here is my very simple overview of what we are going to be building over the next few posts. I’ve taken inspiration from a number of Microsoft’s own process diagrams.

Let’s take a look in more detail at the above process.

  1. We have our Data Science sandbox, which is where the model build takes place. The lab has access to production data, but may also need to make API calls, or users may want to access their own personal files located in blob storage, etc. This component is composed of a number of labs (via Dev Test Labs), which could be split by team, subject area, etc.
  2. Once we have a model we would like to move to production, the model is version controlled, containerised and deployed via Kubernetes. This falls under the Data Ops activity.
  3. The model is served in a production environment, where we capture the inputs and monitor the performance of our model. For now, I have this as ML Service, but you could also use MLflow or Kubeflow.
  4. This feeds back into the model, which can be retrained if necessary and the process starts again.

The main technology components proposed are:

  • Dev Test Labs
  • Docker
  • Kubernetes
  • ML Service

In the next post, we will start setting up our Data Science environment. We will start by looking at setting up Dev Test Labs in Azure.

data science, python

Text mining: NLTK suite for Python

Today we are going to take a quick look at the NLTK suite for Python.

We could use NLTK for situations where we need to handle human language.

Things like:

  • Customer complaints classification
  • Sentiment analysis
  • Chatbot development
  • Insurance claim description analysis
  • Scanning candidate CVs

In this post, we will start with a large chunk of text (taken from the NLTK Wikipedia page) and then clean it, split it into substrings and then plot the frequency of each word.

First off, we need to import the relevant libraries and packages that we will be using:

#import relevant libraries and packages

import re
import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer

Next we need to create a new object, which is our text from Wikipedia.

#Text below is taken from the NLTK page on Wikipedia. 

my_text = """The Natural Language Toolkit, or more commonly NLTK,
is a suite of libraries and programs for symbolic and statistical
natural language processing (NLP) for English written in the Python
programming language. It was developed by Steven Bird and Edward Loper
in the Department of Computer and Information Science at the University
of Pennsylvania. NLTK includes graphical demonstrations and sample data.
It is accompanied by a book that explains the underlying concepts behind
the language processing tasks supported by the toolkit, plus a cookbook."""

I have assigned it to the variable name my_text.

Next we want to replace any newline notation with a space, so it won’t show \n. For this we use the re module, which enables us to use regular expressions. I have also made everything lowercase, as otherwise it would count ‘Language’ and ‘language’ as two different words.

#substitute \n newline within 'my_text' with a space and assign this to the 'document' object
doc = re.sub('\n', ' ', my_text)
document = doc.lower()

We can use the nltk tokenizer to divide the text up into individual words or even sentences. For example, if I wanted to divide it into sentences I could do:

sentences = nltk.sent_tokenize(document)
print(sentences)

This returns a list of the individual sentences.

We are going to divide the string into individual words so that we can plot the frequency of each word. First we want to remove stop words, otherwise our most common word is going to be something like ‘of’, which isn’t so helpful.

NLTK already has a dictionary of stop words that we can use.

stop_words = stopwords.words('english')

In this next step, we are going to keep only those words that are not in the stop words variable. First, we need to use the tokenizer to divide our string into individual words. We are also going to remove any punctuation marks, otherwise these will also be classed as words.

tokenizer = RegexpTokenizer(r'\w+')
my_words = tokenizer.tokenize(document)

We are then going to create an empty list and call it my_words_ns. Then we loop through each word in my_words and, for each one that is not found in the stop words, append it to my_words_ns.

my_words_ns = []

for word in my_words:
    if word not in stop_words:
        my_words_ns.append(word)
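
As a side note, the same filtering can be written as a one-line list comprehension, which gives exactly the same result:

#equivalent to the loop above: keep only the words that are not stop words
my_words_ns = [word for word in my_words if word not in stop_words]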

NLTK has its own frequency distribution function, which we can then use to plot the frequency of each word. Let’s apply it to our list of words.

freqDist = FreqDist(my_words_ns)

You can get the frequency of a specific word like this:

print(freqDist["language"])

Now let’s plot the top ten words:

my_plot = freqDist.plot(10)

And there you have it. This can then be built out in order to inform certain business processes. Next time, we will use things like gensim and Microsoft Cognitive Services to explore what we can achieve by harnessing the power of machine learning.

azure, data bricks, databricks, version control

Version control with GitHub and Databricks

In this post I thought I would share a method for version controlling code in Databricks. I will go over a simple Databricks/GitHub sync for personal projects. In the next post I will discuss a method for multi-branching.

Pre-requisites:

  • A Databricks account
  • A GitHub account

Create a notebook in Databricks

Open a new notebook (or alternatively something you would like to version control). For the purposes of this, I have just made a generic Python notebook called test_git.py.

My code in test_git.py is the simplest Python script:

a=1
b=2
c=3

print(a)
print(a+b)

Create a GitHub Repo

Create a new repo in GitHub and initialise it with a readme.md. You will only be using the master branch.

Connect Databricks & GitHub

In the main Databricks UI, in the top right-hand corner you will see a little person icon; hover over it and it will say ‘Account’. Click on this, select ‘User Settings’ and then head to the ‘Git Integration’ tab (as shown below).

Select GitHub as your ‘Git provider’. You will need to enter a git token, which you can generate in the GitHub developer settings area. Once you have done this, your GitHub and Databricks account will be linked.

Sync Databricks Notebook with GitHub

Now open the notebook that you want to version control. In the top right, you will see some little icons; select the last one (highlighted below).

This will open up the Git Preferences box, where you can sync the notebook and git together.

By default Databricks will put the Databricks folder structure in the Git repo file path, but you want to change it to match the one in GitHub (see below). Ensure the files have the same name, otherwise it will just write a new file to the folder.

In this case, for the ‘Path in Git Repo’ I am going to create a folder called vc_code and put the test_git.py file inside it.

Press save and then head over to GitHub; you should see your notebook 🙂 From now on, any changes you make to your code can be committed to GitHub by selecting the ‘Save Now’ option on the right-hand side.

If you are just doing version control for yourself, then you could stop here.

However, if you want to work with multiple users, I will discuss a simple methodology for collaboration in my next post.

azure, data bricks, databricks, python

Event Hub Streaming Part 2: Reading from Event Hub using Python

In part two of our tutorial, we will read back the events that we streamed into our Event Hub in part 1. To read from a live stream, you will need to start the streaming code and make sure you are sending more than ten messages (otherwise your stream will have stopped by the time you start reading :)). It will still work either way, though.

So the code is pretty much along the same lines, same packages etc. Let’s take a look.

Import the libraries we need:

import os
import sys
import logging
import time
from azure.eventhub import EventHubClient, Receiver, Offset

Set the connection properties to Event Hub:

ADDRESS = "amqps://<namespace.servicebus.windows.net/<eventhubname>"
USER = "<policy name>"
KEY = "<primary key>"
CONSUMER_GROUP = "$default"
OFFSET = Offset("-1")
PARTITION = "0"

This time I am using my listening policy for USER instead of my sending policy.

Next we are going to take the events from the Event Hub and print each json transaction message. I will try to go through offsets in a bit more detail another time, but for now this will listen and return back your events.

total = 0
client = EventHubClient(ADDRESS, debug=False, username=USER, password=KEY)
try:
    # add a receiver for our consumer group and partition, starting from OFFSET
    receiver = client.add_receiver(CONSUMER_GROUP, PARTITION, prefetch=5000, offset=OFFSET)
    client.run()
    start_time = time.time()
    batch = receiver.receive(timeout=5000)
    while batch:
        for event_data in batch:
            # print the offset and sequence number of each event, then its body
            print("Received: {}, {}".format(event_data.offset.value, event_data.sequence_number))
            print(event_data.body_as_str())
            total += 1
        batch = receiver.receive(timeout=5000)

finally:
    # always stop the client, and record how long the receive loop ran
    end_time = time.time()
    client.stop()
    run_time = end_time - start_time

And voila! You now know how to stream to and read from Azure Event Hub using Python 🙂

Let me know if you have any questions!

azure, data bricks, databricks, python

Event Hub Streaming Part 1: Stream into Event Hub using Python

In this session we are going to look at how to stream data into Event Hub using Python.

We will be connecting to the blockchain.info websocket and streaming the transactions into an Azure Event Hub. This is a really fun use case that is easy to get up and running.

Prerequisites:

  • An Azure subscription
  • An Azure Event Hub
  • Python (Jupyter, or Databricks, which I am using in this example)

You will need the following libraries installed on your Databricks cluster:

  • websocket-client (PyPi)
  • azure-eventhub (PyPi)

In this example, I am setting it to only stream in a few events, but you can change it to keep streaming or stream more events in.

First of all we need to import the various libraries we are going to be using.

import os
import sys
import logging
import time
from azure import eventhub
from azure.eventhub import EventHubClient, Receiver, Offset, EventData
from websocket import create_connection

Then we need to set the connection properties for our Event Hub:

ADDRESS = "amqps://<namespace>.servicebus.windows.net/<eventhubname>"
USER = "<policy name>"
KEY = "<primary key>"
CONSUMER_GROUP = "$default"
OFFSET = Offset("-1")
PARTITION = "0"

The user is the policy name, which you set for your event hub under the ‘shared access policies’ area. I usually create one policy for sending and one for listening.

I will go into the offset and partitioning in more detail another time. For now, don’t worry about these; just use the values above.

Next we need to connect to the blockchain.info websocket. We send it the message that starts the stream.

ws = create_connection("wss://ws.blockchain.info/inv")
ws.send('{"op":"unconfirmed_sub"}')

In this code we are only going to pull a handful of messages from the websocket before the loop breaks, but you can change the condition to i > 100 (or more), or even remove that check and just keep streaming.

try:
    if not ADDRESS:
        raise ValueError("No EventHubs URL supplied.")
 
    # Create Event Hubs client
    client = EventHubClient(ADDRESS, debug=False, username=USER, password=KEY)
    sender = client.add_sender(partition="0")
    client.run()
    
    i = 0
    
    start_time = time.time()
    try:
        while True:
            sender.send(EventData(ws.recv()))
            print(i)
            if i > 10:
                break
            i = i + 1
    except:
        raise
    finally:
        end_time = time.time()
        client.stop()
        run_time = end_time - start_time

except KeyboardInterrupt:
    pass

In Part 2, we look at how to read these events back from the Event Hub.

docker

How to: Docker Swarm

This tutorial will show you how to get your first Docker swarm up and running. In my example, I am using two Ubuntu machines, one will be the master and one will be the worker.

Install Docker Community Edition on the machines.

Follow the instructions on the Docker website in order to install Docker CE for Ubuntu.

Check Docker is installed by running:

docker --version

Install Docker machine:

Docker machine will allow us to install a docker engine on our machines and manage them using docker-machine commands. You can find out more about Docker machine on their website.

curl -L https://github.com/docker/machine/releases/download/v0.16.0/docker-machine-`uname -s`-`uname -m` >/tmp/docker-machine &&
chmod +x /tmp/docker-machine &&
sudo cp /tmp/docker-machine /usr/local/bin/docker-machine

I have installed docker machine on both of the hosts.

On both servers, we need to change the hosts file via the command line. You will need to add two lines to the end of the file. The easiest way to do this is with the vim text editor, which allows you to edit the file from the command line. We need to add the IP address of each machine and indicate which is the manager and which is the worker.

If you have additional workers then you can add worker02, worker03 etc.

sudo vim /etc/hosts

10.0.16.5    manager
10.0.16.4    worker01

We now need to run the following from the manager machine:

sudo docker swarm init --advertise-addr 10.0.16.5 

We will get a response back containing a docker swarm join command, which we then need to run on the worker machine. It will look something like this (example from Docker’s own swarm tutorial):

To add a worker to this swarm, run the following command:

    docker swarm join \
    --token SWMTKN-1-49nj1cmql0jkz5s954yi3oex3nedyz0fb0xx14ie39trti4wxv-8vxv8rssmk743ojnwacrr2e7c \
    192.168.99.100:2377

Once you have done this your swarm will be up and running. If you run:

docker info

you will be able to see the details of your swarm, including how many nodes, managers, containers etc.
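
From the manager you can also list the nodes in the swarm and see which one is the leader:

docker node ls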

Here are some useful links on Docker swarm that I used to get mine up and running:

https://docs.docker.com/engine/swarm/

https://www.howtoforge.com/tutorial/ubuntu-docker-swarm-cluster/

Next time we will look at getting something up and running in docker swarm mode.

docker, kubernetes, machine learning in production

Deploying an ML model in Kubernetes

A while back I started looking into how to deploy and scale Machine Learning models. I was recommended the book Machine Learning Logistics by Ted Dunning and Ellen Friedman and started to look into their proposed method of deployment. So far, I have only got as far as containerisation and orchestration; however, there is still a whole lot more to do 🙂

I thought I would offer an easy tutorial to get started if you want to try this out. I’m not going to talk about a production-ready solution, as this would need a fair bit of refinement.

There are various options for doing this (feel free to let me know what you might be implementing) and this is just one possible approach. I guess the key is to do what fits best with your workflow process. 

All of the code is on GitHub, so if you want to follow along then head there for a more detailed run through (including all code and commands to run etc). I’m not going to put it all in this post as it would be very long 🙂 

For this project I decided to run everything from the Azure DSVM (Data Science Virtual Machine). However, you can run it locally from your own machine. I used a machine with the following spec:

Standard B2ms (2 vcpus, 8 GB memory) (Linux)

You will need:

  • Jupyter Notebooks (already on the DSVM)
  • Docker (already on the DSVM)
  • A Docker hub account
  • An Azure account with AKS

Building the model

I won’t go much into the model code but basically I built a simple deep learning model using Keras and the open source wine dataset. The model was created by following this awesome tutorial from DataCamp!

I followed the tutorial step by step and then saved the model. Keras has its own save function, which is recommended over using pickle. You need to save both the model and the scaler, because we will need the scaler to normalise the incoming data in the flask app.
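
As a rough sketch of that saving step (the file names and the use of joblib for the scaler are my own choices here, not necessarily what the GitHub repo does; model and scaler are the trained Keras model and fitted scaler from the tutorial):

#save the trained Keras model to a single file
model.save('wine_model.h5')

#save the fitted scaler so the flask app can normalise incoming values in the same way
import joblib
joblib.dump(scaler, 'scaler.pkl')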

Building a Web app using Flask and Containerising it

If you are using the DSVM, then under the ‘Networking’ options you need to add another entry to the ‘Inbound Port Rules’: add port 5000. This is the port our flask app will run on.

For building the docker container, I used this easy to follow ‘Hello Whale’ tutorial by Codefresh as a reference. 

I built a simple flask app, which predicts red or white wine using some sliders to set values for the attributes available in the dataset. As I mentioned, the code for the app is on GitHub. It’s not the prettiest app; feel free to beautify it 🙂
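
To give a feel for the shape of it, here is a stripped-down sketch of the prediction part (this is not the actual app from the repo; the feature handling, file names and the red/white mapping are all simplifications and assumptions on my part):

import numpy as np
import joblib
from flask import Flask, request
from keras.models import load_model

app = Flask(__name__)

#load the model and scaler that were saved after training
model = load_model('wine_model.h5')
scaler = joblib.load('scaler.pkl')

@app.route('/', methods=['GET', 'POST'])
def predict():
    if request.method == 'POST':
        #read the slider values from the form; in the real app these are read
        #in a fixed order that matches the training columns
        features = [float(value) for value in request.form.values()]
        #scale the values the same way as the training data, then predict
        scaled = scaler.transform(np.array(features).reshape(1, -1))
        prediction = model.predict(scaled)
        #which class maps to red vs white depends on how the labels were encoded
        return 'Red wine' if prediction[0][0] > 0.5 else 'White wine'
    #in the real app this renders a page of sliders; keeping it minimal here
    return 'POST the wine attributes to this endpoint to get a prediction'

if __name__ == '__main__':
    #listen on all interfaces so the app is reachable on port 5000
    app.run(host='0.0.0.0', port=5000)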

You will also need to create a Dockerfile and a requirements.txt file (both in the GitHub repo linked above). The Dockerfile contains the commands needed to build the image and the requirements.txt file contains all of the components that your app needs in order to run. 

You will need to make a folder called flask-app and inside place your app.py file, your Dockerfile and your requirements.txt file. 
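
As a rough idea of what the Dockerfile might contain (the real one is in the repo; the Python base image version here is just an example):

FROM python:3.6

#install the dependencies listed in requirements.txt
COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install -r requirements.txt

#copy the rest of the app code into the image
COPY . /app

#the flask app listens on port 5000
EXPOSE 5000
CMD ["python", "app.py"]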

Navigate via the cli to the flask-app folder and then run the following command:

docker build -t flask-app:latest .

Now to run the container you need to do:

docker run -d -p 5000:5000 flask-app

If you want to stop a docker container then you can use the command:

docker stop <container_name>

Be sure to use the name of the container and not the image name, otherwise it won’t stop. Docker assigns its own weird and wonderful names unless you specify one using the --name flag.

Upload the image to Docker hub

You will need a Docker Hub account to do this. Log in to Docker Hub using the following command:

docker login --username username

You will then be prompted to enter your password. Then run the following commands to tag and push the image into the repo.

docker tag <your image id> <your docker hub username>/<repo name>

docker push <your docker hub name>/<repo name>

We now have our image available in the Docker hub repo.

Deploying on Kubernetes

For this part I used Azure’s AKS service. It simplifies the Kubernetes process and (for the spec I had) costs a few pounds a day. It has a dashboard UI that is launched in the browser, which lets you easily see your deployments and from there you can do most of the stuff you can do from the cli. 

I set a low spec cluster for Kubernetes:

Standard B2s (2 vcpus, 4 GB memory) and with only 1 node (you can scale it down from the default 3). 

Now let’s deploy from the Docker Hub image.

Log in to your AKS cluster with the following command:

az aks get-credentials --resource-group <your resource group> --name <your aks cluster>

Pull the image and create a container:

kubectl run wine-app --image=josiemundi/flask-app:latest --port 5000

If you type:

kubectl get pods

You can see the status of your pod. Pods are the smallest deployable unit in Kubernetes and are what Kubernetes groups containers into. In this case, our container is alone in its pod. It can take a couple of minutes for a pod to get up and running.
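
If the pod seems stuck, you can watch it come up or inspect its events (replace the pod name with whatever kubectl get pods shows):

kubectl get pods -w

kubectl describe pod <your pod name>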

Expose the app so we get an external ip address for it:

kubectl expose deployment wine-app --type=LoadBalancer --port 80 --target-port 5000

You can check the status of the external ip by using the command:

kubectl get service

This can also take a couple of minutes. Once you have an external ip you can head on over to it and see your app running! 

To delete your deployment use:

kubectl delete deployment <name of deployment>