azure, data science, machine learning in production

Building a Data Science Environment in Azure: Part 1

For the last few months, I have been looking into how to create a Data Science environment within Azure. There are multiple ways to approach this, and the right choice depends on the size and needs of your team. This is just one way in a space where there are many others (e.g. using Databricks).

Over the next few months, I will be running a few posts about how to get this kind of environment up and running.

First off, let’s mention some reasons why you might be looking to set up a Data Science sandbox in Azure rather than on-premises.

Reasons why:

  • On-prem machines too slow.
  • Inappropriate (or no) tooling on-prem (and not fast enough deployment of relevant tools to local machines).
  • Slow IT process to request increased compute.
  • On-prem machines are under-utilised.
  • Different needs per user in the team. One person may be running some heavy calculations, whilst someone else just runs some small weekly reports.
  • Lack of collaboration within the company (perhaps cross-department or even cross-region).
  • No clear process for getting models into production.

Once you have clarified the why, you can start to shape the high level requirements of your environment. Key requirements of our Data Science Sandbox could be:

  • Flexibility – Give IT control while still giving the data scientists choice.
  • Freedom – Enable data scientists by giving them the freedom to work with the tools they feel most confident with.
  • Collaboration – Encourage collaboration, the sharing of methods, and the ability to re-use and improve models across a business.

You will want to think about who your users are, what tools they are currently using and also what they want to use going forward.

At this point, you might do a little scribble on a piece of paper to define what this might look like in principle. Here is my very simple overview of what we are going to be building over the next few posts. I’ve taken inspiration from a number of Microsoft’s own process diagrams.

Let’s take a look in more detail at the above process.

  1. We have our Data Science sandbox, which is where the model build takes place. The lab has access to production data, but users may also need to make API calls or access their own personal files located in Blob Storage, etc. This component is composed of a number of labs (via Dev Test Labs), which could be split by team, subject area, etc.
  2. Once we have a model we would like to move to production, the model is version controlled, containerised and deployed via Kubernetes. This falls under the Data Ops activity.
  3. The model is served in a production environment; we take the inputs and monitor the performance of our model. For now, I have this as ML Service, but you could also use MLflow or Kubeflow.
  4. This feeds back into the model, which can be retrained if necessary and the process starts again.

The main technology components proposed are:

  • Dev Test Labs
  • Docker
  • Kubernetes
  • ML Service

In the next post, we will start setting up our Data Science environment. We will start by looking at setting up Dev Test Labs in Azure.

azure, data bricks, databricks, version control

Version control with GitHub and Databricks

In this post I thought I would share a method for version controlling code in Databricks. I will go over a simple Databricks/GitHub sync for personal projects. In the next post I will discuss a method for multi-branching.

Pre-requisites:

  • A Databricks account
  • A GitHub account

Create a notebook in Databricks

Open a new notebook (or alternatively something you would like to version control). For the purposes of this, I have just made a generic Python notebook called test_git.py.

My code in test_git.py is the simplest Python script:

a=1
b=2
c=3

print(a)
print(a+b)

Create a GitHub Repo

Create a new repo in GitHub and initialise it with a README.md. You will only be using the master branch.

Connect Databricks & GitHub

In the main Databricks UI, in the top right corner you will see a little person; hover and it will say ‘account’. Click on this and then select ‘User Settings’ and then head to the ‘Git Integration’ tab (as shown below).

Select GitHub as your ‘Git provider’. You will need to enter a Git token, which you can generate under ‘Personal access tokens’ in the GitHub Developer settings area. Once you have done this, your GitHub and Databricks accounts will be linked.

Sync Databricks Notebook with GitHub

Now open the notebook that you want to version control. In the top right, you will see some little icons; select the last one (highlighted below).

This will open up the Git Preferences box, where you can sync the notebook and git together.

By default, Databricks will put its own folder structure in the ‘Path in Git Repo’ field, but you want to change it to match the structure of your GitHub repo (see below). Ensure the file has the same name, otherwise Databricks will just write a new file to the folder.

In this case, for the ‘Path in Git Repo’ I am going to create a folder called vc_code and put the test_git.py file inside it.

Press Save and then head over to GitHub; you should see your notebook 🙂 Now any changes you make to your code will be synced with GitHub. You can save by selecting the ‘Save Now’ option on the right-hand side.

If you are just doing version control for yourself, then you could stop here.

However, if you want to work with multiple users, I will discuss a simple methodology for collaboration in my next post.

azure, data bricks, databricks, python

Event Hub Streaming Part 2: Reading from Event Hub using Python

In part two of our tutorial, we will read back the messages that we streamed into our Event Hub in part 1. For a real stream, you will need to start the streaming code and ensure that you are sending more than ten messages (otherwise your stream will have stopped by the time you start reading :)). It will still work either way.

The code follows much the same lines as part 1: same packages and so on. Let’s take a look.

Import the libraries we need:

import os
import sys
import logging
import time
from azure.eventhub import EventHubClient, Offset

Set the connection properties to Event Hub:

ADDRESS = "amqps://<namespace>.servicebus.windows.net/<eventhubname>"
USER = "<policy name>"
KEY = "<primary key>"
CONSUMER_GROUP = "$default"
OFFSET = Offset("-1")
PARTITION = "0"

This time I am using my listening policy for USER instead of my sending policy.

Next we are going to take the events from the Event Hub and print each JSON transaction message. I will try to go through offsets in more detail another time, but for now this will listen and return your events.

total = 0
client = EventHubClient(ADDRESS, debug=False, username=USER, password=KEY)
try:
    receiver = client.add_receiver(CONSUMER_GROUP, PARTITION, prefetch=5000, offset=OFFSET)
    client.run()
    start_time = time.time()
    batch = receiver.receive(timeout=5000)
    while batch:
        for event_data in batch:
            # Print each event's offset and sequence number, then the message body
            print("Received: {}, {}".format(event_data.offset.value, event_data.sequence_number))
            print(event_data.body_as_str())
            total += 1
        batch = receiver.receive(timeout=5000)

    end_time = time.time()
    run_time = end_time - start_time
finally:
    client.stop()
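The run_time captured above is never printed, but it can be turned into a quick throughput figure. A small aside, using made-up numbers in place of the values collected by the receiver loop:

```python
# Hypothetical values standing in for those collected by the receiver loop
total = 120        # events received
run_time = 4.8     # seconds, i.e. end_time - start_time

rate = total / run_time
print("Received {} events in {:.1f}s ({:.1f} events/sec)".format(total, run_time, rate))
# → Received 120 events in 4.8s (25.0 events/sec)
```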

And voila! You now know how to stream to and read from Azure Event Hub using Python 🙂

Let me know if you have any questions!

azure, data bricks, databricks, python

Event Hub Streaming Part 1: Stream into Event Hub using Python

In this session we are going to look at how to stream data into an Azure Event Hub using Python.

We will be connecting to the blockchain.info websocket and streaming the transactions into an Azure Event Hub. This is a really fun use case that is easy to get up and running.

Prerequisites:

  • An Azure subscription
  • An Azure Event Hub
  • Python (Jupyter, or Databricks, which I am using in this example)

You will need the following libraries installed on your Databricks cluster (note that this walkthrough uses the v1-era azure-eventhub SDK; the current v5 SDK has a different API):

  • websocket-client (PyPI)
  • azure-eventhub (PyPI)

In this example, I am setting it to only stream in a few events, but you can change it to keep streaming or stream more events in.

First of all we need to import the various libraries we are going to be using.

import os
import sys
import logging
import time
from azure.eventhub import EventHubClient, EventData, Offset
from websocket import create_connection

Then we need to set the connection properties for our Event Hub:

ADDRESS = "amqps://<namespace>.servicebus.windows.net/<eventhubname>"
USER = "<policy name>"
KEY = "<primary key>"
CONSUMER_GROUP = "$default"
OFFSET = Offset("-1")
PARTITION = "0"

The user is the policy name, which you set for your event hub under the ‘shared access policies’ area. I usually create one policy for sending and one for listening.

The offset and partitioning I will go into more detail another time. For now, don’t worry about these, just add the values above.

Next we need to connect to the blockchain.info websocket. We send it the message that starts the stream.

ws = create_connection("wss://ws.blockchain.info/inv")
ws.send('{"op":"unconfirmed_sub"}')
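The messages that come back over this websocket are JSON strings. As a quick sanity check before streaming them on, you can parse one with the standard json module. The sample below is a made-up, trimmed-down message shaped like the unconfirmed_sub feed (field names from the public feed; the values are invented for illustration):

```python
import json

# Invented sample shaped like a blockchain.info "unconfirmed_sub" message
sample = '{"op": "utx", "x": {"hash": "abc123", "out": [{"value": 150000}, {"value": 50000}]}}'

msg = json.loads(sample)

# Output values are in satoshi; 100,000,000 satoshi = 1 BTC
total_satoshi = sum(o["value"] for o in msg["x"]["out"])
print(msg["x"]["hash"], total_satoshi / 1e8)  # → abc123 0.002
```

In the real stream you would call json.loads(ws.recv()) on each message instead of using a hard-coded sample.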

Now, this code will only send a dozen or so messages (the loop breaks once i passes 10), but you can change the condition to i > 100 (or more), or even remove that part and just keep going.

try:
    if not ADDRESS:
        raise ValueError("No EventHubs URL supplied.")

    # Create the Event Hubs client and a sender on partition 0
    client = EventHubClient(ADDRESS, debug=False, username=USER, password=KEY)
    sender = client.add_sender(partition="0")
    client.run()

    i = 0

    start_time = time.time()
    try:
        # Forward each websocket message to the Event Hub,
        # stopping once i passes 10
        while True:
            sender.send(EventData(ws.recv()))
            print(i)
            if i > 10:
                break
            i = i + 1
    except:
        raise
    finally:
        end_time = time.time()
        client.stop()
        ws.close()
        run_time = end_time - start_time

except KeyboardInterrupt:
    pass

In Part 2, we look at how to read these events back from the Event Hub.