For the last few months, I have been looking into how to create a Data Science environment within Azure. There are multiple ways to approach this, and the right one depends on the size and needs of your team. This is just one approach among many (e.g. using Databricks).
Over the next few months, I will be running a few posts about how to get this kind of environment up and running.
First off, let’s mention some reasons why you might want to set up a Data Science sandbox in Azure rather than on-premises.
- On-prem machines too slow.
- Inappropriate (or no) tooling on-prem, and relevant tools are deployed to local machines too slowly.
- Slow IT process to request increased compute.
- On-prem machines are underutilised.
- Different needs per user in the team. One person may be running some heavy calculations, whilst someone else just runs some small weekly reports.
- Lack of collaboration within the company (perhaps cross-department or even cross-region).
- No clear process for getting models into production.
Once you have clarified the why, you can start to shape the high-level requirements of your environment. Key requirements of our Data Science sandbox could be:
- Flexibility – Give IT control while still giving the data scientists choice.
- Freedom – Give data scientists the freedom to work with the tools they feel most confident with.
- Collaboration – Encourage collaboration, the sharing of methods, and the ability to re-use and improve models across the business.
You will want to think about who your users are, what tools they are currently using and also what they want to use going forward.
At this point, you might do a little scribble on a piece of paper to define what this might look like in principle. Here is my very simple overview of what we are going to be building over the next few posts. I’ve taken inspiration from a number of Microsoft’s own process diagrams.
Let’s take a look in more detail at the above process.
- We have our Data Science sandbox, which is where the model build takes place. The sandbox has access to production data, but it may also need to make API calls, and users may want to access their own personal files held in Blob storage. This component is made up of a number of labs (via Dev Test Labs), which could be split by team, subject area, etc.
- Once we have a model we would like to move to production, the model is version-controlled, containerised and deployed via Kubernetes. This falls under the Data Ops activity.
- The model is served in a production environment, where we capture its inputs and monitor its performance. For now, I have this as ML Service, but you could also use MLflow or Kubeflow.
- This feeds back into the model, which can be retrained if necessary, and the process starts again.
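To make the loop above a little more concrete, here is a minimal sketch in plain Python. It does not call any Azure, Kubernetes or ML Service API. Every name in it (`ModelVersion`, `Monitor`, `promote`, `retrain`, the stage names and the averaged-score threshold) is a hypothetical stand-in I made up for illustration, and a real setup would likely track this state in a model registry instead.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import List

# Hypothetical stage names mirroring the process described above:
# built in the sandbox, then versioned/containerised, then served.
STAGES = ["sandbox", "packaged", "production"]


@dataclass
class ModelVersion:
    name: str
    version: int
    stage: str = "sandbox"


@dataclass
class Monitor:
    """Collects performance scores for a served model (illustrative only)."""
    threshold: float
    scores: List[float] = field(default_factory=list)

    def record(self, score: float) -> None:
        self.scores.append(score)

    def needs_retraining(self) -> bool:
        # Assumption: retrain when the average score drops below the threshold.
        return bool(self.scores) and mean(self.scores) < self.threshold


def promote(model: ModelVersion) -> ModelVersion:
    """Move a model one stage closer to production."""
    idx = STAGES.index(model.stage)
    if idx == len(STAGES) - 1:
        raise ValueError(f"{model.name} v{model.version} is already in production")
    return ModelVersion(model.name, model.version, STAGES[idx + 1])


def retrain(model: ModelVersion) -> ModelVersion:
    """Start the cycle again: a new version, back in the sandbox."""
    return ModelVersion(model.name, model.version + 1, "sandbox")
```

A short usage example: a model is promoted twice to reach production, its scores are recorded, and once the average falls below the threshold it goes back to the sandbox as a new version.

```python
model = promote(promote(ModelVersion("churn", 1)))
monitor = Monitor(threshold=0.8)
for score in (0.9, 0.5):
    monitor.record(score)
if monitor.needs_retraining():
    model = retrain(model)
```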
The main technology components proposed are:
- Dev Test Labs
- ML Service
In the next post, we will start setting up our Data Science environment. We will start by looking at setting up Dev Test Labs in Azure.