overview
In the second post we looked at ways to better integrate with spark data pipelines and improve model hosting capabilities. We introduced databricks, a ‘unified platform for data and ai’, and discussed how data scientists might interact with data engineers and the platform itself. In part 3 we will discuss how to automate deployment workflows using that same toolset.
recalling part 2 limitations
- while version control exists for code, there is no version control for data
- limited reproducibility
- low data science capacity; inability to iterate on models quickly
- PII is vulnerable to exposure
- manual updates are required to refresh the dashboard
- no drift or model performance monitoring
part 3 - automating deployment
- python + dask - language of choice; package for parallelizing compute
- github - version control
- s3 - distributed storage
- ec2 - distributed compute
- databricks connect + cli - hosting, compute, scheduling
- mlflow - ml lifecycle management
- github actions - ci
- power bi - visualization
general flow
- install python + your favorite IDE
- set up a databricks account and configure a cluster with python
- install the databricks cli in the python environment of your choosing
pip install databricks-cli
- configure an access token, authenticate to the cli, and verify the contents of workspaces and specific user profiles; open bash or powershell core and run the following commands
databricks configure --token
databricks host (should begin with https://): https://<some-stuff>.cloud.databricks.com
token: <token you created while configuring access token>

databricks workspace ls
databricks workspace ls /Users/<user@email.com>
databricks libraries list
note: databricks libraries list will not pick up packages that have been installed via anaconda, e.g. numpy / pandas. i recommend installing python directly so you can control package versions from the same location
- create repo / directory structure
- basic repo structure
├── LICENSE
├── README.md          # Top-level README for developers using this project.
├── data               # ingestion commands; templates for calling various data sources
│   ├── external       # connection to data from third party sources, if applicable.
│   ├── interim        # connection to intermediate data that has been transformed.
│   └── raw            # connection to the original, immutable data dump.
│
├── src                # trained and serialized models
├── notebooks          # Jupyter notebooks to assist with documentation
├── configuration      # toml, yml, or otherwise
├── .gitignore         # intentionally untracked files
├── requirements.txt   # Python file used to install runtime dependencies
- store / access raw (input) data in s3
- I use vs code and have had luck using the aws toolkit
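for illustration, a minimal sketch of pulling a raw file out of s3 with pandas; the bucket and key below are placeholders, and it assumes aws credentials are already configured locally (pandas needs the s3fs package to read s3:// paths directly):

```python
import pandas as pd

# placeholder path; point this at your own raw bucket / key
RAW_PATH = "s3://<your-raw-bucket>/transactions/2021-04.csv"

# pandas reads s3:// paths via s3fs, using the aws credentials
# already configured on the machine (e.g. via the aws toolkit or `aws configure`)
df = pd.read_csv(RAW_PATH)
print(df.shape)
```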
- train model locally, taking a percentage of the total population
- lukas @kdnuggets does a good job explaining a few methods around how to correctly select a sample from a large dataset
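one simple approach (a sketch, not a restatement of those methods) is a stratified random sample so the local training set keeps the target distribution; `df` and the `target` column name are assumptions carried over from the raw data loaded above:

```python
# assumes `df` is the full raw dataset and `target` is the label column
SAMPLE_FRACTION = 0.05  # train locally on roughly 5% of the population

# sample within each target class so the class balance is preserved
sample = (
    df.groupby("target", group_keys=False)
      .apply(lambda g: g.sample(frac=SAMPLE_FRACTION, random_state=42))
)
sample.to_parquet("data/interim/training_sample.parquet")
```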
- create new feature branch (or clone existing model from databricks)
- make some changes / develop locally (preferred ide)
- push to master (standard git flow)
- github actions detects a change in master
- github actions pushes the content of the target folder to the databricks workspace
databricks workspace -h
Usage: databricks workspace [OPTIONS] COMMAND [ARGS]...
Utility to interact with the Databricks Workspace. Workspace paths must be
absolute and be prefixed with `/`.
Options:
-v, --version
-h, --help Show this message and exit.
Commands:
delete Deletes objects from the Databricks...
export Exports a file from the Databricks workspace...
export_dir Recursively exports a directory from the...
import Imports a file from local to the Databricks...
import_dir Recursively imports a directory from local to...
list List objects in the Databricks Workspace
ls List objects in the Databricks Workspace
mkdirs Make directories in the Databricks Workspace.
rm Deletes objects from the Databricks...
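for example, the deploy step in the action would most likely call import_dir to recursively push the repo's notebook folder into the workspace (the source folder and workspace path below are placeholders):

databricks workspace import_dir ./notebooks /Users/<user@email.com>/<project> -o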
- specify model performance metrics in the mlflow config file; a model id is assigned
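as a sketch of what that logging looks like in code (the experiment path, parameter, and metric names here are illustrative, not prescriptive):

```python
import mlflow

# experiment path in the databricks workspace; adjust to your own
mlflow.set_experiment("/Users/<user@email.com>/<project>")

with mlflow.start_run() as run:
    mlflow.log_param("max_depth", 6)     # parameters used for this training run
    mlflow.log_metric("auc", 0.87)       # performance metrics compared across runs
    mlflow.log_metric("precision", 0.74)
    print(f"run id: {run.info.run_id}")  # the id mlflow assigns to this run
```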
- configure databricks notebook to run on more (or all) data in your target bucket
- output results from all data to target bucket
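a rough sketch of those two steps with dask; the bucket paths, feature list, and model artifact location are all assumptions, not fixed conventions:

```python
import dask.dataframe as dd
from joblib import load

RAW_GLOB = "s3://<your-raw-bucket>/transactions/*.csv"   # every file in the raw bucket
RESULTS_PATH = "s3://<your-results-bucket>/scored/"      # target bucket power bi reads from
FEATURES = ["amount", "tenure_days"]                     # assumed feature columns

model = load("src/model.joblib")   # serialized model from the training step
ddf = dd.read_csv(RAW_GLOB)        # lazy read of the full population

def score(partition):
    out = partition.copy()
    out["score"] = model.predict_proba(out[FEATURES])[:, 1]
    return out

# apply the trained model partition by partition, then write results to the target bucket
ddf.map_partitions(score, meta=ddf._meta.assign(score=0.0)).to_parquet(RESULTS_PATH)
```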
- connect power bi to target bucket
improvements over part 2
- automated model integration
- reproducibility
- historical record of model runs
- automated selection of top performing model (automated parameter selection)
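for instance, with the metrics logged earlier, the top performing model can be pulled back out of mlflow by sorting runs on a metric (the experiment path and metric name are the same assumptions as before):

```python
import mlflow

mlflow.set_experiment("/Users/<user@email.com>/<project>")

# search_runs returns a pandas dataframe of runs in the active experiment;
# take the single best run by auc
runs = mlflow.search_runs(order_by=["metrics.auc DESC"], max_results=1)
best_run_id = runs.loc[0, "run_id"]
print(f"top performing run: {best_run_id}")
```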
known limitations
- scripts used in the data science environment differ from those used in the databricks environment
- while continuous integration is available, it is not a seamless transition from the python scripts to the shared workspace; manual edits by a data engineer are still required
- no drift or model performance monitoring
notes
- databricks connect, preferred ide setup instructions
- as of the time of this writing, the jupyterlab integration with databricks-connect did not support windows machines
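a quick sanity check that databricks-connect is wired up (assumes databricks-connect is installed and `databricks-connect configure` has already been run against your cluster):

```python
from pyspark.sql import SparkSession

# with databricks-connect, this local SparkSession is proxied to the remote cluster
spark = SparkSession.builder.getOrCreate()

# should run on the databricks cluster and print 10
print(spark.range(10).count())
```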