overview

  • How do you select which model to use?
  • Once selected, how do you design the right architecture?
  • How do you manage the deployment of the models?
  • Who owns the model prior to deployment? After deployment?

More importantly, how do I build an application powered by machine learning? This question, among others, needs an answer before determining the best-fit analytics solution for your business scenario. Before we get to building, it helps to understand what tools we have to work with. Looking at how cloud service providers categorize their technology can give us a quick overview of technologies both related to, and outside of, the machine learning space. I will get a little meta and add that communities are more effective and efficient innovators than their corporate counterparts. Because of this, it is common for AWS / Azure / GCP et al. to capitalize on the open-source market; ergo, there is probably an open-source alternative to each of the cloud service providers’ products I am going to list. There is a lot to consider when selecting a tool, but I will do my best to recommend open-source projects where possible. I digress. Cloud products fall into a few categories:

  • compute
  • storage
  • database
  • migration
  • networking & content delivery
  • mobile services
  • developer tools
  • management tools
  • security, identity & compliance
  • analytics
  • application services
  • messaging
  • internet of things
  • customer support
  • machine learning / modeling
  • game development
  • media services
  • application integration
  • cost management
  • containers
  • container management

If you are interested, I have a working document with a compiled list of tools here.

Now that we have a good idea of the tools that are available, we can begin to define our service requirements. Hakon does a good job of (1) clarifying the learning and prediction constraints and (2) describing various business scenarios in this quora post.

It is worth noting how Hakon breaks learning down into (1) offline and (2) online, and prediction into (1) batch and (2) on-demand. I’ll add that ‘on-demand’ could be quantified as a specific service level agreement, e.g. from the point new data enters the source, serve results in less than time t. Constraints, such as team skill set, may also shape your optimum service requirements. Can your data engineers productionalize a jupyter notebook? Can they seamlessly convert python to scala? Or do your data scientists need to deploy their own work? Do you have a machine learning engineer? Etc.
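
To make this concrete, here is a minimal sketch of how these service requirements could be captured in code; the class and field names are my own illustrative choices, not a standard.

    from dataclasses import dataclass
    from datetime import timedelta

    # illustrative only: names and values here are hypothetical, not a standard
    @dataclass
    class ServiceRequirements:
        learning: str    # "offline" or "online"
        prediction: str  # "batch" or "on-demand"
        sla: timedelta   # max time from new data entering the source to results served

    # e.g. offline training with on-demand predictions served within 5 minutes
    requirements = ServiceRequirements(
        learning="offline",
        prediction="on-demand",
        sla=timedelta(minutes=5),
    )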

Once service requirements and constraints are defined, we can break our analytics project - one where we need to build and productionalize a model and surface its results - into nine (9) components and begin selecting our ideal toolset:

  • integration - build ingestion and distribution pipes
  • modeling - design best-fit model; varies based on model complexity
  • version control - track changes to models, parameters and data
  • computation - determine appropriate compute resources; can be local or distributed; qty / magnitude based on business need
  • storage - store model input, parameters, results and applicable performance metrics; can be local or distributed
  • hosting - ensure reliability and accessibility of application
  • configuration - manage application dependencies; e.g. is [model] in [container]? is [package] [version] installed in [environment]?
  • monitoring & retraining - assess model performance at time t and retrain, if applicable
  • scheduling - synchronize ingestion, model and business timing needs
  • visualization - transfer knowledge to business domain; surface model results to target end user(s)
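
In the spirit of recommending open source where possible, below is one illustrative open-source option per component. These are my picks for the sketch, not a definitive mapping; many alternatives exist.

    # illustrative open-source options per component; not a definitive mapping
    open_source_options = {
        "integration": "apache airflow / kafka",
        "modeling": "scikit-learn",
        "version control": "git (code) + dvc (data)",
        "computation": "dask / apache spark",
        "storage": "minio / postgresql",
        "hosting": "docker + kubernetes",
        "configuration": "conda environments / pip-tools",
        "monitoring & retraining": "mlflow",
        "scheduling": "apache airflow / cron",
        "visualization": "apache superset / metabase",
    }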

As noted above in the toolkit, there are many tools that specialize in these component categories. Depending on (1) the current state of your architecture, (2) your constraints, and (3) what you are trying to achieve, you may want to make tradeoffs between cost, flexibility and usability. Below are a few solution architecture ideas and the business cases they might fit.

stack selection


part 1 - quick and easy analytics

overview

how do I get this up and running as quickly as possible?

  • python + dask - language of choice; package for distributed computing using pandas & numpy lib formats
  • github - version control, source of truth
  • s3 - object storage (model input + output)
  • power bi desktop - visualization / reporting

general flow

  1. install python (or anaconda), dask and your preferred IDE
  2. create repo / directory structure in github
    • basic repo structure
       ├── LICENSE
       ├── README.md        # top-level README for developers using this project
       ├── data             # ingestion commands; templates for calling various data sources
       │   ├── external     # connection to data from third-party sources, if applicable
       │   ├── interim      # connection to intermediate data that has been transformed
       │   └── raw          # connection to the original, immutable data dump
       │
       ├── src              # source code, plus trained and serialized models
       ├── notebooks        # jupyter notebooks to assist with documentation
       ├── configuration    # toml, yml, or otherwise
       ├── .gitignore       # intentionally untracked files
       └── requirements.txt # python dependencies installed via pip
      
  3. store / access raw (input) data in s3
    • I use vs code and have had luck using the aws toolkit
  4. train model locally (see the sketch after this list)
  5. store model output in s3
    • make sure your file format is appropriate for your selected visualization tool
  6. push changes to master when ready to productionalize
  7. set up power bi
    • as described here, there are two methods for connecting to s3 from power bi: (1) using an odbc data source or (2) calling the aws s3 api directly through the power bi web connector.
    • I will go into more detail on this in another post
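
Steps 3 through 5 might look something like the sketch below. The bucket, paths and column names are placeholders, and it assumes dask, s3fs and scikit-learn are installed and aws credentials are configured.

    import dask.dataframe as dd
    from sklearn.linear_model import LinearRegression

    # 3. read raw (input) data from s3; s3fs handles the s3:// protocol
    raw = dd.read_csv("s3://my-bucket/raw/*.csv")  # placeholder bucket / path

    # 4. train locally; for data that fits in memory, collect to pandas first
    df = raw.dropna().compute()
    features = df[["feature_1", "feature_2"]]  # placeholder column names
    model = LinearRegression().fit(features, df["target"])
    df["prediction"] = model.predict(features)

    # 5. write model output back to s3 in a format power bi can read
    dd.from_pandas(df, npartitions=1).to_csv(
        "s3://my-bucket/output/predictions.csv", single_file=True, index=False
    )

Writing a single csv keeps the power bi connection simple; parquet is a reasonable alternative if your chosen connector supports it.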

capabilities

  • process large data sets
  • version control
  • virtually unlimited storage
  • begin generating insights
  • get people excited about “the possible”

limitations

  • while version control exists for code, there is no version control for data
  • limited reproducibility
  • low data science capacity; inability to iterate on models quickly
  • PII is vulnerable
  • manual updates required to refresh the dashboard
  • no drift or model performance monitoring