Machine Learning Workflow: A New Product Category Is Born

Machine learning (ML) is being touted as the solution to problems in every phase of the software product lifecycle, from automating the cleansing of data as it is ingested to replacing textual user interfaces with chatbots. As software engineers gain more experience developing and deploying production-quality ML solutions, it is becoming clear that ML development differs in important ways from that of other types of software.

Using notebook tools such as Jupyter or Zeppelin, the ML engineer creates experimental models, runs them on small samples of data, and shares them with domain experts and data scientists for feedback. Once the team has settled on a model worth scaling, the next step is to ingest, cleanse, and de-duplicate the data. The cleansed data is then divided into training data, which will be used to tune the model, and validation data, which will be used to validate it.
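To make that last step concrete, here is a minimal sketch of a train/validation split using scikit-learn; the file path, column name, and 80/20 ratio are illustrative assumptions, not requirements:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleansed, de-duplicated dataset (the path is a placeholder).
data = pd.read_csv("cleansed_data.csv")

# Separate the features from the label column ("target" is an assumed name).
X = data.drop(columns=["target"])
y = data["target"]

# Hold out 20% of the rows for validation. A fixed random_state makes
# the split reproducible across experiment runs.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```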

The ML engineer trains the proposed model by feeding it large volumes of data. To accelerate this process, the training is run in parallel across many processors, with the intermediate results combined at the end. This phase can be iterative and may require tweaking the model and then restarting the training. The training may also need to be rerun at regular intervals after deployment to update the model or isolate a problem. This requires rolling back not just the model but also the training data, a feature that traditional source control systems are not designed to handle.
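A rough sketch of both concerns, continuing from the split above: scikit-learn's n_jobs parameter parallelizes training across cores, and a content hash ties the saved model artifact to the exact data that produced it (the hashing scheme here is an illustrative convention, not a standard):

```python
import hashlib

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 builds the forest's trees in parallel on all available
# cores; the per-worker results are combined into a single ensemble.
model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
model.fit(X_train, y_train)

# Fingerprint the training data so the model can later be rolled back
# together with the exact data it was trained on.
data_hash = hashlib.sha256(
    pd.util.hash_pandas_object(X_train, index=True).values.tobytes()
).hexdigest()[:12]

# Embed the data fingerprint in the artifact name.
joblib.dump(model, f"model_{data_hash}.joblib")
```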

The team then tests the accuracy of the trained model by running data through it and comparing the model’s predicted results with the actual results. Once the team is satisfied with the trained model’s accuracy, the model must be integrated with the target application and deployed on a scalable infrastructure so that it can respond to requests in production. Depending on the type of model and the deployment environment’s performance requirements, this may call for mechanisms such as horizontal scaling, caching results, or deploying parallel versions of the model in multiple containers.
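Continuing the sketch, evaluation boils down to comparing predictions against the held-out labels; the cache below is one illustrative way to serve repeated identical requests without re-running the model:

```python
from functools import lru_cache

from sklearn.metrics import accuracy_score

# Compare the model's predictions against the held-out actuals.
predictions = model.predict(X_val)
print(f"Validation accuracy: {accuracy_score(y_val, predictions):.3f}")

# In production, repeated identical requests can be answered from a
# cache. Features are passed as a tuple because lru_cache requires
# hashable arguments.
@lru_cache(maxsize=10_000)
def predict_cached(features: tuple) -> int:
    return int(model.predict([list(features)])[0])
```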

Another distinguishing characteristic of ML software is that it is far more brittle than traditional software. ML algorithms are non-deterministic and highly sensitive to the characteristics of the data with which they were trained. If those characteristics change, the model may lose accuracy and need to be replaced by an alternative model.
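One common way to catch this drift in production is to compare the live distribution of each feature against its training distribution; the sketch below uses a two-sample Kolmogorov-Smirnov test on numeric features, where X_live is a hypothetical batch of recent production inputs and the 0.01 threshold is an illustrative choice:

```python
from scipy.stats import ks_2samp

def has_drifted(train_col, live_col, alpha=0.01):
    """Flag a numeric feature whose live distribution no longer
    matches the distribution the model was trained on."""
    _, p_value = ks_2samp(train_col, live_col)
    # A small p-value means the two samples are unlikely to come from
    # the same distribution, i.e. the feature has drifted.
    return p_value < alpha

# X_live is a hypothetical DataFrame of recent production requests.
for column in X_train.columns:
    if has_drifted(X_train[column], X_live[column]):
        print(f"Drift detected in feature: {column}")
```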

Another cause of ML software’s brittleness is that every step depends tightly on every other step, so the norm is “Change Anything Changes Everything.”

To meet these challenges, many engineering teams have taken existing open source tools and wired them together to create a “roll your own” ML operational environment, using tools such as Jupyter (ML notebooks), Airflow (data pipelines), Docker (containerization), and Kubernetes (container orchestration); a minimal sketch of such a home-grown pipeline appears after the product list below. But for some teams, the cost and complexity of this approach is hard to justify. As an alternative, a new category of products has emerged that provides an end-to-end ML operational environment. Products in this category include:

  • Amazon SageMaker: a fully managed platform that enables developers to build, train, and deploy machine learning models at scale.

  • Yhat ScienceOps: an end-to-end platform for developing, deploying and managing real-time ML APIs.

  • Pachyderm: an environment that automates all stages of developing machine learning pipelines.
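For contrast, here is a minimal sketch of what the “roll your own” alternative mentioned above might look like as an Airflow DAG; ingest_data, train_model, and validate_model are hypothetical placeholders for a team’s own pipeline steps:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module containing the team's own pipeline steps.
from my_ml_pipeline import ingest_data, train_model, validate_model

with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",  # retrain at regular intervals
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=ingest_data)
    train = PythonOperator(task_id="train", python_callable=train_model)
    validate = PythonOperator(task_id="validate", python_callable=validate_model)

    # Each step depends on the one before it: Change Anything
    # Changes Everything.
    ingest >> train >> validate
```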

These end-to-end products can vastly simplify the process of creating and deploying ML algorithms, with a few caveats:

  • These products enable an ML team to deploy an ML algorithm in production. The question must be asked: is this desirable? Does the team have the requisite operational experience to drop code into its production environment via an automated software tool?

  • Like any new product, these tools have rough edges in areas such as stability and performance. A good rule of thumb: always run a proof of concept to see how the product behaves in your environment.

  • Adopting one of these products locks you into its roadmap. It may speed up your initial time to market, but it can limit your flexibility down the road.

  • Many of these products have an open source version. But if you intend to use the product in production, you’ll quickly discover that you need the enterprise version.

  • Some of these products may suffer from a lack of focus, as they try to expand and solve problems beyond the ML development process. Make sure whatever product you choose can provide the depth of capabilities you need.

ML is poised for explosive growth in the enterprise, and ML workflow environment tools like the ones described above lower the barrier to entry. It will be interesting to see how this product family matures in the coming months.

