Email Recommendation System-Abstract: Deployment Considerations (Part III)

One of the most thrilling moments of any machine learning project for a data science team is learning that they get to deploy the model to a production environment. Deployment can be a daunting task, or a straightforward one if all the right tools are readily available.

Machine learning (ML) models must be deployed to a production environment to deliver business value, yet the reality is that most models never make it to production because the deployment phase drags on far longer than it should. Even successfully deployed models demand ongoing subject-matter and domain expertise, which creates intense bottlenecks in engineering and operational processes.

If you treat ML models as software, then like any software they require deployment, maintenance, and versioning. The difference is that ML models introduce a broader set of complex challenges. Let us review them.

The demands of ML deployment have given rise to the field of MLOps, a nascent but rapidly growing discipline. Just as DevOps added structure to software engineering, a proper MLOps implementation orchestrates the development process and the deployment of ML models. Done right, models are validated and built autonomously. This post introduces MLOps and outlines the deployment considerations and requirements that help ensure ML applications make it to production and run smoothly, because that is what it takes for a model to provide business value. Concepts do not provide business value, and neither do stand-alone models.

We will leave aside estimating the business value achieved by deploying such a model, though we discuss architectures that simplify decisions regarding continuous and autonomous deployments.

Step one is acknowledging the gap between the success of the initial model and the requirements before, during, and after deployment. The success of your email recommendation system rests not on the model alone but on the horizontal orchestration of the deployment. To handle these tasks well, we should work backward through the machine learning lifecycle.

Data Resiliency

The major components of the deployment phase include model monitoring and logging, maintenance, compute resources, concept drift, data drift or data degradation, and data resiliency. Data resiliency is a relatively new term coined to describe the age and relevancy of the features or schema, and the age of the historical data used to make predictions more accurate. The question to ask the team is: how resilient is this data? The answer can vary depending on the type of model, the industry, any third-party data introduced, and many other factors.

A data relevancy or data resiliency score is a new feature or variable that can be introduced in the testing environment to help your predictions hold up in production. We might ask questions like: how quickly does the data become obsolete? Why does this dataset contain 16K images of automobiles produced ten years ago? How far back in time should our model reach? When was the data resiliency score introduced into the dataset? The score could be based on how correlated the dependent variables are with the target variable and, again, on the age of the data. So what makes deployment hard, especially in an email recommendation system environment? Below are several factors to consider.
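Before turning to those factors, here is a rough illustration of how such a resiliency score might be computed, assuming a pandas DataFrame with a timestamp column. The column names, half-life, and weighting scheme are all assumptions for the sketch, not a prescribed formula.

```python
# Hypothetical sketch of a "data resiliency" feature: blend feature-target
# correlation with an age-based decay. Names and weights are illustrative.
import numpy as np
import pandas as pd

def data_resiliency_score(df: pd.DataFrame, feature: str, target: str,
                          timestamp: str = "event_ts",
                          half_life_days: float = 180.0) -> pd.Series:
    """Return a per-row 0-1 score combining recency and relevance."""
    # Age component: exponential decay by record age.
    age_days = (pd.Timestamp.now(tz="UTC")
                - pd.to_datetime(df[timestamp], utc=True)).dt.days
    recency = np.exp(-np.log(2) * age_days / half_life_days)

    # Relevance component: absolute correlation between feature and target.
    relevance = abs(df[feature].corr(df[target]))

    return (recency * relevance).clip(0, 1)
```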

First Step: Register your Model

MLOps teams are the glue between the model that was built and the success of your deployment. MLOps works alongside the data science teams, DevOps, and software engineers to get the model production-ready. Having a model registry puts structure around the handoff between data scientists and engineering teams; it is a foundational component of any successful deployment process.

If your model is transferred to production and produces erroneous output, registries make it easy to determine which model version is causing the issue and roll back to a previous version of the model if necessary. Model registries also enable the auditing of model predictions.

When MLOps examines new and relevant pain points during the deployment phase, a solid reference to the schema is paramount, along with a historical record of the experiments that produced the original model. If you are committed to a single cloud provider, registries such as those in AWS SageMaker and Microsoft Azure are standard and make registration simple. There are also several easy-to-implement open-source tools for registration; probably the most popular is MLflow, which enables customization across many environments and technology stacks.
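As a minimal sketch of what registration can look like with MLflow, assuming a scikit-learn model; the model name, metric, and toy dataset are placeholders standing in for the real training setup.

```python
# A minimal sketch of registering a model version with MLflow's model
# registry and loading a specific version back for serving or rollback.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("sqlite:///mlflow.db")  # registry needs a DB-backed store

X, y = make_classification(n_samples=500, n_features=10, random_state=0)  # toy data

with mlflow.start_run():
    model = LogisticRegression(max_iter=1000).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))  # illustrative metric
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="email-recommendation-image-optimizer",  # hypothetical
    )

# Engineering can later load (or roll back to) a specific registered version.
champion = mlflow.sklearn.load_model("models:/email-recommendation-image-optimizer/1")
```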

Feature Stores

Feature stores are becoming a necessity when building models at scale, like our email recommendation engine. Our partner Splice Machine understands the importance of registering your model and the significance of what feature stores can bring to any implementation. Your model's schema will change continuously, and a feature store makes it easier to track what data is being used for the email recommendation predictions and helps data scientists and ML engineers reuse features across multiple models. If you have "Marlowe" turned on, you might be running several types of models simultaneously, orchestrated by one AI assistant. The feature store provides a repository where data scientists can track features they have extracted or developed for the initial models and iterate by building new models on the fly. In other words, if a data scientist retrieves data for a model, engineers a new feature from existing features (such as a data resiliency score), or monitors the "input" velocity of the end-user, they can commit that to the feature store. Once a feature has been admitted to the feature store, it can be reused to train new models, not just by the data scientist who created it but by anyone within the organization who trains models.
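To illustrate the reuse pattern, here is a minimal sketch using the open-source Feast feature store purely as a stand-in (our implementation uses Splice Machine's feature store); the feature names and entity are invented for the example.

```python
# Hypothetical illustration: fetch previously registered features at
# prediction time so any model in the organization can reuse them.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a configured Feast repo

features = store.get_online_features(
    features=[
        "email_features:data_resiliency_score",  # invented feature names
        "email_features:historical_ctr",
        "email_features:input_velocity",
    ],
    entity_rows=[{"campaign_id": 1234}],         # invented entity
).to_dict()
```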

Compute Resources and Auto-Scaling

Other considerations for streamlined deployment include the type of predictions (real-time vs. batch), compute power, cloud vs. edge/browser, monitoring and latency requirements, and security and privacy.

Since we are building a real-time email recommendation system, we know a dedicated prediction server is needed to serve predictions in milliseconds. Campaign builders will not wait overnight to assess their predictions, so batch processing is not a workable way to serve predictions in our particular environment. Further, some models working in parallel, especially the image optimization widgets inside the email editor, might rely on deep learning neural networks that require enormous computing power. It is crucial at the outset to scope out what compute resources are needed.

Thankfully, the cloud offers auto-scaling mechanisms to bolster or throttle back GPU usage as necessary, but judging by some of Google's documentation, this can be a little bumpy, especially during the assessment phase. The MLOps architect and software engineers should create a plan to monitor peak usage patterns and map out expected input velocity and deployment patterns before full-scale production. A staggered deployment approach such as shadow mode or canary deployments can increase accuracy and prove a worthy strategy, since shadow-mode deployments are often paired with human assessment of the model's integrity. In a continuous model-building environment, however, shadow mode might not be favorable.
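For illustration, here is a minimal sketch of shadow-mode serving, assuming two model clients that expose a predict method; the client objects and logging wiring are hypothetical.

```python
# Serve the production prediction; score the shadow model off the hot path
# and log both results for later comparison.
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger("shadow")
executor = ThreadPoolExecutor(max_workers=4)

def predict(features, production_model, shadow_model):
    result = production_model.predict(features)

    # Fire-and-forget shadow scoring so end-user latency is unaffected.
    def _score_shadow():
        try:
            shadow_result = shadow_model.predict(features)
            logger.info("shadow_prediction prod=%s shadow=%s", result, shadow_result)
        except Exception:
            logger.exception("shadow model failed")

    executor.submit(_score_shadow)
    return result
```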

Latency

Questions to consider here include: At what velocity are the inputs arriving? How fast do I want to serve my predictions? At what times of day is usage heaviest? Answering them is part of your observability work, and logging and monitoring requirements are chief concerns that will require additional investment.

In an email recommendation system, returning 6-10 different image enhancement choices in the "styling widget" in 500 milliseconds or less would be optimal. Image styling and image augmentation to produce more accurate predictions might also need to be available. Setting the latency requirement for your end-users is a significant decision, and hitting it provides that "dopamine effect" the campaign builder quickly comes to crave. On the throughput side, if you are building a recommendation system that needs to handle 1,000 queries per second, it helps to load-test the deployment to confirm you have enough computing power to hit the QPS requirement.
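A rough way to sanity-check such a QPS and latency target is a simple load test against the prediction endpoint; the endpoint URL, payload, and numbers below are placeholders.

```python
# Fire concurrent requests at a prediction endpoint and report achieved QPS
# and p95 latency. Tune N_REQUESTS and CONCURRENCY to the target workload.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8080/predict"   # hypothetical prediction server
PAYLOAD = {"campaign_id": 1234, "widget": "image_styling"}
N_REQUESTS = 1000
CONCURRENCY = 50

def one_call(_):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=PAYLOAD, timeout=2)
    return (time.perf_counter() - start) * 1000  # latency in ms

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    t0 = time.perf_counter()
    latencies = list(pool.map(one_call, range(N_REQUESTS)))
    elapsed = time.perf_counter() - t0

print(f"achieved QPS: {N_REQUESTS / elapsed:.0f}")
print(f"p95 latency:  {statistics.quantiles(latencies, n=20)[18]:.0f} ms")
```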

Logs

When building your email recommendation system, it may be helpful to log as much data as possible for analysis and review, and to provide supplementary data for retraining your learning algorithm in the future. The logs can act as a new dataset for sizing compute needs and estimating peak traffic patterns, as well as for continuously improving the accuracy of the models in production. This data can also reveal deployment patterns for real-time model iterations. When certain thresholds are breached in your monitoring, you should receive a notification.
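Here is a minimal sketch of structured prediction logging with a simple threshold alert; the fields and the 500 ms threshold are illustrative assumptions tied to the latency discussion above.

```python
# Log each prediction as a structured JSON record (reusable for retraining
# and drift analysis) and warn when a latency threshold is breached.
import json
import logging
import time

logger = logging.getLogger("predictions")

LATENCY_ALERT_MS = 500  # assumed SLO from the latency discussion

def log_prediction(request_id, features, prediction, latency_ms):
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "features": features,      # inputs, usable later for retraining
        "prediction": prediction,  # outputs, usable for drift analysis
        "latency_ms": latency_ms,
    }
    logger.info(json.dumps(record))

    if latency_ms > LATENCY_ALERT_MS:
        logger.warning("latency threshold breached: %.0f ms", latency_ms)
```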

Security and privacy play a significant role in determining the type of information served up by the recommendation system. Depending on the inputs, the concept of the model might need to be modified slightly to accommodate your team's security requirements. In an email prediction system like ours, security and privacy should be straightforward to handle with minor tweaks to the "x" inputs.

Concept Drift and Data Drift

Generally, some concept drift should happen; if it never does, you cannot trust the model. It means the model's predictors and classifiers (the target and its correlated dependent variables) might require a re-engineering effort to accommodate the deployment patterns, especially after mapping the model onto the production dataset. Concept drift can arrive from many vectors.

Foundationally, the model might have performed well on the test set, but the overall concept may need modification when the model is carried into the deployment phase. There may be mapping constraints or even mapping enhancements. There might come a time, for example, when an email's click-through rate (CTR) is no longer the target variable. In our proof of concept, we showed only 3-5 classifiers; under concept drift, that count might grow to 12 or more. Concept drift is healthy, but it should not disqualify your efforts or heavily constrain your timeline to deployment. Tackle concept drift early and often.

Concept drift is all the more likely in an email recommendation system as you discover enhancements in the production dataset that were never considered with the test dataset. Either way, modifications to the existing model will need to be made on the fly for the predictions to maintain integrity and precision.

The prospect of adding new hyperparameters, a new target variable, or a combination of both is a validation of a working model. For example, if we want to optimize the model for the inputs x, we might consider introducing additional target variables so that the output y shows a higher delta. A perfect complement to a CTR target variable in email optimization might be HubSpot's overall email score.

Principally, the 3-5 variable optimization technique discussed in Parts I and II requires further conceptualization.

In a data pipeline where new data points, both inputs "x" and outputs "y", are ingested in real time from end-users, data drift or data degradation is likely to be less severe.
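One common way to watch for data drift on a single feature is a two-sample Kolmogorov-Smirnov test comparing production values against the training baseline; a minimal sketch follows, with the drift threshold as an assumption rather than a standard.

```python
# Flag a feature as drifted when its live distribution differs significantly
# from the training distribution.
import numpy as np
from scipy.stats import ks_2samp

def feature_drift(train_values: np.ndarray, live_values: np.ndarray,
                  p_threshold: float = 0.01) -> bool:
    """Return True when the live distribution has drifted from training."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold
```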

TensorFlow Extended

We encourage you to study TensorFlow Extended (TFX), which makes deploying your model in almost any environment simpler and aggregates the above components into a one-stop shop. Pay particular attention to its ontology of artifacts and metadata: the more metadata your model stores, the better its explainability. Explainability = trust.

The essential workflows and underlying systems for ML in production come in different shapes and sizes. One key distinction, however, is between one-off and continuous pipelines.

Engineers initiate one-off pipelines to produce ML models “on demand.” In contrast, continuous pipelines are “always on”—the ingestion of new data produces new iterations of models in real-time. This is how we see our email recommendation engine developing.

The expectation is that a "fresh" model should be pushed to serving as frequently and promptly as possible to reflect the latest trends in the incoming traffic. For example, you might want to serve 3D images in your styling editor to enrich its feature portfolio. Our email recommendation engine for image optimization uses data and image augmentation to see whether augmented images produce a better CTR.

Any ML task whose underlying data domain is non-stationary can benefit from continuous training to keep models fresh. Failing to update models in non-stationary settings can lead to performance or data degradation. The frequency with which models need to be updated depends on the speed with which the underlying data evolves. Once again, the theme here is input velocity and how many different models are in use at any given time. In a recommender system like ours, the inventory of items representing the corpus (initial model) keeps expanding. The initial model may have had five different types of emails as dependent variables, but as new classifiers evolve, the model needs to be updated. As a final example, the images retrieved by the initial model will change as new images need to be ranked and ultimately retrieved to expand the corpus for better predictions. This keeps the recommendations fresh.

Autonomous Validation

Within TFX, any system that automatically generates new ML models must have validation safeguards in place before pushing a new model to production.

Using human operators for these validation checks is prohibitively expensive and can slow down iteration cycles. Moreover, these safeguards need to be applied at several junctures in the pipeline to catch different classes of errors before they propagate through the system.

This implies more than just checking the quality of the updated model compared to the current production model. As an example, suppose that an error in the data leads to a suboptimal model. Then, whereas a model-validation check will prevent that model from being pushed to production, the trainer’s checkpointed state might be affected by the corrupted data and thus propagate errors to any subsequent warm-started models.

TFX addresses these points by employing several validation checks at different stages of the pipeline. These checks ensure that models are trained on high-quality data, are at least as good as the current production model, and are compatible with the deployment environment. We will publish our email recommendation model for images later this month.
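To make those stages concrete, here is a condensed sketch of the validation gates using TFX's standard components. The data paths, trainer module, and step counts are placeholders, and a real pipeline would also need a full tfma.EvalConfig, a baseline model for comparison, and a pipeline runner.

```python
# Condensed TFX sketch: data validation before training, model validation
# ("blessing") before pushing, and a Pusher that deploys only blessed models.
from tfx import v1 as tfx

example_gen = tfx.components.CsvExampleGen(input_base="data/")  # placeholder path
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs["examples"])
schema_gen = tfx.components.SchemaGen(statistics=statistics_gen.outputs["statistics"])

# Data validation: catch corrupted or anomalous data before training.
example_validator = tfx.components.ExampleValidator(
    statistics=statistics_gen.outputs["statistics"],
    schema=schema_gen.outputs["schema"],
)

trainer = tfx.components.Trainer(
    module_file="trainer_module.py",                 # hypothetical training module
    examples=example_gen.outputs["examples"],
    schema=schema_gen.outputs["schema"],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)

# Model validation: compare the candidate against the current production
# model and "bless" it only if it is at least as good.
evaluator = tfx.components.Evaluator(
    examples=example_gen.outputs["examples"],
    model=trainer.outputs["model"],
)

# Pusher deploys only blessed models to the serving directory.
pusher = tfx.components.Pusher(
    model=trainer.outputs["model"],
    model_blessing=evaluator.outputs["blessing"],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(base_directory="serving/")
    ),
)
```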

By Fred Tabsharani, Founder and CEO at Loxz Digital Group

Fred Tabsharani is Founder and CEO of Loxz Digital Group, a machine learning collective with an 18-member team. He has spent the last 15 years as a globally recognized digital growth leader. He holds an MBA from John F. Kennedy University and has added five AI/ML certifications, two from UC Berkeley (SOI)/Google and two from IBM. Fred is a 10-year veteran of M3AAWG and an Armenian General Benevolent Union (AGBU) Olympic basketball champion.
