Getting AI service in maintenance mode - Intuit Developer Community Blog

May 4, 2021 | Ido Mintz

Getting AI service in maintenance mode

Artificial intelligence (AI) is being adopted by companies the world over. According to Semrush AI statistics, 86% of CEOs say AI is mainstream technology in their offices, 54% of executives say implementing AI has increased productivity, and 79% of executives think AI makes their jobs easier and more efficient. AI-based services for customers continue to grow and emerge, as well, and data scientists are working hard to use advanced tools and algorithms to keep improving them. 

That being said, at some point it’s time for us, as developers, to find new problems to tackle, while ensuring that our AI service continues to deliver its expected impact and performance without human intervention. Here are some tips to guide you on which components an AI service should have in order to maintain its independence.

Maintaining artificial intelligence services independently

1. A well-defined goal

At Intuit®, our products are built to Design for Delight (D4D). This process ensures that we are working on the most important customer problems, and it maps the path from the current state to the ideal state, which will lead to improvement from <from_value> to <future_value>.

Along with the development cycle described well by Shir Meir Lador in Lessons learned leading AI teams, the key is to validate the model's performance against the goal it is meant to achieve. For that, we use a silent control group: a population that is scored by the model, but doesn't receive the treatment the model advises. Having a well-defined silent control set is a critical capability for evaluating the model's suggestions. It provides a baseline of the state of the world without the model, and also lets us easily simulate what would have happened if we had used the model.
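
To make this concrete, here is a minimal sketch (not Intuit code) of how a silent control group might be assigned and used to estimate impact. The entity ids, outcome values, and the 10% control fraction are illustrative assumptions:

```python
import random

def assign_to_silent_control(entity_id, control_fraction=0.1, seed=42):
    """Deterministically assign an entity to the silent control group.

    Control entities are still scored by the model, but the model's
    recommended treatment is never applied to them.
    """
    rng = random.Random(f"{seed}:{entity_id}")
    return rng.random() < control_fraction

def estimate_model_lift(treated_outcomes, control_outcomes):
    """Compare the average outcome (e.g., loss per event) between the
    treated population and the silent control group."""
    treated_avg = sum(treated_outcomes) / len(treated_outcomes)
    control_avg = sum(control_outcomes) / len(control_outcomes)
    return control_avg - treated_avg  # positive => the model reduced losses
```

Seeding on the entity id keeps assignment stable across requests, so the same customer never bounces between groups.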

2. Monitor the service’s behavior against expectations

Monitoring the service needs to be done on four dimensions:

  1. Customer impact – How the customer problem is being mitigated: the current state compared to the base state (e.g., money being saved or fraud being reduced). Ideally, you have a clear dashboard showing the extent of the customer problem over time, and you can identify how releases of your AI solution influence that metric. An increase in this metric suggests that your service is not mitigating the customer problem well enough and that a new version is needed.
  2. Engineering excellence – Make sure that everything is working, such as SLAs and error handling. This set of metrics monitors that your service performs well from an operational standpoint.
  3. Model health – Feature and model score distributions. It is possible that your service is working without delays or errors, but the distribution of the features it consumes is changing over time. This could be because something in the way your org saves the data or calculates the features has changed in a way that changes your service's output distribution. It is crucial to be alerted to this, because if your AI model was trained with a different transformation process, its predictions will no longer be reliable. Feature distributions may also change because your users adapted their behavior in response to the model, a process described as concept drift. This is common in the security, risk, and fraud space, as fraudsters learn our defense systems and find new vulnerabilities to exploit.
  4. Model performance – How the model performs against the ground-truth labels (e.g., precision, recall, and accuracy). These metrics are usually measured against a randomized control group in which events are scored but no action is applied. There are numerous ways to randomize the allocation of control groups, and you should choose the randomization method that best represents the events in real life.
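
For the model-health dimension, one common way to quantify a change in a feature or score distribution is the Population Stability Index (PSI). This is a generic sketch, not part of any Intuit tooling; the bin count and the thresholds in the docstring are conventional rules of thumb:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample ("expected", e.g., training-time
    feature values) and a live sample ("actual").
    Rule of thumb: PSI < 0.1 stable, 0.1-0.25 watch, > 0.25 drifted.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[max(idx, 0)] += 1
        # Smooth empty bins to avoid log(0)
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Running this daily per feature (and on the score itself) gives a single number you can alert on.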

Leveraging Splunk and the new ML toolkit can help you easily set up a monitoring dashboard and alerts for your AI service. I recommend using a function that builds your response and logs it just before returning it. Here is an example:

```python
import json
import logging
import sys

logger = logging.getLogger("predictor")
logger.setLevel(logging.INFO)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
logger.addHandler(handler)


def log_response(tid, score, version, execution_time):
    # Build the response payload and log it just before returning, so
    # Splunk captures exactly what the caller receives.
    result_dic = {"tid": tid,
                  "prediction": score,  # your model's response goes here
                  "version": version,
                  "execution_time": execution_time}
    result_json = json.dumps(result_dic)
    logger.info(f"result={result_json}")
    return result_json
```

This format makes sure Splunk gets the model's response, plus some metadata. These data include the model version and the execution time, which will help analyze how different versions are performing.

3. Alert protocol and escalation

Monitoring dashboards are great, as they allow us to understand what is happening. However, remember that in maintenance mode we want to reduce human attention to a minimum, so it is crucial to be alerted when to look and what to look at. For each of the four dimensions listed above, you should have some expectation of your service's behavior in day-to-day life. Your expectations can be static or can change with weekdays and seasonal periods, but the fundamental goal is to predict the service's behavior. This baseline can be derived from the test data used to evaluate the service's performance, or from a period when your service runs live with no action attached.
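
As a sketch of how such an expectation can be turned into an alert, the snippet below builds a static band from a baseline period using a mean ± k·std rule; the choice of k = 3 is an illustrative assumption:

```python
import statistics

def expected_band(history, k=3.0):
    """Derive a static alert band from a baseline period (e.g., the weeks
    your service ran live with no action attached)."""
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    return mean - k * std, mean + k * std

def should_alert(value, band):
    """Flag any observation that falls outside the expected band."""
    lower, upper = band
    return value < lower or value > upper
```

For metrics with weekly seasonality, you would build one band per weekday rather than one global band.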

Too many alerts amount to crying wolf; too few will break your credibility. The line is not always clear, and it is all about balance, so work with your partners to define the right alert sensitivity. Every alert should be well defined: when it should be raised, who should receive it, and what can be done to mitigate it.

It’s important to give some thought to the risks of relying on an AI-based service early in the development stage. For example, a high false positive rate may mean that your service prevents your organization from operating, making customers dissatisfied, which could lead to a serious reputation crisis. This could happen when your service declines transactions or loans that are the fundamental product your organization sells.

A high decline rate may indicate that something in your model has broken, or that your organization is under some kind of attack. The dashboard should allow you to quickly understand your current position. If an attack happens, you’ll want to alert the finance department to carefully review loan applications. If something is broken, I recommend having a fast switch to bypass the service and allow the business to continue operating, even at the cost of some excess loss. This is especially important for AI services, as it is not possible to retrain a model in the middle of the night within hours.
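
Such a bypass switch can be as simple as a feature flag checked before the model is called. The sketch below assumes a hypothetical MODEL_BYPASS environment variable and a conservative default decision; adapt both to your own stack:

```python
import os

# Hypothetical conservative default returned while the model is bypassed.
SAFE_DEFAULT = {"decision": "approve", "score": None, "reason": "model bypassed"}

def score_with_bypass(features, predict_fn):
    """Call the model unless the bypass flag is set; on bypass, return the
    safe default so the business keeps operating during an incident."""
    if os.environ.get("MODEL_BYPASS") == "1":
        return dict(SAFE_DEFAULT)
    return predict_fn(features)
```

Because the flag is read per request, on-call engineers can flip it without a redeploy.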

The minimum metrics I recommend having alerts for are:

  • Decline rate – Anomaly detection in Splunk MLTK 4.0 allows you to easily set these alerts and smooth the thresholds using a rolling time frame. You can also add a static trigger that notifies your team when the decline rate is higher than expected.
  • Errors – It’s good practice to wrap your prediction code in a try/except block to catch errors methodically and track them in a structured way. Specifically, I suggest you log and return the error with an extreme negative score, like score = -400, and then build a trigger to catch these negative scores and send an alert.

```python
try:
    <Your code>
except Exception as e:
    return response(source, mt_txnid,
                    score=-400,
                    reason="Unexpected error: " + str(e),
                    version=version,
                    execution_time=prediction_time(start))
```

  • Latency – We want our models to respond robustly even when traffic increases dramatically. The server that calls your model probably has a timeout on the response, so a delayed answer may cause it to bypass your recommendation. At Intuit, we use auto scaling with AWS SageMaker to deliver great performance, but these scaling solutions sometimes take time to respond, and it is always good to make sure that TP99 (response time at the 99th percentile) is within expectations.
  • Performance – Since AI models tend to degrade in performance over time, it is important to track performance against the expected performance in the lab. Sometimes you will need a control set that is scored by your model but not acted on, so that a label can be retrieved. I especially like to monitor precision/recall metrics, but you should keep monitoring the metrics you originally optimized for, to know when something is going wrong.
  • Score distribution – This allows you to detect sudden shifts in behavior that could indicate a shift in usage or a bug in specific features. As it is hard to set alerts on a whole distribution, I recommend setting alerts on specific interesting percentiles of daily scores, like the 90th or 95th percentile.
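
For the latency and score-distribution bullets above, a percentile alert might look like the following sketch; the nearest-rank percentile method, the 95th percentile, and the tolerance value are illustrative choices:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (pct in [0, 100])."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[min(rank, len(ordered)) - 1]

def score_distribution_alert(todays_scores, baseline_p95, tolerance=0.1):
    """Alert when today's 95th-percentile score drifts more than
    `tolerance` from the baseline established during a healthy period."""
    p95 = percentile(todays_scores, 95)
    return abs(p95 - baseline_p95) > tolerance
```

The same `percentile` helper works for TP99 latency by passing pct=99 over a day of response times.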

Since models tend to degrade, it’s important to have an automated process that runs periodically to re-train the model with new data. To decide on the training cadence, I recommend estimating how fast performance degrades to a certain minimum. Also, be aware that a sudden shift in user behavior may occur that will require re-training. I found training on a weekly basis to be a good balance between having a fresh version ready and the cost overhead.

4. Automated training pipeline with self-validation

In this context, it is also important to track the performance of each training run. Ideally, at the end of the training pipeline, add a step that validates performance on your control group, keeping an out-of-time validation set to best predict what will happen when the model is deployed. At Intuit, we use MLflow to track each run’s metrics, which makes it easy to select which model to deploy and to track ongoing training results.
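
An out-of-time validation step can be sketched as a simple date-based split; the record layout (dicts with an `event_date` key) is an assumption for illustration:

```python
from datetime import date

def out_of_time_split(records, cutoff):
    """Split labeled records by event date: train on everything before the
    cutoff, validate on everything on or after it, to best approximate how
    the model will behave once deployed on future traffic."""
    train = [r for r in records if r["event_date"] < cutoff]
    holdout = [r for r in records if r["event_date"] >= cutoff]
    return train, holdout
```

Unlike a random split, this keeps the holdout strictly in the future of the training data, so the validation metrics reflect real deployment conditions.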

5. Normalize model scores – Glassifier

It is important to consider how to switch versions in production. Best practice is a gradual ramp-up process where we start by having the new version run in shadow mode, which means all transactions are scored by both versions, but decisions are made only by the current version. At Intuit, we call this the “Champion/Challenger ramp up,” meaning the new version (Challenger) is trying to challenge the current one (Champion).

Next, if the Challenger beats the Champion in shadow mode, we start allowing the Challenger to make decisions on a portion of the traffic (say, 30%). You can then easily compare the impact between traffic groups, and once you can make an informed decision that the Challenger has beaten the Champion, it is ready to replace it.
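
Routing a fixed share of traffic to the Challenger is often done with a deterministic hash of the transaction id, so the same transaction always lands in the same group. A minimal sketch (the 30% share mirrors the example above):

```python
import hashlib

def assigned_to_challenger(transaction_id, challenger_share=0.30):
    """Deterministically route a share of traffic to the Challenger.
    Hashing the transaction id keeps assignment stable across retries."""
    digest = hashlib.sha256(str(transaction_id).encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1)
    return bucket < challenger_share
```

Because the assignment is a pure function of the id, both versions can be compared on disjoint, stable traffic groups without storing any state.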

One problem that may arise is that the probability score does not always accurately represent the metric the org is trying to optimize (e.g., precision, recall, or accuracy). This usually requires the data scientist to present the expected performance at certain thresholds and change the action trigger for each version. Hence, it is good practice to normalize the model score in terms of the desired metric. This also spares the avoidable discussion around score interpretation.
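
Score normalization in terms of a desired metric can be sketched generically (Glassifier itself is an internal Intuit package; the code below is not its API). Here, raw scores are mapped to the precision observed at that score on a labeled validation set:

```python
def precision_calibration(scores, labels, thresholds):
    """Build a (threshold, precision) table from a labeled validation set."""
    table = []
    for t in thresholds:
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        precision = sum(flagged) / len(flagged) if flagged else 0.0
        table.append((t, precision))
    return table

def normalize_score(raw_score, table):
    """Replace a raw score with the precision expected at that score, so
    policy thresholds keep their meaning across model versions."""
    best = 0.0
    for t, precision in table:  # thresholds assumed in ascending order
        if raw_score >= t:
            best = precision
    return best
```

With both Champion and Challenger emitting normalized scores, the same policy threshold (e.g., "act above 0.9 precision") works unchanged for either version.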

When applying this method to new versions of the model, the policy can continue to expect the same precision from the two models. The question left is, “Which one of them better delivers recall on live traffic?”

At Intuit, we apply this practice via Glassifier, a Python package we developed to normalize model scores. Glassifier makes your model scores transparent to the policy application, allowing a swift transition from Champion to Challenger without changing the action trigger.

All set? Now you’ve cleared your time to focus on a new challenge: making more customers happy and powering prosperity around the world with AI. Go get them!

Artificial intelligence: Helping companies succeed

AI-based services help companies perform tasks faster and better. The idea is to reduce the need for human effort and direct that effort toward other business-growing functions. Reducing the amount of time needed to maintain the services is key, and the information supplied today can help you do just that.

To learn more about artificial intelligence, check out Shir Meir Lador’s advice on avoiding conflicts and delays in the AI development process, Part I and Part II.