Machine Learning System Ticket Examples
Monitoring an enterprise-deployed machine learning solution requires a stack capable of observing multiple systems and processes simultaneously.
CtiPath’s monitoring stack categorizes alerts by experience area to aid in prioritizing and troubleshooting issues. (See “Enterprise-ready MLOps (eMLOps): Considering the machine learning (ML) experience areas”.)
Each of the examples below is a support ticket from CtiPath’s monitoring stack. Observe how the system indicates the effect on each “experience area”.
Example Support Tickets for a Machine Learning System:
Sample Ticket for Data Drift Detection
Error: 2024-09-24 12:05:37,123 – **drift monitor** – ERROR – Data drift detection failed for dataset ‘sales_data.csv’. Traceback (most recent call last): File…
Description: During a scheduled data drift detection using **drift monitor** for our ‘sales_data.csv’ dataset, the system reported a failure due to a type mismatch for the ‘price’ feature. The current dataset has ‘float’ values for the ‘price’ feature, while the reference dataset has ‘int’ values. This caused the data drift calculation to fail, halting the generation of the report.
Experiences Affected:
1. User Experience: Users may experience incorrect or outdated model predictions as the model might not adapt well to changes in the data distribution due to undetected data drift. This could lead to inaccurate outputs, potentially decreasing user satisfaction and trust.
2. Business Experience: Business decision-making could be compromised if insights from models using this dataset become unreliable due to unnoticed data drift. This could affect sales forecasting, pricing strategies, or other key business metrics derived from the model.
3. Technology Operations Experience: The operations team may encounter delays in scheduled monitoring processes. Without the drift detection report, it becomes difficult to determine whether the deployed models are still operating within acceptable performance boundaries, which could lead to maintenance overhead or increased downtime.
4. Data Operations Experience: The data team is directly impacted by the type mismatch issue. They will need to manually inspect and correct data type inconsistencies between datasets. This could slow down pipeline efficiency and necessitate more rigorous data validation steps.
Troubleshooting:
1. Review Data Types: Inspect the current and reference datasets to verify the data type of the ‘price’ feature. Ensure that both datasets have consistent types (e.g., both should be ‘float’ or both should be ‘int’).
2. Data Preprocessing: If needed, apply data preprocessing to convert the ‘price’ feature in one of the datasets to match the type in the other dataset. Casting the reference’s integer values to float is usually the safer direction, since widening the type avoids losing information.
3. Re-run Data Drift Detection: After resolving the data type mismatch, re-run the data drift detection process to verify if the issue is resolved.
4. Implement Type Validation: To prevent future occurrences, implement a validation step in the data pipeline to ensure consistent data types across current and reference datasets before running drift detection (a minimal sketch follows this list).
5. Check **drift monitor** Configuration: Verify the **drift monitor** configuration settings for any additional parameters that could help handle type mismatches more gracefully in the future.
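As a starting point for steps 1, 2, and 4, the sketch below checks dtype consistency between the two datasets and widens the ‘price’ column before drift detection is re-run. It assumes both datasets are CSV files loadable with pandas; the reference file name is hypothetical.

```python
import pandas as pd

REFERENCE_PATH = "sales_data_reference.csv"  # hypothetical path to the reference dataset
CURRENT_PATH = "sales_data.csv"              # current dataset named in the ticket

def dtype_mismatches(reference: pd.DataFrame, current: pd.DataFrame) -> list[str]:
    """List columns whose dtypes differ between the reference and current datasets."""
    shared = reference.columns.intersection(current.columns)
    return [
        f"{col}: reference={reference[col].dtype}, current={current[col].dtype}"
        for col in shared
        if reference[col].dtype != current[col].dtype
    ]

reference = pd.read_csv(REFERENCE_PATH)
current = pd.read_csv(CURRENT_PATH)

mismatches = dtype_mismatches(reference, current)
if mismatches:
    print("Dtype mismatches found:", mismatches)
    # For the 'price' case in the ticket, widening the reference to float avoids losing data.
    reference["price"] = reference["price"].astype("float64")

# With dtypes aligned, the drift detection job can be re-run on these frames.
```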
Sample Ticket for Bias Detection
Error: 2024-09-24 14:22:49,456 – **bias detector** – ERROR – Bias detection failed for dataset ‘customer_data.csv’.
Description: During a scheduled bias detection run for the ‘customer_data.csv’ dataset using **bias detector**, the system encountered an error and was unable to complete the bias analysis. The error message suggests that the bias detection process failed, potentially due to a missing or incorrectly formatted protected feature. This failure prevents us from assessing whether the deployed model exhibits any bias based on key demographic or sensitive attributes.
Experiences Affected:
1. User Experience: Users may be negatively impacted if undetected bias is influencing model decisions. This could result in unfair or biased treatment for certain user groups, eroding trust and leading to potential ethical concerns or complaints.
2. Business Experience: Bias in machine learning models can expose the business to legal and reputational risks, especially in domains like customer service, hiring, or financial services. Without proper bias detection, the company risks unknowingly making biased decisions that may harm brand image and customer relationships, leading to revenue loss or regulatory scrutiny.
3. Technology Operations Experience: The technology operations team depends on bias detection reports to ensure that models comply with ethical AI standards and performance benchmarks. Failure to detect bias complicates monitoring efforts and could necessitate more manual interventions to ensure fairness, adding to operational overhead.
4. Data Operations Experience: Data engineers and data scientists will need to review the data pipeline and the protected features involved in bias detection. The failure may be due to a missing or misconfigured feature (e.g., gender, age), requiring them to investigate and correct data sources. This leads to delays in completing bias assessments and could introduce inconsistencies in the model’s evaluation processes.
Troubleshooting:
1. Check Protected Features: Verify if the required protected feature (e.g., gender, age) is present and properly formatted in the dataset. Ensure the feature used for bias detection is included in both the training and production datasets.
2. Inspect Data Quality: Review the dataset for any missing or null values, incorrect data types, or inconsistencies in the protected features. This could involve checking for common preprocessing issues that may have led to the feature being dropped or misinterpreted by the system.
3. Reconfigure **bias detector**: Check the configuration settings in **bias detector** to ensure the correct protected feature is specified for bias detection. Make any necessary adjustments to the pipeline to ensure these features are being evaluated correctly.
4. Re-run Bias Detection: After addressing the protected feature issue, re-run the bias detection to generate the required report. This will help determine if the error was resolved and if the model is compliant with fairness requirements.
5. Establish Automated Feature Validation: Set up an automated validation process in the data pipeline to ensure that protected features are always present and correctly formatted before running bias detection (a minimal sketch follows this list). This can prevent future errors and ensure continuous compliance with bias monitoring.
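The sketch below illustrates such a pre-check in pandas. The protected feature names and expected dtypes are assumptions for illustration; substitute the attributes actually configured for **bias detector**.

```python
import pandas as pd

DATASET_PATH = "customer_data.csv"          # dataset named in the ticket
PROTECTED_FEATURES = ["gender", "age"]      # assumed protected attributes; adjust as needed
EXPECTED_DTYPES = {"gender": "object", "age": "int64"}  # assumed expected types

def validate_protected_features(df: pd.DataFrame) -> list[str]:
    """Collect problems that would cause a bias-detection run to fail."""
    issues = []
    for feature in PROTECTED_FEATURES:
        if feature not in df.columns:
            issues.append(f"missing column: {feature}")
            continue
        if df[feature].isnull().any():
            issues.append(f"null values in column: {feature}")
        expected = EXPECTED_DTYPES.get(feature)
        if expected and str(df[feature].dtype) != expected:
            issues.append(f"unexpected dtype for {feature}: {df[feature].dtype}")
    return issues

issues = validate_protected_features(pd.read_csv(DATASET_PATH))
if issues:
    raise ValueError(f"Bias detection pre-check failed: {issues}")
# Otherwise, hand the validated dataset to the bias-detection job.
```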
Sample Ticket for Infrastructure Error
Error: Problem: High CPU utilization on server ‘web-app-server-01’
Description: **infrastructure monitor** has reported high CPU utilization on ‘web-app-server-01’, with usage consistently exceeding 90% for more than 5 minutes. This elevated CPU usage could lead to performance degradation across systems and applications hosted on this server. Immediate action is required to prevent system slowdowns, potential downtime, or failures.
Experiences Affected:
1. User Experience: Users may face slower response times or timeouts when interacting with applications hosted on the affected server. This can lead to frustration, reduced satisfaction, and possibly affect overall user engagement and trust in the system.
2. Business Experience: High CPU utilization can impact business-critical applications, potentially leading to delays in transactions, reports, or service delivery. This could affect business operations and lead to missed opportunities, negatively impacting revenue and productivity.
3. Technology Operations Experience: The technology operations team will face increased pressure to resolve the issue swiftly to maintain system stability. The high CPU load may prevent proper monitoring, backup, or maintenance activities, impacting the server’s overall reliability and leading to increased operational overhead.
4. Data Operations Experience: For data operations, high CPU utilization can slow down or interrupt critical data pipelines, ETL processes, or data analysis jobs running on the server. This can delay data processing, reporting, and insights, reducing the efficiency and reliability of the data-driven workflows.
Troubleshooting:
1. Check Running Processes: Review the running processes on ‘web-app-server-01’ to identify resource-heavy or stuck processes that may be consuming excessive CPU (a minimal sketch follows this list).
2. Review Application Logs: Analyze the application logs for any anomalies or errors that may explain the increase in CPU usage, such as inefficient queries, memory leaks, or high request volumes.
3. Scale Resources: If the CPU load is due to high traffic, consider scaling up the server resources or adding additional server instances to balance the load.
4. Restart Critical Services: Restart services that may be stuck in resource-consuming loops or have memory leaks, after evaluating their impact on the running applications.
5. Monitor CPU Usage Trends: Implement additional monitoring to track CPU usage patterns and correlate with specific times, user actions, or processes, helping to diagnose and prevent future issues.
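For step 1, a short Python script can surface the heaviest CPU consumers without waiting on interactive tools. This is a minimal sketch that assumes the third-party psutil library can be installed on the server.

```python
import time

import psutil  # third-party; install with `pip install psutil`

def top_cpu_processes(sample_seconds: float = 2.0, limit: int = 10):
    """Sample per-process CPU usage and return the heaviest consumers."""
    procs = list(psutil.process_iter(["pid", "name"]))
    for proc in procs:
        try:
            proc.cpu_percent(None)  # prime the counter; the first call always returns 0.0
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    time.sleep(sample_seconds)
    usage = []
    for proc in procs:
        try:
            usage.append((proc.cpu_percent(None), proc.info["pid"], proc.info["name"]))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return sorted(usage, reverse=True)[:limit]

if __name__ == "__main__":
    for cpu, pid, name in top_cpu_processes():
        print(f"{cpu:6.1f}%  pid={pid:<8} {name}")
```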
Sample Ticket for Application Error
Error: [2024-09-24 15:32:11] [error] [client xxx.xxx.xxx.xxx] Request exceeded the limit of 10 internal redirects due to probable configuration error.
Description: **application monitor** has reported a web application error in which the server exceeded the limit of 10 internal redirects for a client request. This error is typically caused by a misconfiguration in the URL rewrite rules or a recursive redirect loop in the server configuration. The request originated from the client IP xxx.xxx.xxx.xxx, and the error suggests a potential problem in the .htaccess file or the web server configuration.
Experiences Affected:
1. User Experience: Users may encounter issues accessing specific pages of the web application. This could manifest as errors such as “Too many redirects” in their browser, leading to frustration and an inability to complete tasks like browsing, logging in, or making transactions.
2. Business Experience: If the error impacts a critical user-facing section of the site (e.g., product pages or checkout), it could lead to lost sales, reduced customer engagement, and harm to the company’s reputation. Business metrics such as conversion rates may be affected due to site unavailability or performance issues.
3. Technology Operations Experience: The technology operations team must address this issue quickly, as it affects the availability and functionality of the web application. This issue also impacts server performance, as multiple redirects can increase the load on the server, leading to potential downtime and higher operational costs.
4. Data Operations Experience: If this issue affects data collection points (e.g., APIs for logging user actions or transactions), it could disrupt the flow of data into databases or analytics systems. This could result in incomplete or missing data, skewing reports and insights that depend on accurate tracking.
Troubleshooting:
1. Check Rewrite Rules: Inspect the .htaccess or server configuration files for any recursive rewrite rules. These rules may be causing the redirection loop.
2. Limit Internal Recursion: Consider increasing the LimitInternalRecursion setting temporarily to prevent this error from occurring while the root cause is being investigated. However, this is only a stop-gap solution.
3. Review Server Logs: Analyze additional error and access logs in **application monitor** to identify specific URL patterns or requests that triggered the excessive redirects. This will help pinpoint the problematic rule or configuration.
4. Test Fixes in Staging: Apply any changes to the rewrite rules in a staging environment to ensure they resolve the issue without causing new problems (a probe like the sketch after this list can help). Once validated, roll out the fix to production.
5. Monitor System Performance: After the fix, monitor the server’s CPU and memory usage to ensure the issue is resolved without negatively impacting performance.
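For step 4, a small probe script can confirm that a staging URL no longer loops before the fix reaches production. The sketch below uses the third-party requests library and a hypothetical staging URL; it follows redirects one hop at a time and also flags the HTTP 500 that the server returns once the internal recursion limit is hit.

```python
from urllib.parse import urljoin

import requests  # third-party; install with `pip install requests`

STAGING_URL = "https://staging.example.com/path-under-test"  # hypothetical staging endpoint
MAX_REDIRECTS = 10  # mirrors the server's internal redirect limit from the ticket

def probe_redirects(url: str, max_hops: int = MAX_REDIRECTS) -> None:
    """Follow HTTP redirects one hop at a time and flag loops or server-side recursion errors."""
    seen = set()
    for hop in range(max_hops + 1):
        response = requests.get(url, allow_redirects=False, timeout=10)
        if response.status_code == 500:
            print(f"hop {hop}: HTTP 500 at {url} - internal recursion limit may still be exceeded")
            return
        if response.status_code not in (301, 302, 303, 307, 308):
            print(f"hop {hop}: final status {response.status_code} at {url}")
            return
        next_url = urljoin(url, response.headers.get("Location", ""))
        if next_url in seen:
            print(f"redirect loop detected at: {next_url}")
            return
        seen.add(next_url)
        url = next_url
    print(f"gave up after {max_hops} redirects - rewrite rules likely still loop")

probe_redirects(STAGING_URL)
```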