
Monday 20 November 2023

Evaluating the success of an AI&ML use case

The data science team has finished developing the current version of the ML model and has reported an accuracy or error metric. But you are not sure how to put that number in context: is it good, or not good enough?

In one of my previous blogs, I addressed the question of AI investment and how long it takes before the business can tell whether an engagement has potential or is going nowhere. This blog can be considered an extension of that one. If you haven’t read it already, please visit: https://anirbandutta-ideasforgood.blogspot.com/2023/07/investment-on-developing-ai-models.html

In that blog, I spoke about KPIs such as accuracy and error as rules of thumb for quickly assessing the potential success of a use case. In this blog, I will try to make that assessment more specific and relative to a point of reference.

Fundamentally, there are three ways to evaluate the latest performance KPI of your AI & ML model, used independently or in combination.

Human-level performance metric

For AI use cases whose primary objective is replacing human effort, this can be considered the primary success metric. For example, if the current human error rate for a particular process stands at 5%, and the AI achieves an error rate of 5% or less, the model can be considered valuable. An AI with the same error rate also brings with it smart automation, a faster process, negligible downtime, and so on.

Example: Data-entry tasks can easily be replicated by AI. But the success criterion for adoption does not need to be 100% accuracy; the AI merely has to match the accuracy the human counterpart was delivering to be adopted for real-world deployment.
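A minimal sketch of this comparison, assuming an illustrative 5% human error rate and made-up labels and predictions:

# Minimal sketch: compare a model's error rate against a human baseline.
# The 5% human error rate, labels, and predictions are illustrative assumptions.

def error_rate(y_true, y_pred):
    """Fraction of predictions that disagree with the ground truth."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

HUMAN_ERROR_RATE = 0.05                    # assumed error rate of the manual process

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]    # ground-truth labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]    # model predictions (illustrative)

model_error = error_rate(y_true, y_pred)
print(f"Model error {model_error:.1%} vs human baseline {HUMAN_ERROR_RATE:.1%}")

if model_error <= HUMAN_ERROR_RATE:
    print("Matches or beats human-level performance -> candidate for adoption.")
else:
    print("Still worse than human-level performance -> keep iterating.")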

Base model metric

For use cases where the problem being addressed is more theoretical in nature, or where the discovery of the business problem to be addressed is still in progress, it is best to create a quick, simple base model and then try to improve on it with each iteration.

For example: I am currently working on a system to determine whether a piece of content was created by AI. Lacking any past reference against which the accuracy can be compared, I have taken this approach to track progress.
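A minimal sketch of the baseline-first approach, using scikit-learn and synthetic data purely for illustration (not the actual AI-content detector described above):

# Minimal sketch: establish a trivial baseline first, then judge every
# iteration against it. Uses scikit-learn; the synthetic data is illustrative.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Iteration 0: a quick, simple base model (predicts the most frequent class).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Iteration 1: a simple real model; each later iteration should beat the last.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Baseline accuracy:    {baseline_acc:.1%}")
print(f"Iteration 1 accuracy: {model_acc:.1%} (progress {model_acc - baseline_acc:+.1%})")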

Satisficing & optimizing metric

We define both a metric that we want the model to do as well as possible on (the optimizing metric) and a minimum standard the model must meet to be functional and valuable in real-life scenarios (the satisficing metric).

Example: For a home voice assistant, the optimizing metric would be the accuracy with which the model hears exactly what someone said. The satisficing metric would be that the model takes no more than 100 ms to process what was said.
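A minimal sketch of selecting among candidate models with an optimizing metric (accuracy) under a satisficing constraint (latency); the candidate names and numbers are made up:

# Minimal sketch: choose a model by the optimizing metric (accuracy) subject to
# a satisficing constraint (latency <= 100 ms). All numbers are illustrative.

candidates = [
    {"name": "small_model",  "accuracy": 0.91, "latency_ms": 40},
    {"name": "medium_model", "accuracy": 0.94, "latency_ms": 95},
    {"name": "large_model",  "accuracy": 0.96, "latency_ms": 180},  # too slow
]

MAX_LATENCY_MS = 100   # satisficing metric: must be met, no extra credit beyond it

feasible = [c for c in candidates if c["latency_ms"] <= MAX_LATENCY_MS]
best = max(feasible, key=lambda c: c["accuracy"])   # optimizing metric: maximize

print(f"Selected {best['name']}: accuracy {best['accuracy']:.0%}, "
      f"latency {best['latency_ms']} ms")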

Wednesday 18 October 2023

AI TRUST & ADOPTION – THE METRICS TO MONITOR

 

Trust is critical to AI adoption. As the next generation of AI models is deployed more widely, building trust in these systems becomes even more vital and more difficult. For example, for all the amazing capabilities Generative AI and LLMs are delivering, they bring with them the trouble of being larger, more complex, and more opaque than ever. This makes identifying the right metrics, and continuously monitoring and reporting them, imperative.

Below are some of the most critical metrics that every organization and business should be continuously monitoring and be able to report as and when necessary.

DATA

•	Date of instances
•	Date processed
•	Owner & steward
•	Who created it?
•	Who funded it?
•	Who’s the intended user?
•	Who’s accountable?
•	What do instances (i.e., rows) represent?
•	How many instances are there?
•	Is it all of them or was it sampled?
•	How was it sampled?
•	How was it collected?
•	Are there any internal or external keys?
•	Are there target variables?
•	Descriptive statistics and distributions of important and sensitive variables
•	How often is it updated?
•	How long are old instances retained?
•	Applicable regulations (e.g., HIPAA)

 

MODELS

•	Date trained
•	Owner & steward
•	Who created it?
•	Who funded it?
•	Who’s the intended user?
•	Who’s accountable?
•	What do instances (i.e., rows) represent?
•	What does it predict?
•	Features
•	Description of its training & validation data sets
•	Performance metrics
•	When was it trained?
•	How often is it retrained?
•	How long are old versions retained?
•	Ethical and regulatory considerations
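A minimal sketch of how such metadata could be captured as a machine-readable model card (a dataset datasheet would look analogous); the field names and values are illustrative assumptions, not a standard schema:

# Minimal sketch: a machine-readable model card covering the fields above.
# Field names and values are illustrative assumptions, not a standard schema.
import json

model_card = {
    "date_trained": "2023-10-01",
    "owner_and_steward": "Data Science Platform Team",
    "created_by": "Churn-model working group",
    "funded_by": "Customer Retention programme",
    "intended_user": "Retention analysts",
    "accountable": "Head of Analytics",
    "instances_represent": "one customer account per row",
    "predicts": "probability of churn within 90 days",
    "features": ["tenure_months", "monthly_spend", "support_tickets"],
    "training_and_validation_data": "customer snapshots 2021-2023 (see dataset datasheet)",
    "performance_metrics": {"auc": 0.87, "accuracy": 0.81},
    "retraining_frequency": "quarterly",
    "old_versions_retained_for": "24 months",
    "ethical_and_regulatory_notes": "protected attributes excluded from features",
}

print(json.dumps(model_card, indent=2))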

 

Bias remains one of the most difficult KPIs to define and measure. Hence, I am excited to have found some measures that can help quantify the presence of bias in some form; a short sketch after the list illustrates how a few of them could be computed.

  •          Demographic representation: Does the dataset have the same distribution of sensitive subgroups as the target population?
  •          Demographic parity: Are model prediction averages about the same overall and for sensitive subgroups? For example, if we’re predicting the likelihood of paying a phone bill on time, does the model predict about the same pay rate for men and women? A t-test, Wilcoxon test, or bootstrap test could be used.
  •          Equalized odds: For Boolean classifiers that predict true or false, are the true positive and false positive rates about the same for sensitive subgroups? For example, is the model more accurate for young adults than for the elderly?
  •          Equality of opportunity: Like equalized odds, but only checks the true positive rate.
  •          Average odds difference: The average of the difference in false positive rates and the difference in true positive rates between a sensitive subgroup and the rest of the population.
  •          Odds ratio: Positive outcome rate divided by the negative outcome rate. For example, (likelihood that men pay their bill on time) / (likelihood that men don’t pay their bill on time), compared with the same ratio for women.
  •          Disparate impact: Ratio of the favorable prediction rate for a sensitive subgroup to that of the overall population.
  •          Predictive rate parity: Is model accuracy about the same for different sensitive subgroups? Accuracy can be measured by precision, F-score, AUC, mean squared error, etc.
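A minimal sketch computing a few of these metrics from illustrative labels, predictions, and a sensitive attribute (all values made up):

# Minimal sketch: a few of the bias metrics above, from illustrative data.
# "group" is the sensitive attribute; every value here is made up.

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1]
group  = ["a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]

def rates(g):
    """Predicted-positive rate, true positive rate, false positive rate for one group."""
    idx = [i for i, grp in enumerate(group) if grp == g]
    pos_rate = sum(y_pred[i] for i in idx) / len(idx)
    tp = sum(1 for i in idx if y_true[i] == 1 and y_pred[i] == 1)
    fp = sum(1 for i in idx if y_true[i] == 0 and y_pred[i] == 1)
    p = sum(1 for i in idx if y_true[i] == 1)
    n = sum(1 for i in idx if y_true[i] == 0)
    return pos_rate, tp / p if p else 0.0, fp / n if n else 0.0

pos_a, tpr_a, fpr_a = rates("a")
pos_b, tpr_b, fpr_b = rates("b")
overall_pos = sum(y_pred) / len(y_pred)

print(f"Demographic parity difference (a - b): {pos_a - pos_b:+.2f}")
print(f"Disparate impact (group b vs overall): {pos_b / overall_pos:.2f}")
print(f"Equalized odds gaps (a - b): TPR {tpr_a - tpr_b:+.2f}, FPR {fpr_a - fpr_b:+.2f}")
print(f"Average odds difference (a - b): {((fpr_a - fpr_b) + (tpr_a - tpr_b)) / 2:+.2f}")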

But in applying all of the above, we must be very sensitive to, and cognizant of, the business and social context when identifying the “sensitive subgroups” these metrics refer to.

This is by no means an exhaustive list, only a start towards a safer and fairer digital ecosystem. I will do my best to keep consolidating new information here.

 

Thanks to Dataiku: some of this information was collected from the Dataiku report “How to Build Trustworthy AI Systems.”