Trust is critical to AI adoption.
As the next generation of AI models is deployed more widely, building trust in these systems becomes both more vital and more difficult. For all the remarkable capabilities that generative AI and LLMs deliver, these models are also larger, more complex, and more opaque than ever. This makes it imperative to identify the right metrics and to monitor and report them continuously.
Below are some of the most critical metrics that every organization and business should monitor continuously and be able to report whenever necessary.
DATA
• Date of instances
• Date processed
• Owner & steward
• Who created it?
• Who funded it?
• Who’s the intended user?
• Who’s accountable?
• What do instances (i.e., rows) represent?
• How many instances are there?
• Is it all of them or was it sampled?
• How was it sampled?
• How was it collected?
• Are there any internal or external keys?
• Are there target variables?
• Descriptive statistics and distributions of important and sensitive variables
• How often is it updated?
• How long are old instances retained?
• Applicable regulations (e.g., HIPAA)
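One lightweight way to operationalize this checklist is to keep the answers alongside the data as a structured, machine-readable record. Below is a minimal sketch in Python; the class and field names are illustrative assumptions rather than a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DatasetSheet:
    """Illustrative datasheet for a dataset; the fields mirror the checklist above."""
    name: str
    instance_dates: str                  # date range the instances cover
    date_processed: str
    owner: str
    steward: str
    created_by: str
    funded_by: str
    intended_user: str
    accountable_party: str
    instance_description: str            # what a row represents
    n_instances: int
    is_sample: bool
    sampling_method: Optional[str]
    collection_method: str
    update_frequency: str
    retention_policy: str                # how long old instances are retained
    keys: List[str] = field(default_factory=list)             # internal/external keys
    target_variables: List[str] = field(default_factory=list)
    regulations: List[str] = field(default_factory=list)      # e.g. ["HIPAA"]
```

A record like this can be serialized to JSON or YAML and versioned alongside the data, with descriptive statistics for important and sensitive variables attached as a separate artifact.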
MODELS
• Date trained
• Owner & steward
• Who created it?
• Who funded it?
• Who’s the intended user?
• Who’s accountable?
• What do instances (i.e., rows) represent?
• What does it predict?
• Features
• Description of its training & validation data sets
• Performance metrics
• When was it trained?
• How often is it retrained?
• How long are old versions retained?
• Ethical and regulatory considerations
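The same approach works for models: a simple model card record that travels with each trained version. Again, this is only a sketch, and the field names are assumptions rather than an established standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelCard:
    """Illustrative model card; the fields mirror the checklist above."""
    name: str
    date_trained: str
    owner: str
    steward: str
    created_by: str
    funded_by: str
    intended_user: str
    accountable_party: str
    instance_description: str                  # what a row represents
    prediction_target: str                     # what the model predicts
    retraining_frequency: str
    version_retention: str                     # how long old versions are retained
    ethical_and_regulatory_notes: str
    features: List[str] = field(default_factory=list)
    training_data: str = ""                    # reference to the training DatasetSheet
    validation_data: str = ""                  # reference to the validation DatasetSheet
    performance_metrics: Dict[str, float] = field(default_factory=dict)  # e.g. {"AUC": 0.87}
```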
Bias remains one of the most difficult KPIs to define and measure. Hence, I am excited to have found some measures that can contribute to measuring the presence of bias in some form (a rough sketch of how a few of them could be computed follows the list):
- Demographic representation: Does a dataset have
the same distribution of sensitive subgroups as the target population?
- Equality of opportunity: Like equalized odds,
but only checks the true positive rate.
- Average odds difference: The average of the difference in false positive rates and the difference in true positive rates between a sensitive subgroup and the rest of the population.
- Demographic parity: Are model prediction
averages about the same overall and for sensitive subgroups? For example, if
we’re predicting the likelihood to pay a phone bill on time, does it predict
about the same pay rate for men and women? A t-test, Wilcoxon test, or
bootstrap test could be used.
- Equalized odds: For Boolean classifiers that
predict true or false, are the true positive and false positive rates about the
same for sensitive subgroups? For example, is it more accurate for young adults
than for the elderly?
- Odds ratio: Positive outcome rate divided by the
negative outcome rate. For example, (likelihood that men pay their bill on
time) / (likelihood that men don’t pay their bill on time) compared to that for
women.
- Disparate impact: Ratio of the favorable
prediction rate for a sensitive subgroup to that of the overall population.
- Predictive rate parity: Is model accuracy about
the same for different sensitive subgroups? Accuracy can be measured by things
such as precision, F-score, AUC, mean squared error, etc.
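As a rough illustration, several of these checks can be computed directly from a model's predictions and a sensitive attribute. The sketch below uses NumPy and SciPy with hypothetical variable names; in practice, dedicated fairness libraries such as Fairlearn or AIF360 provide vetted implementations of these metrics.

```python
import numpy as np
from scipy import stats

def rates(y_true, y_pred):
    """True positive and false positive rates for binary labels/predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = y_pred[y_true == 1].mean() if (y_true == 1).any() else np.nan
    fpr = y_pred[y_true == 0].mean() if (y_true == 0).any() else np.nan
    return tpr, fpr

def fairness_report(y_true, y_pred, sensitive):
    """Compare a sensitive subgroup (sensitive == 1) against the rest of the population."""
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    grp, rest = sensitive == 1, sensitive == 0

    # Demographic parity: difference in mean predicted positive rate between groups.
    dp_diff = y_pred[grp].mean() - y_pred[rest].mean()

    # A simple significance check for demographic parity (t-test on the predictions).
    ttest = stats.ttest_ind(y_pred[grp], y_pred[rest], equal_var=False)

    # Equalized odds: compare true positive and false positive rates across groups.
    tpr_g, fpr_g = rates(y_true[grp], y_pred[grp])
    tpr_r, fpr_r = rates(y_true[rest], y_pred[rest])
    equal_opportunity_diff = tpr_g - tpr_r                      # TPR gap only
    avg_odds_diff = 0.5 * ((fpr_g - fpr_r) + (tpr_g - tpr_r))   # mean of FPR and TPR gaps

    # Disparate impact: subgroup positive prediction rate / overall positive prediction rate.
    disparate_impact = y_pred[grp].mean() / y_pred.mean()

    return {
        "demographic_parity_diff": dp_diff,
        "demographic_parity_p_value": ttest.pvalue,
        "equal_opportunity_diff": equal_opportunity_diff,
        "average_odds_diff": avg_odds_diff,
        "disparate_impact": disparate_impact,
    }

# Toy example with hard 0/1 predictions and a binary sensitive attribute.
y_true    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred    = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
sensitive = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(fairness_report(y_true, y_pred, sensitive))
```

In practice, checks like these would be run per protected attribute and tracked over time alongside the standard performance metrics listed above.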
Considering all of the above, we must remain mindful of the business and social context when identifying the “sensitive subgroups” mentioned above.
This is by no means an exhaustive list, but only a start toward a safer and fairer digital ecosystem. I will do my best to consolidate new information as it becomes available.
Thanks to Dataiku; some of this information was collected from the Dataiku report “How to build trustworthy AI systems.”