As banks, fintechs, regulators and consumer advocates debate the benefits and risks of using artificial intelligence in lending decisions, one point of contention has emerged: Does there have to be a tradeoff between accuracy and fairness?
That point came up in the course of an independent analysis of AI-based loan software provider Upstart Network, but it nonetheless applies to all banks, credit unions and fintechs that use AI models in their lending decisions.
From 2020 through 2024, law firm Relman Colfax monitored Upstart’s fair lending efforts at the behest of the NAACP Legal Defense Fund and the Student Borrower Protection Center. In a final report published earlier this year, Relman Colfax said Upstart made a considerable effort to ensure its lending models are fair.
However, the report found that the parties came to an impasse at one juncture, when Relman Colfax thought Upstart could tweak its model to approve more loans to disadvantaged groups, but Upstart said making that change would diminish the model’s accuracy.
“This issue is critical,” the report said. “If a fair lending testing regime is designed around the assumption that a less discriminatory alternative model cannot be viable unless its performance is exactly equal to a baseline model on a chosen performance metric (regardless of uncertainty associated with that metric), less discriminatory models may rarely, if ever, be adopted.”
Upstart voluntarily paid for all the work with Relman Colfax “to make sure that we were turning over every rock that we could in trying to find the fairest version of our credit risk models,” said Nat Hoopes, head of public policy at Upstart, in an interview.
Upstart and Relman Colfax had the same goal, Hoopes said.
“We think that if you can put in place a less discriminatory alternative model, you should do so,” he said. He added that Upstart came up with a different change that improved its model’s fairness without degrading accuracy, and noted that there is no one prescribed method for this.
“This is the thorniest, most difficult question in algorithmic fairness right now,” said Kareem Saleh, founder and CEO of FairPlay AI, a company that conducts fairness testing on AI models. “What you see in the impasse is actually two really thoughtful groups trying to grapple with this question.”
Relman Colfax takes the view that if the model accuracy bar is set too high, no one will ever use a less discriminatory alternative and the bias baked into the existing system will be encoded in AI-based lending models.
Upstart, on the other hand, sells its loans to banks that expect a high degree of confidence in risk model performance. If the model accuracy bar is set too low, Upstart won’t be able to raise money or sell its loans.
“There’s no easy, obvious answer to this question, frankly,” Saleh said. “This is the kind of thing about which reasonable people could disagree and it’s not a question where I think the regulators will save us.”
Several experts interviewed for this article said that if regulators get prescriptive about loan decision model accuracy and fairness metrics, then lenders will just try to game the system to meet those metrics.
The search for less discriminatory alternatives
The doctrine of disparate impact requires that when banks use models, AI-based or not, in lending decisions, they search for less discriminatory alternatives.
Searching for a less discriminatory model doesn’t necessarily mean going out and looking for a different model that’s less discriminatory. Generally speaking, it means experimenting with existing loan decision models and changing variables or the weighting of variables to see if they will increase loan approvals for protected classes.
For instance, if a loan decision model puts a heavy weight on consistency of employment, then women who left the workforce for a few years to raise children are more likely to be denied loans. To make that model less discriminatory, the model’s weighting of employment consistency could be reduced.
Or if a model heavily weighs whether or not a potential borrower has a bank account, that will negatively impact Black people who live in bank deserts and don’t have bank accounts, and who end up using check cashers and prepaid cards. Here again, modifying that variable or its weighting could help lenders approve more loans.
How much effort a lender needs to make to ensure its models are inclusive and fair is not prescribed by any existing law or regulator. Nor is it spelled out how much consideration a bank can reasonably give to model accuracy.
“There’s a lot of squishiness in terms of what it means to reasonably meet a business’s legitimate business interest,” said Stephen Hayes, partner at Relman Colfax, in an interview. “It requires defining what that interest is, having some method for figuring out whether your model is furthering that interest and then thinking about whether alternatives would comparably serve that interest but cause fewer disparities.”
Testing AI models for fairness and accuracy
Testing AI models for fairness and accuracy is an art, not a science. AI-based lending decision models are tested for accuracy by back-testing them on past loan decisions — if this adjusted model had been used back then, who would have gotten a loan and who wouldn’t have, and how would those loans have performed? In other words, how well would this model have predicted which borrowers defaulted?
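To make the idea of back-testing concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the loan records are invented, the `score` function is a toy blend of job tenure and payment history, and the 0.7 cutoff is arbitrary. It does not represent Upstart's, Zest's or any lender's actual model. It simply replays two weightings of employment consistency against the same past loans and reports how many would have been approved and how those approvals performed.

```python
# Hypothetical past loans with known outcomes; every value here is invented.
PAST_LOANS = [
    {"employment_years": 10, "on_time_share": 1.0, "defaulted": False},
    {"employment_years": 8, "on_time_share": 0.5, "defaulted": True},
    {"employment_years": 2, "on_time_share": 1.0, "defaulted": False},
    {"employment_years": 1, "on_time_share": 0.9, "defaulted": False},
    {"employment_years": 9, "on_time_share": 0.9, "defaulted": False},
    {"employment_years": 1, "on_time_share": 0.3, "defaulted": True},
]


def score(loan, employment_weight):
    """Toy score in [0, 1]: a weighted mix of job tenure and payment history."""
    tenure = min(loan["employment_years"], 10) / 10
    return employment_weight * tenure + (1 - employment_weight) * loan["on_time_share"]


def backtest(loans, employment_weight, cutoff=0.7):
    """Replay a weighting on past loans: who would have been approved,
    and what share of those approvals later defaulted?"""
    approved = [loan for loan in loans if score(loan, employment_weight) >= cutoff]
    defaults = sum(loan["defaulted"] for loan in approved)
    default_rate = defaults / len(approved) if approved else None
    return {"approved": len(approved), "default_rate": default_rate}


# Baseline leans heavily on employment consistency; the adjusted version does not.
baseline = backtest(PAST_LOANS, employment_weight=0.8)
adjusted = backtest(PAST_LOANS, employment_weight=0.2)
print("baseline:", baseline)
print("adjusted:", adjusted)
```

On this invented data the baseline approves three loans, one of which defaulted, while the adjusted weighting approves four, none of which defaulted. That says nothing about real portfolios. A real back-test also has to deal with applicants who were never funded and so have no observed outcome, which this sketch ignores.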
Once any lending model is used in real life, its accuracy, meaning its ability to correctly predict default, will vary with changes in populations, the economy and other factors. For instance, the models Wall Street quants used to predict mortgage loan performance in the years leading up to the mortgage crisis failed. The quants did not anticipate the speed and recklessness with which large lenders like Countrywide would make subprime mortgage loans to feed the demand of Wall Street firms that packaged up the loans and sold slices of the resulting collateralized debt obligations to investors. Nor did they anticipate the devastation that occurred when that house of cards fell down.
“Models will fail and on different populations they’ll be way off,” Hayes said. “If you act like the performance metric when you develop the model is the truth, and one half a percentage difference in that performance metric is going to be the difference between making a profit and not making a profit, you’re probably kidding yourself.”
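One way to look at the uncertainty Hayes describes is to resample the test data and see how much a small accuracy gap moves around. The sketch below is an illustration under stated assumptions, not a method taken from the Relman Colfax report: the 200 outcomes and both sets of predictions are synthetic, built so the baseline model is right on 170 loans and the alternative on 169, a gap of half a percentage point. The function name `bootstrap_accuracy_gap` is hypothetical. It uses a simple percentile bootstrap over the same loans for both models.

```python
import random


def accuracy(predictions, outcomes):
    """Share of loans where the predicted default flag matches the outcome."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)


def bootstrap_accuracy_gap(base_preds, alt_preds, outcomes, n_resamples=2000, seed=0):
    """Percentile-bootstrap 95% interval for (baseline accuracy - alternative accuracy)."""
    rng = random.Random(seed)
    n = len(outcomes)
    gaps = []
    for _ in range(n_resamples):
        # Resample loans with replacement; both models are scored on the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        base_acc = sum(base_preds[i] == outcomes[i] for i in idx) / n
        alt_acc = sum(alt_preds[i] == outcomes[i] for i in idx) / n
        gaps.append(base_acc - alt_acc)
    gaps.sort()
    lower = gaps[int(0.025 * n_resamples)]
    upper = gaps[int(0.975 * n_resamples) - 1]
    return lower, upper


# Synthetic data: 1 = defaulted, 0 = repaid. All values are invented.
outcomes = [1] * 40 + [0] * 160
base_preds = [0] * 30 + [1] * 10 + [0] * 160   # wrong on 30 loans
alt_preds = [1] * 40 + [0] * 129 + [1] * 31    # wrong on 31 different loans

observed_gap = accuracy(base_preds, outcomes) - accuracy(alt_preds, outcomes)
lower, upper = bootstrap_accuracy_gap(base_preds, alt_preds, outcomes)
print(f"observed gap: {observed_gap:.3f}")
print(f"95% bootstrap interval: [{lower:.3f}, {upper:.3f}]")
```

If the interval straddles zero, as it does for this synthetic sample, the data cannot distinguish the two models on that metric. Whether a lender or its loan buyers should treat such a gap as meaningful is the judgment call the two sides disagreed on.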
Yet lenders have a legitimate business need to make their models as accurate as possible, to avoid lending money to people who can’t repay.
“There hasn’t been any indication that prudential regulators are comfortable with sacrificing credit risk prediction, the accuracy of whether or not somebody’s going to default,” Hoopes said. “Those of us who lived through the subprime mortgage crisis remember the problems of, ‘Hey, we don’t even care whether somebody’s going to be able to repay this loan. We’re just trying to pump out volume.'”
Lenders already use models that fluctuate in accuracy, yet still meet their business needs, Hayes argued.
He would like to see banking regulators provide guidance on this question.
For large companies that lend to hundreds of thousands or millions of people, “small differences in terms of disparities they cause can open up a lot of opportunities for folks who wouldn’t otherwise have those opportunities,” Hayes said.
There are no prescribed tests for fairness, Hoopes said.
“A traditional test that is used is the adverse impact ratio,” he said. “There’s this notion of, what are your relative approvals of different groups, but there’s no requirement that your actual applicant pool look like America. So one lender could be incredibly inclusive and have a very inclusive applicant pool and they’re able to approve more Black and Hispanic borrowers than the traditional approach of only accepting applicants from traditional credit score metrics.”
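The adverse impact ratio Hoopes mentions is the approval rate for a protected group divided by the approval rate for a reference group. A minimal sketch follows, with made-up approval decisions and hypothetical function names. As Hoopes notes, the ratio says nothing about who applied in the first place.

```python
def approval_rate(decisions):
    """Share of applicants approved; decisions is a list of booleans."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(decisions) / len(decisions)


def adverse_impact_ratio(protected_decisions, reference_decisions):
    """Protected-group approval rate divided by reference-group approval rate."""
    reference_rate = approval_rate(reference_decisions)
    if reference_rate == 0:
        raise ValueError("reference group has no approvals; ratio is undefined")
    return approval_rate(protected_decisions) / reference_rate


# Invented example: 30 of 100 protected-group applicants approved,
# 50 of 100 reference-group applicants approved.
protected = [True] * 30 + [False] * 70
reference = [True] * 50 + [False] * 50

air = adverse_impact_ratio(protected, reference)
print(round(air, 2))  # 0.6
```

A ratio of 1.0 would mean both groups are approved at the same rate. The further below 1.0 the ratio falls, the larger the disparity in approvals.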
AI-based lending models appear to be more fair
Lenders that use AI-based loan software such as Upstart’s or Zest’s said the software dramatically outperforms traditional models on both fairness and accuracy.
For instance, executives at Verity Credit Union in Seattle said that in the year the credit union has been using an AI-based loan decision model from Zest AI, it has increased loans to people over 62 years old by 271%, loans to Black borrowers by 177% and loans to Hispanic borrowers by 158%, while approvals for women rose 194%. Delinquency and default rates have been normal so far, they said. Among all the lenders it works with, Zest AI said its software has helped increase approvals by 49% for Latinos, 41% for Black applicants, 40% for women and 36% for elderly applicants, all while holding risk constant.
Upstart said that compared to a more traditional underwriting model, its AI-based model approves 116% more Black applicants at annual percentage rates that are 36% lower and it approves 123% more Hispanic applicants at APRs that are 37% lower.
“There is an enormous amount of opportunity to improve traditional models on both fairness and accuracy grounds by using AI,” Hoopes said. “Upstart, Zest, Fairplay and Stratyfy have all published access to credit results that show that by using AI, lenders can dramatically improve the existing models on both of these vectors.”
There shouldn’t be a trade-off between fairness and accuracy, Hoopes said. “If you do get into a situation where you’re overpredicting default for a certain group, then that’s a great signal that you need to try to find a better model,” he said.
Saleh hopes industry participants can come to more agreement about how fairness and accuracy of AI-based loan decision models should be evaluated.
“This is a major question that is restraining less discriminatory alternatives from actually getting into production,” he said.