Chatting with a libertarian friend over at a wine bar before the recent election, I was reminded how fear motivates conservatives to advocate against social welfare programs. My friend, for example, is so fearful of welfare fraud and voter fraud that he says he’d prefer to eliminate social programs in order to avoid the possibility of abuse. Strong emotion blinds him to the fundamental trade-off implicit in his policy position. A more constructive conversation focuses explicitly on the trade-off calculation and its consequences.
How fine is your sieve?
Political leaning boils down to what kind of evaluation “sieve” you prefer. Are you more worried about “letting in” someone “undeserving”, or “keeping out” someone “deserving”? Conservatives worry more about the former and progressives worry more about the latter.
For policies of inclusion or assistance, conservatives fear false positives and progressives fear false negatives. Conversely, when it comes to policies of exclusion or punishment (e.g., criminal justice), the concern is inverted: conservatives fear false negatives and progressives fear false positives.
My libertarian friend’s evaluation sieve has an extremely fine mesh: He is willing to refuse help to needy people, in the hopes of ensuring that no non-needy people ever pass through the sieve to receive public assistance. Similarly, he believes that disenfranchising eligible voters is an acceptable trade-off to prevent any cases of voter fraud from slipping through.
My sieve is more liberal, with a coarser mesh: I am more concerned with the ethics of refusing help to needy people than I am with a few low-income people improperly slipping through. I am willing to accept that some people will undeservedly pass through a coarse sieve to receive unneeded welfare benefits – but I put far more weight on ensuring we don’t fail to serve the truly needy. Similarly, I know that if we make it easy for all eligible voters to vote, we might correspondingly see a few more voter irregularities – but I view that trade-off as socially and ethically beneficial.
- My friend the conservative is afraid that a soft, “gullible” system will make a mistake of over-inclusion and cost him too much money.
- I, the progressive, want to prevent a cold, “blind” system from being so overly exclusive that we miss opportunities and fail to meet ethical obligations.
In contrast to this particular libertarian friend, when I do the trade-off arithmetic, I consider the actual rate at which the undeserving pass through the sieve our system currently uses: The actual incidence of voter fraud in US federal elections is documented to affect approximately 0.00001% of ballots. Reducing an error rate from 0.00001% to 0.000000% is prohibitively expensive and practically impossible.
Type I and Type II errors
How would my libertarian buddy and I decide which individuals need our taxpayer-funded help? We administer an evidence-based test (i.e., use a decision-making sieve) and make an inferential assessment. That assessment might correctly describe the true situation… or it might be a wrong conclusion. The combination of our inferential conclusion (help or no help offered) and unknown true situation (help or no help needed) leads to four possible outcomes:
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| Null hypothesis is false | True positive | Type I error |
| Null hypothesis is true | Type II error | True negative (“nothing to report”) |
Coming to a wrong conclusion is always a bad thing. Depending on the situation and one’s value system, either a Type I (false positive) or a Type II (false negative) error is comparatively worse.
- Type I error = Incorrect rejection of the going-in assumption (null hypothesis). Seeing something that isn’t there.
- Type II error = Incorrect failure to reject the null hypothesis (i.e., wrongly accepting it). Failing to see something that is there.
Type I and II error rates are governed by the test’s confidence level, the data sample size, and the effect size being measured; the background prevalence of the phenomenon then determines how often each type of error actually occurs. Administering a test with a high confidence level means that it takes a lot of evidence/effort to reject the null hypothesis. Thus, we’re less likely to have Type I errors, but, for a given sample size and effect size, we are necessarily more likely to have Type II errors. Conversely, selecting a low confidence level generates more Type I errors in exchange for reducing Type II errors.
Test design involves an inescapable trade-off between two types of error. You MUST choose which one matters more in each situation. In other words: Would you rather be gullible or blind?
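The trade-off can be made concrete with a small calculation. What follows is only a sketch, assuming a one-sided z-test with known variance; the effect size (0.3 standard deviations) and sample size (50) are illustrative assumptions, not figures from any program discussed in this article, and `type_ii_error_rate` is a hypothetical helper name.

```python
from statistics import NormalDist

def type_ii_error_rate(confidence_level, effect_size, sample_size):
    """Type II error rate of a one-sided z-test (known variance).

    effect_size is the true effect, in standard-deviation units.
    """
    std_normal = NormalDist()
    alpha = 1.0 - confidence_level                # Type I error rate
    critical_z = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold
    shift = effect_size * sample_size ** 0.5      # true effect, in standard errors
    # We miss the effect whenever the test statistic lands below the threshold.
    return std_normal.cdf(critical_z - shift)

# Assumed design: true effect of 0.3 SD, 50 observations.
for confidence in (0.80, 0.90, 0.95, 0.99):
    beta = type_ii_error_rate(confidence, effect_size=0.3, sample_size=50)
    print(f"confidence {confidence:.0%}: "
          f"Type I = {1 - confidence:.0%}, Type II = {beta:.1%}")
```

With the design held fixed, each step up in confidence level buys a lower Type I rate at the price of a higher Type II rate. Only more data or a larger effect improves both at once.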
Welfare program eligibility
Null hypothesis: People generally don’t need public assistance. (It’s a safety net, not an automatic benefit.)
Burden of proof: Applicant must provide extensive documentation to prove eligibility
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Help many people who really need it | Accidentally help a few less-deserving people |
| TRUE | Fail to help some people in dire need | Turn away those who don’t qualify for help |
Social welfare was the first example in my wine-fueled libertarian-vs-progressive debate. In this system, we are selecting for inclusion – evaluating applicants to determine who deserves assistance. This evaluation “sieve” presumes that public assistance is not an automatic benefit, but rather a “safety net” for when structural economic conditions cause exceptional individual suffering. In statistics terms, the null hypothesis is that an applicant doesn’t qualify for help. The burden is on the applicant to provide evidence that they meet the government’s qualifying threshold.
Like all modern nations, the United States offers various “welfare” programs because of both (a) the moral imperative and (b) the economy-wide benefit of limiting desperation among the poorest. For example, giving cash to poor single mothers of infants correlates to their children having higher IQ, lower lifetime medical costs, and less criminality. Our society as a whole reaps long-term economic benefit from government spending money on welfare programs.
The progressive leaning is to worry most about the sin of failing to help the needy. Turning away a desperate, impoverished human being means that the decision process was too skeptical — we used too high of a confidence level for the evidentiary test. As a result, that test wasn’t powerful enough to include enough of the people we hoped to help. The “sieve” was too fine.
Conservative rhetoric often claims that there are untold numbers of well-to-do or lazy people free-riding on the welfare system. (Usually, as in my wine bar friend’s case, such passion is based on extrapolation from one or two anecdotes, rather than on data.) Their illogical argument is an appeal to the emotion of fear, suggesting that we’re being tricked and letting our tax money be frivolously deployed.
The progressive counterpoint is to consider data about who those “undeserving” Type I error beneficiaries actually are as individuals and families. Arguably, the inevitable Type I mistakes aren’t such clear-cut mistakes: nobody enduring the shame of welfare and living off its parsimony has anything close to an easy life. It should perhaps give conservatives some solace to consider that false-positive welfare recipients are still poor, so there is undoubtedly still some multiplier effect creating a secondary benefit to society.
Re-framing the policy debate as a less-emotional discussion about the relative costs of Type I and Type II errors enables mutual understanding – and perhaps a path to compromise: What is the false positive and false negative rate of our current system? What are the financial and social costs of each type of error? What trade-off are we willing to accept between accidentally excluding the needy and accidentally including the less-needy?
Criminal justice
Null hypothesis: Accused person is innocent
Burden of proof: Prosecutor must prove guilt, beyond a reasonable doubt
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Convict guilty person | Convict innocent person (4%) |
| TRUE | Acquit guilty person | Acquit innocent person |
In the criminal justice system, we are selecting for exclusion – evaluating the accused to determine who deserves punishment. The threshold of reasonable doubt intentionally makes it difficult to reject the null hypothesis of innocence. Our system recognizes that convicting an innocent person (Type I error) is morally much worse than acquitting a guilty person (Type II error). That trade-off is made even more morally obvious because convicting an innocent person almost always means that a guilty person has remained unpunished for the crime in question.
The progressive viewpoint is to worry most about convicting the innocent. A false conviction means that the jury wasn’t skeptical enough of evidence – it used too low of a confidence level for the evidentiary test. In the past few decades, America has been shocked awake about the staggeringly high false conviction rates in our criminal justice system. An estimated one in 25 people sentenced to death is innocent. That’s a 4% Type I error rate… where the moral consequences of every single uncorrected error are astronomical.
Conservatives traditionally use rhetoric grounded in fear of acquitting guilty people. Indeed, sometimes the null hypothesis is implicitly characterized as guilt, challenging the accused to prove innocence. The problem with this inversion of the US constitution is not the oft-repeated idea that establishing certainty about the non-existence of something (guilt, god, bigfoot) is challenging. (Establishing absolute certainty about the existence of those same things is also challenging.) Rather, presumption of innocence is a universal human rights standard because society has agreed that false convictions are worse than false acquittals. Placing the burden of proof on the accuser is intended to limit Type I errors (and to ensure that accused innocents are treated well all the way through the process until they are, hopefully, acquitted).
New York City’s police department infamously accepts a very high Type I error rate (frisking innocent people) in hopes of lowering their Type II error rate (failing to prevent crime, by not frisking gun-toting troublemakers). The practice trampled the right of hundreds of thousands of innocent young black and Latino men to walk around in their own neighborhoods… and has failed to significantly lower crime rates. The very fear used to justify the practice is left unquelled by the practice’s abysmal results in crime prevention.
This poorly-designed stop-and-frisk “test” yields an astonishingly high false positive share of 98.2% (% of stops where no gun is found) and thus only a 1.8% positive predictive value (% of stops where a gun is found).
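As a quick check on these percentages, here is a small sketch that uses only the 1.8% hit rate quoted above. The variable names are illustrative.

```python
# Share of stops in which a gun is found (figure quoted in the text).
positive_predictive_value = 0.018

# Share of stops in which no gun is found.
false_alarm_share = 1.0 - positive_predictive_value

# Fruitless stops of innocent people for each gun recovered.
false_stops_per_gun = false_alarm_share / positive_predictive_value

print(f"{false_alarm_share:.1%} of stops find no gun")
print(f"about {false_stops_per_gun:.0f} fruitless stops per gun found")
```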
Re-stating the debate as a difference in relative concern about Type I versus Type II errors is useful. We can perhaps nudge conservatives and progressives out of the deadlock of conflicting value systems and into dialogue: How can stop-and-frisk advocates explain why abrogating the rights of roughly 55 people to find 1 gun is an ethically defensible trade-off? Could we create alternate “tests” for guns that have a higher predictive value, and thus a less egregious civil rights cost? If we continue this practice with such a high false “conviction” rate, how can we soften the real personal harm of all those false positives?
Hiring and promoting women
Anti-woman bias in the workplace equates to a high false negative rate in hiring and promotion. Systematically failing to acknowledge and reward women for their valuable capabilities and contributions constitutes a Type II error of omission (blind/dismissive). Overvaluing men who aren’t actually better performers is a Type I error of inclusion (credulous/gullible). The deeply-biased “test” for hiring begins with a skeptical null hypothesis that discriminatorily burdens women with proactively proving our worth. It’s like a warped judicial system wherein the accused must prove their own innocence.
Again, fear is the culprit: fear of working with someone different from oneself (gender, race, age, religion, etc.), fear that the comfortable status quo culture will change. Ironically, fear of making a mistake leads directly to the very costly mistake of over-exclusion.
Slowly, some progressive companies have begun to realize the high cost of Type II hiring errors relative to Type I hiring errors. Failing to recruit from half the population means that a company is, by definition, reaching deeper down into the barrel of male talent – which ultimately costs the company in productivity, innovation and competitiveness. Meanwhile, accidentally hiring or promoting the wrong person can be reversed, once observed performance clearly diverges from expectations. From a rational economic perspective, companies have much to gain and little to lose by adopting a much coarser hiring sieve to proportionally include women.
[See my related article “The Lost Generation”]
Voter eligibility
Null hypothesis: Person isn’t eligible to vote
Burden of proof: Voter must produce eligibility documents at polling location
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Eligible voter casts ballot | Rare cases of voter fraud (0.00001%) |
| TRUE | Disenfranchise eligible voters (4.4%) | Ineligible voters not allowed to vote |
In the case of voter eligibility, our system is one that selects for inclusion by requiring would-be voters to prove eligibility at the polls (with specific identification requirements varying widely by state). Therefore, the progressive preference is for a coarse sieve. We ought to make it relatively easy for people to register, prove their eligibility at the polls, and exercise their constitutional right to vote. Wrongfully excluding many eligible voters does far greater harm to society than a rare case of counting a fraudulent vote.
Conservatives’ fear-based worldview leads to willful misapprehension of the Type I error rate by multiple orders of magnitude. In fact, out of 197 million votes cast in federal elections between 2002 and 2005, there were 26 confirmed cases of voter fraud (i.e., ineligible voters being allowed to vote). Assuming that all fraud instances were caught, that equates to a Type I error rate of less than 1 in 7 million, or 0.00001%. In other words, we’re currently screening voters at the polls with a 99.99999% confidence level. (An ultra-high-availability computer system with that level of “seven nines” service guarantee would have just 3 seconds of downtime per year.)
Meanwhile, in states with strict photo identification laws, studies estimate that 11% of eligible voters lack qualifying identification. If 40% of those voters turn out to vote (per average federal election turnout rates), that means as many as 4.4% of eligible would-be voters are disenfranchised. A Type I error rate of 0.00001% and Type II error rate of 4.4% means that our system trades off over 300,000 disenfranchised voters for every 1 voided illegitimate ballot. (Note: Though strict photo ID laws have been proven in court to disenfranchise voters and suppress turnout, they are not necessarily the direct means by which we catch voter fraud. The relationship between avoiding fraud and avoiding disenfranchisement is another — more complex — story.)
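The arithmetic behind these figures can be reproduced in a few lines. This sketch uses only the numbers quoted above and, as the text does, treats both error rates as shares of the same pool of would-be voters. The variable names are illustrative.

```python
# Figures quoted in the text.
confirmed_fraud_cases = 26
votes_cast = 197_000_000
share_lacking_id = 0.11   # eligible voters without qualifying photo ID
assumed_turnout = 0.40    # turnout assumed for those voters

type_i_rate = confirmed_fraud_cases / votes_cast    # fraudulent ballots counted
type_ii_rate = share_lacking_id * assumed_turnout   # eligible voters turned away

votes_per_fraud_case = votes_cast / confirmed_fraud_cases
disenfranchised_per_fraud = type_ii_rate / type_i_rate

print(f"Type I rate:  {type_i_rate:.7%} (1 in {votes_per_fraud_case:,.0f})")
print(f"Type II rate: {type_ii_rate:.1%}")
print(f"Disenfranchised voters per fraudulent ballot: "
      f"{disenfranchised_per_fraud:,.0f}")
```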
Economic logic tells us that the marginal cost of reducing an already-minuscule Type I error rate would be an inefficient use of taxpayer money. Typically, conservatives would eagerly support such an argument. However, in this case, a political power agenda trumps economic principle. Eligible voters lacking photo identification are disproportionately low-income and left-leaning. So, the harm done to them helps conservative candidates. To deflect charges of intentional voter suppression, conservatives focus on obfuscating data about what is, in truth, the vanishingly low base rate incidence of voter fraud. They incite fear of something that is not, in fact, happening.
Re-framing this debate can advance an otherwise-stalled dialogue. We can pose questions that directly address the underlying disagreement: How does the marginal cost to taxpayers of further reducing a low Type I error rate compare to the marginal cost of reducing a high Type II error rate? How many disenfranchised voters is one avoided fraudulent vote “worth” to us as a society? How can we better communicate voter fraud data, to combat the factual ignorance that underpins support for voter suppression laws?
Refugee vetting for asylum
Null hypothesis: Asylum-seeker deserves refuge
Burden of proof: Immigration department must identify security threats
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Turn away refugees who pose a threat | Deny help to many innocent people in dire need |
| TRUE | Rare cases of granting visas to dangerous people (0.00009%) | Provide refuge, liberty, opportunity to hard-working good people |
Progressives further point out that the “false negative” rate of accidentally giving out visas to terrorists is comfortingly low: Among 3.25 million refugees admitted into the United States 1975-2015, 3 caused a death. Thus, our current asylum applicant test has a 0.00009% Type II error rate. That’s a 99.99991% power level (in exchange for what is theoretically a low confidence level – which we can’t calculate because the number of rejected refugees who would have been domestic terrorists is unknowable).
(Note: If you prefer to conceptualize our current system as selecting for inclusion, just swap all of the vocabulary. The conclusion remains the same if we invert terminology to describe the null hypothesis as “terrorist”, progressives’ concern as avoiding “Type II” errors of wrongful denial of asylum, and “false positives” of over-inclusion as historically low.)
In truth, because the base rate prevalence of terroristic leanings among human beings is very low, it is a mathematical corollary that most rejected refugees are harmless. When prevalence is very low, even a highly accurate test produces far more false positives (i.e., rejecting innocent refugees) than true positives (i.e., rejecting terrorists). Misapprehension of this counter-intuitive truth is the same “base rate fallacy” that reared its xenophobic, bigoted head in the 2016 election cycle’s Skittles-refugee comparisons.
[See my related article “Skittles vs Refugees: The humanitarian cost of inferential error”]
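The base-rate effect described above can be checked with Bayes’ rule. This is a sketch under stated assumptions: the base rate borrows the 3-in-3.25-million figure from the text as a stand-in for prevalence among applicants, while the vetting accuracy (99% of real threats caught, 1% of harmless applicants wrongly rejected) is purely hypothetical and chosen to be generous to the test.

```python
def share_of_rejections_that_are_innocent(prevalence, sensitivity,
                                          false_positive_rate):
    """Fraction of rejected applicants who pose no threat (Bayes' rule)."""
    true_positives = prevalence * sensitivity                   # threats rejected
    false_positives = (1.0 - prevalence) * false_positive_rate  # innocents rejected
    return false_positives / (true_positives + false_positives)

# Base rate borrowed from the text's figure of 3 in 3.25 million.
prevalence = 3 / 3_250_000
# Hypothetical vetting accuracy, chosen only for illustration.
sensitivity = 0.99          # share of real threats the vetting catches
false_positive_rate = 0.01  # share of harmless applicants wrongly rejected

innocent_share = share_of_rejections_that_are_innocent(
    prevalence, sensitivity, false_positive_rate
)
print(f"{innocent_share:.4%} of rejected applicants are harmless")
```

Even with a test this accurate, nearly everyone it rejects is harmless, because harmless applicants so vastly outnumber dangerous ones to begin with.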
Conservatives’ attitude on refugee immigration is entirely explicable as a manifestation of a fear-based worldview. Conservatives put more emphasis on the low risk of admitting one horrifically dangerous person, compared to the high risk of failing to help hundreds of thousands of innocent victims of war. Fear makes them overstate the low risk of a Type II error, and fear makes them less motivated by the humanitarian failure of large-scale Type I errors.
Psychology research shows that conservatives’ minds are more sensitive to threats of harm and oriented toward protective separation, whereas progressives’ minds are more sensitive to threats of loss and oriented toward community. We can imagine how both worldviews would have been useful evolutionary adaptive traits, and how both remain valuable today. But, because people don’t easily alter their worldview, the policy debate stalls as a clash of worldviews.
Again, it is more fruitful to re-frame the debate as a less emotional one, based on consideration of real data and statistical trade-offs:
- How much higher do we think terrorism prevalence is among Syrian refugees today, compared to the known low rate of 0.00009% among all refugee immigrants over the past 40 years?
- How could we change our immigrant vetting process to keep us safe (by minimizing false negatives), without violating our nation’s core principles (via high false positive rates)?
Market research
Null hypothesis: No trends (signals) exist in the data (noise)
Burden of proof: Researcher seeks to identify all market signals for a business decision-maker
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Correctly report real market signals | Misleading report of random noise as a market signal |
| TRUE | Report nothing, even though there are trends in the market | Correctly report an absence of market trends |
When we move out of the realm of social policy and into the business world, relative values of Type I and Type II errors shift. Typically, there are only dollars at stake and no direct humanitarian cost of drawing erroneous conclusions about the world. Which grade sieve is appropriate is therefore highly situational.
The aim of market research is to identify actionable market trends and customer preferences – to find meaningful “signals” within the “noise” of voluminous data. Business decision-makers use Bayesian methods to incrementally update prior beliefs based on new research results. And, those market research results are but one of many information sources considered.
Decision-makers want to consider numerous possible signals – not just the few that would pass through a super-fine sieve using a high confidence level:
- Spurious findings (Type I errors) aren’t harmful because they go through additional post hoc judgment filters and don’t independently drive action.
- In contrast, incorrectly believing there’s nothing happening in the market (Type II errors) can be quite costly to a business. We are very concerned about missing real phenomena.
Therefore, market research optimally uses a low confidence level to define statistical significance, allowing more potential signals to surface and be reported as significant findings.
[See my related article “Intentional gullibility: Slash your statistical confidence level to 80%!”]
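To illustrate what lowering the confidence level buys, the sketch below counts the findings a study might be expected to report at 95% versus 80% confidence. Every input is a hypothetical assumption: 100 candidate trends of which 20 are real, each real trend sitting 1.5 standard errors from zero, and a one-sided z-test. `expected_findings` is an illustrative helper, not an established method.

```python
from statistics import NormalDist

def expected_findings(confidence_level, real_signals, noise_candidates,
                      effect_in_std_errors):
    """Expected counts of (real, spurious) findings from one-sided z-tests."""
    std_normal = NormalDist()
    alpha = 1.0 - confidence_level                     # Type I error rate
    critical_z = std_normal.inv_cdf(confidence_level)  # significance threshold
    # Power: chance that a real signal clears the threshold.
    power = 1.0 - std_normal.cdf(critical_z - effect_in_std_errors)
    return real_signals * power, noise_candidates * alpha

# Hypothetical study: 100 candidate trends, 20 of them real.
for confidence in (0.95, 0.80):
    real, spurious = expected_findings(
        confidence, real_signals=20, noise_candidates=80,
        effect_in_std_errors=1.5,
    )
    print(f"{confidence:.0%} confidence: about {real:.1f} real signals reported, "
          f"{20 - real:.1f} missed, {spurious:.1f} spurious")
```

Under these assumptions the looser threshold surfaces noticeably more real trends, at the price of more spurious ones for the decision-maker’s later judgment filters to discard.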
Product safety testing
Null hypothesis: Product has no defect
Burden of proof: Tester must show a product is defective, to justify pulling it out of the supply chain
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Destroy defective products | Waste money by dumping some non-defective products |
| TRUE | Ship dangerous products to customers | Ship safe products |
In manufactured product safety testing, there is a humanitarian cost on only one side of the equation. The trade-off between Type I and Type II errors is a trade-off between money and people.
As in the market research example above, everyone is logically more worried about false negatives than false positives – politics doesn’t affect how people value the two error types. But, in product safety testing, a Type II error is even worse than in market research: it actually harms customers. We draw the same conclusion as in market research regarding which error type is most costly, but we may go even further to lower confidence and increase power levels of our test to minimize that error.
As in the immigration example above, everyone is worried about false negatives because they harm people. (Bad products can be dangerous; immigrants can be dangerous.) However, in product safety testing, we are trading off only money for people’s safety – whereas the refugee question forces us to trade off some people’s safety for other people’s safety.
Wasting money on dumping some perfectly good product involves no ethical cost. On the other hand, harming our customers does. Just as “refugees aren’t Skittles” (per the famous tweet by Skittles manufacturer Mars, Inc.), so too people aren’t products. When faced with a money-vs-people tradeoff, politics don’t apply. Everyone agrees we must focus primarily on minimizing Type II errors and keep bad products off the shelf.
Depending on the specific harm a defect would cause (inconvenience, discomfort, illness), the monetary value of identifying defects changes. A manufacturer’s self-interest also lies in spending money to minimize Type II errors, due to the reputational ripple effects of shipping bad products. In this case, public safety and self-interest agree on the Type I-Type II error tradeoff.
Scientific research
Null hypothesis: No effect/link exists
Burden of proof: Researcher aims to show that there is a link/effect
| Assessment | Null hypothesis is false | Null hypothesis is true |
|---|---|---|
| FALSE | Publish important findings | Publish non-replicable results, damage reputation (5%) |
| TRUE | Fail to identify potentially important effect (~30-60%) | Uninteresting outcome – nothing to publish |
Physical science researchers are extremely concerned about the embarrassment of reporting false positive results. False positives misdirect further experiments in the wrong direction. They damage the researcher’s personal reputation and public credibility of the scientific process.
False negatives, in contrast, aren’t as harmful. Other research teams will eventually identify and publish the real effect that any one particular experiment fails to uncover. For individual scientists, lack of publishable results is indeed a disappointing, missed opportunity, but not punitive.
In contrast to the market research example above, scientific research has good reason to prioritize avoidance of Type I errors over avoidance of Type II errors. Therefore, a higher confidence level (e.g., 95%) is warranted. Meta-studies of published scientific papers have shown that the high confidence levels and experimental design (effect size, sample size) commonly yield Type II error rates (1 – power) exceeding 50%. Science is quite willing to be blind in order to avoid gullibility.
Consider the consequences
Improving the quality and productiveness of policy debates starts with gathering data:
- What is the base rate incidence/prevalence?
- What are the Type I and Type II error rates in the current system?
- What are the financial, ethical and social costs of each type of error?
From that factual basis, conservatives and progressives wielding opposing value systems can more rationally clarify their position regarding inescapable trade-offs: How much are we willing to pay to reduce either error rate towards zero? How many mistakes of over-inclusion are acceptable to avoid one instance of over-exclusion?
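Once the three questions above have answers, the trade-off reduces to an expected-cost comparison. The sketch below shows the shape of that calculation. Every number in it (applicant count, prevalence, error rates, cost per error) is a hypothetical placeholder, not data about any real program, and `expected_error_cost` is an illustrative helper.

```python
def expected_error_cost(applicants, prevalence, type_i_rate, type_ii_rate,
                        cost_per_false_positive, cost_per_false_negative):
    """Expected total cost of both error types for a screening system."""
    truly_qualified = applicants * prevalence
    truly_unqualified = applicants - truly_qualified
    false_positives = truly_unqualified * type_i_rate  # wrongly included
    false_negatives = truly_qualified * type_ii_rate   # wrongly excluded
    return (false_positives * cost_per_false_positive
            + false_negatives * cost_per_false_negative)

# Hypothetical inputs shared by both candidate systems.
shared = dict(applicants=1_000_000, prevalence=0.6,
              cost_per_false_positive=5_000, cost_per_false_negative=20_000)

# Hypothetical error rates for a coarse sieve and a fine sieve.
coarse_sieve_cost = expected_error_cost(type_i_rate=0.10, type_ii_rate=0.02, **shared)
fine_sieve_cost = expected_error_cost(type_i_rate=0.01, type_ii_rate=0.20, **shared)

print(f"coarse sieve: ${coarse_sieve_cost:,.0f}")
print(f"fine sieve:   ${fine_sieve_cost:,.0f}")
```

Which sieve comes out ahead depends entirely on the costs assigned to each error type. That is exactly where conservatives and progressives disagree, and the calculation at least forces the disagreement into the open.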
Ultimately, to satisfy everyone, a system must include policy responses to the consequences of each inevitable type of error: Given that we will always have some false criminal convictions, how can we provide adequate remedies? If we settle on a narrower welfare system, how can we provide a secondary safety net for those in dire need who nonetheless inevitably fall between the cracks? If we continue with extreme vetting and rejecting refugees, can we appropriate money to create a robust appeal system, or to subsidize their resettlement in more receptive countries?
The failure mode that many conservatives bring to policy debates is a refusal to consider the consequences. Even within conservative politics, discussion quality is greatly improved by framing issues as Type I-Type II error trade-offs. For example:
- My right-leaning libertarian friend in the wine bar was initially only concerned with keeping the “undeserving” people “out” and government expenditure low – implicitly (and irrationally) at any cost. He hadn’t considered the downstream fate of people wrongly excluded in his idealized-but-inevitably-imperfect system. Even though the humanitarian argument doesn’t move someone like him, he does care about the long-term net cost to society of the wrongful exclusion (once it’s brought to his attention quantitatively).
- In contrast, a center-leaning libertarian friend adopts a more holistic perspective from the outset: She too may fervently prefer limited entitlement spending. But, before simplistically advocating budget cuts, she considers the real financial, social and ethical costs of withdrawing assistance from poor people. She realizes that reduced spending must be packaged with a concrete plan to mitigate the consequences of errors of over-inclusion and over-exclusion.