How representative is your group? Statistical identification of selection bias

How representative of the US population is your group: your friends? Your employees or work colleagues? Your classmates? Your dating history?

This calculator is a starting point for thinking about this topic.  It’s difficult to know how proportionally representative a group of people is, because (a) we don’t usually know the overall population proportion numbers offhand, and (b) most of us don’t know how to mathematically determine whether a difference in two percentages is statistically significant, i.e., whether it is unlikely to be due to random chance.   
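One common way to check whether a group's percentage differs significantly from a known population percentage is a one-sample z-test for a proportion. The sketch below illustrates that approach; it is an assumption about how such a check can be done, not necessarily the exact method this calculator uses. The function name `chance_likelihood` and the example numbers (4 of 55 people, in a category that makes up 14% of the population) are hypothetical.

```python
import math

def chance_likelihood(count, group_size, population_share):
    """Two-sided p-value from a one-sample z-test for a proportion.

    Relies on the normal approximation, which is rough for small
    groups or very rare categories.
    """
    group_share = count / group_size
    # Standard error of the group share if the group were a random
    # sample drawn from the population.
    std_err = math.sqrt(population_share * (1 - population_share) / group_size)
    z = (group_share - population_share) / std_err
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical example: 4 of 55 people, population share of 14%.
p = chance_likelihood(4, 55, 0.14)
print(f"Likelihood the difference is due to random chance: {p:.0%}")
```

A small result means the gap between the group and the population would rarely arise by luck alone; a large result means the gap is well within what random variation produces.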

The default numbers entered below reflect the men I have dated.  I built this calculator in response to the January 2017 Muslim Ban (which I fervently oppose).  I wanted to see whether, based on the numbers, my history of never having dated a Muslim man implies unconscious bias on my part.  It turns out the US population of Muslims is too small to answer this question definitively, so it remains solely up to my own introspection to assess the possible causes, and whether they are acceptable or whether they should motivate change.  Note that, in contrast, though the US population of Jews is similarly small, we can calculate an answer because Jews are over-represented in my dating pool.  Additionally, it is very telling that non-whites are not actually over-represented in my group, despite the vocal criticism I field in conservative Colorado for “always dating brown guys”.  Perceptions are so often corrected by looking at hard data!
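To see why a small population share can leave the question open, consider the chance that a randomly drawn group contains nobody from a category at all. This is a minimal sketch of a simpler calculation than a full significance test, assuming each person is drawn independently from the population. The function names and the group size of 50 are hypothetical, and the 20% cut-off corresponds to an 80% confidence level.

```python
import math

def chance_of_zero(group_size, population_share):
    """Probability that a random group contains nobody from the category."""
    return (1 - population_share) ** group_size

def smallest_group_for_significance(population_share, cutoff=0.20):
    """Smallest group size at which seeing zero members has probability <= cutoff."""
    return math.ceil(math.log(cutoff) / math.log(1 - population_share))

# Hypothetical group of 50 people, category share of 1%.
print(f"Chance of zero members in a group of 50: {chance_of_zero(50, 0.01):.0%}")
print(f"Group size needed: {smallest_group_for_significance(0.01)}")
```

With a 1% share, a group of a few dozen people will contain no members more often than not, so an absence alone says little. An over-represented small category is different: several members of a rare group in a small pool is already an unlikely draw.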

What to do:  (1) Identify a pool you want to test.  (2) Enter the total number of people in the group, and also the number of people fitting each category.  Note that people often fit into multiple categories, and not every category needs to be filled.  (3) Take note of who is under- and over-represented in your group!  I’ve used an 80% confidence level to define significance here, so a category is flagged when the likelihood that its difference is due to random chance is below 20%.
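The three steps can be sketched in code as follows. This is an illustration only, assuming a two-sided one-sample z-test and an 80% confidence level; the function `classify`, the group size of 40, and the counts in `my_group` are made up for the example, while the population shares are the ones shown in the calculator.

```python
import math

def classify(count, group_size, population_share, confidence=0.80):
    """Label a category as over-represented, under-represented, or neither."""
    expected = group_size * population_share
    std_err = math.sqrt(population_share * (1 - population_share) / group_size)
    z = (count / group_size - population_share) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    # People above (positive) or below (negative) the expected count.
    surplus = count - expected
    if p_value >= 1 - confidence:
        return "no significant difference", surplus, p_value
    label = "over-represented" if surplus > 0 else "under-represented"
    return label, surplus, p_value

# Hypothetical group of 40 people with made-up counts per category.
population = {"Born outside US": 0.15, "Hispanic": 0.17, "Military/veteran": 0.09}
my_group = {"Born outside US": 14, "Hispanic": 6, "Military/veteran": 0}

for category, share in population.items():
    label, surplus, p_value = classify(my_group[category], 40, share)
    print(f"{category}: {label} ({surplus:+.0f} people, chance {p_value:.0%})")
```

Raising `confidence` toward 0.95 makes the check stricter, so fewer categories are flagged; 80% is a deliberately lenient threshold suited to small groups.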

For further consideration:  

  • How would your group under-/over-representation results change if you compared to your city, instead of the whole US population?  What is the appropriate reference group for your case?
  • Do the under-represented categories reflect a known source of selection bias?  Is there another partial explanation for why the group doesn’t exactly reflect the population at large?  Is that explanation an ethically defensible one?
  • Is over-representation necessarily a good thing, meaning a lack of bias?  Who is being squeezed out by the over-represented category?

Proportional representation of identity groups in a pool of people

| Identity group | Number of people in my group | United States population | My group | Number of people over/(under)-represented by my group | Likelihood that the difference is due to random chance |
|---|---|---|---|---|---|
| Women | | 51% | | | |
| LGBTQ | | 4% | 2% | (1) | 41% |
| Muslim | | 1.0% | 2% | (1) | 46% |
| Jewish | | 1.4% | 7% | 3 | 0% |
| Black | | 14% | 7% | (4) | 15% |
| Hispanic | | 17% | 15% | (1) | 63% |
| Non-white (Asian, Indian, black, native) + Hispanic | | 37% | 36% | (0) | 92% |
| Born outside US | | 15% | 33% | 10 | 0% |
| Military/veteran | | 9% | 2% | (4) | 6% |
| Disabled | | 19% | 0% | (10) | 0% |
| English not dominant language | | 21% | 22% | 0 | 88% |
| White-collar career | | 40% | 87% | 26 | 0% |
| Redhead | | 2% | 4% | 1 | 39% |
| Alcoholic | | 7% | 5% | (1) | 65% |

My group UNDER-represents these people: Black, Military/veteran, Disabled
My group OVER-represents these people: Jewish, Born outside US, White-collar career

