Why (almost) everything about Chris K’s article is right: A World of Difference exists at the Margins
As someone working in tech and analytics, I find Chris’ position absolutely correct. His article is probably one of the more comprehensive clarifications out there, and one I agree with to a very large extent. Yet there is one very glaring omission nobody seems to be talking about.
I like Chris’ article. Also, make no mistake: I’m a die-hard Remainer. I got the same urges to scream at him as most people would have done. However, I am, first and foremost, an analytical personality. I believe in honouring facts and truth above all else, since they lead you to the right conclusions on humanity automatically.
That brings me to say that none of what Chris has said in his initial piece, nor his follow-up to address the criticisms, is in any way incorrect. There are many bigger, more influential factors for voters than Cambridge Analytica’s “micro-targeting”.
Now, I don’t want to focus too much on Carole’s article nor the Guardian’s coverage. Regardless of what I, Chris, or anyone more familiar with the technical aspects would like to believe, Nix was right in stating that “it doesn’t necessarily have to be true. It just has to be believed”. That makes me sick to the pit of my stomach, but that’s society’s problem for uncritically accepting it, not mine.
The data privacy points Chris mentions have been hard-earned for some, so, certainly in the UK, they are among the most important lessons anyone can learn, especially as we come up to the GDPR compliance deadline in May 2018. Yet again, though, it is too little, too late: the data is over three years old. However, I will let Carole and The Guardian have the definition of “hack” as known by the tech community. That doesn’t seem too illegitimate.
As someone working in tech and, worse, as someone who was building and using similar algorithms for a different purpose some five years ago, I am perhaps more familiar than most with Cambridge Analytica’s approach. I have to agree with Chris: the term “micro-targeting” is a pretty big misnomer, granted. After all, a micro-target implies a very specific piece of information, requiring a very specific message, to guarantee that a person receives it in the way that most appeals to them. This isn’t practically possible. Why? Because there were 17 million Leave voters, and Cambridge Analytica didn’t create 17 million adverts, which is what they would have had to do to convince the electorate individually. That obviously didn’t happen. It didn’t need to happen. You merely needed to segment on the independent variables in the statistics. Micro-targeting is a marketing term, not a scientific or analytical one.
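To make the distinction concrete, here is a minimal, purely illustrative sketch of segment-based targeting: a handful of coarse segments, each served one advert, rather than millions of individual messages. All feature names and thresholds below are invented for illustration; a real model would use many more behavioural variables.

```python
# Invented behavioural features for three digital profiles.
voters = [
    {"id": 1, "engages_immigration_posts": 0.9, "engages_economy_posts": 0.1},
    {"id": 2, "engages_immigration_posts": 0.2, "engages_economy_posts": 0.8},
    {"id": 3, "engages_immigration_posts": 0.1, "engages_economy_posts": 0.2},
]

def segment(voter):
    """Assign a digital profile to a coarse segment; one advert per segment."""
    if voter["engages_immigration_posts"] > 0.5:
        return "immigration-focused advert"
    if voter["engages_economy_posts"] > 0.5:
        return "economy-focused advert"
    return "generic advert"

for v in voters:
    print(v["id"], "->", segment(v))
```

The point of the sketch is that three adverts cover any number of voters: the message is matched to the segment, not to the person.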
Your Real Identity != Your Digital Identity
People were (are?) segmented into factors based on online behaviours. None of the regression analyses were carried out on actual people; they were carried out on their digital selves. That is a very different thing. It’s no great secret that many folk act differently online from their real-life personalities. Trolls are a prime example of that, and it is all related to the anonymity the internet gives you behind closed doors.
In addition, if they’re anything like me, they’ll dismiss Facebook ads, comment sarcastically on them, or close them for reasons unrelated to how annoyed they actually are by them. Such confounding variables all go into the mix and come out in the data. Developers should expect those counterexamples in the information they receive from Facebook’s API, since they create covariates that explain the outcome they’re receiving. This naturally means that Cambridge Analytica’s data will be prone to poor quality and will thus produce errors.
However, what is important here is targeting the online personality correctly with the right online message, so that they share it. Cambridge Analytica doesn’t, nor arguably should it, care about the real-life personality. What people say and what they do are two completely different things. People are respectable out of the house, showing their best, but can abuse their family inside it, who get the [dis]pleasure of seeing their worst.
Winning the Margins
Chris is spot on that Cambridge Analytica couldn’t have won the whole thing by itself. There is little to no chance of that. It is precisely why such an operation always attempts to locate persuadable voters. Core or hardened voters won’t leave no matter what, so those voters are a waste of time and resources. The reality is that techniques like those employed by Cambridge Analytica are not going to swing an entire, hardened group away from its core voting position. That isn’t within the bounds of realistic possibility. Swing votes win it.
Yet this is perhaps where I start to disagree with Chris. First, though, where I agree. The problem is perhaps best explained using a classic example:
A test for cancer has a 99% correct detection rate and thus a 1% incorrect diagnosis rate. The prevalence of cancer in the population is 0.01% (1 in 10,000). If someone takes the test and it comes back positive, what is the likelihood they actually have cancer?
This is one of the most famous and most common types of problem you see in medicine. Testing the population using the multilevel regression and post-stratification (MRP) methods Cambridge Analytica used results in a statistic. It’s a dependent variable, but it’s still a statistic. As with all statistics, it comes with type 1 (false positive) and type 2 (false negative) errors, which in turn drive the reliability of the models. Without accounting for these errors, the models don’t present a true picture of what they are describing. In the above example, there is a world of difference between a 99% chance of having cancer when you don’t account for the errors and a less than 1% chance of having cancer after diagnosis when you do. It is precisely why Bayes’ theorem and conditional probabilities are such a fundamental topic in medical higher education. They give you the true statistical picture.
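The arithmetic behind that example can be checked in a few lines of Python; the 99% detection rate, 1% false-positive rate and 1-in-10,000 prevalence are taken straight from the example above.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test), via Bayes' theorem."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# Values from the cancer example: 99% detection, 1% false positives,
# prevalence of 1 in 10,000.
p = posterior(prior=0.0001, sensitivity=0.99, false_positive_rate=0.01)
print(f"P(cancer | positive test) = {p:.4f}")  # roughly 0.0098, i.e. under 1%
```

Almost all positive tests come from the 1% of false positives in the vast healthy population, not from the tiny diseased one, which is exactly why the intuitive “99%” answer is so wrong.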
This is important to the Cambridge Analytica story on several fronts, including but not limited to:
- Showing that CA’s claims will always be subject to error, even if they were “true”.
- That error can wildly affect the results Cambridge Analytica report [depending on who is reading them, how they are presented, and the reader’s skill in understanding them, especially in Management Information reporting].
However, crucially, not only can you do the statistics on the probabilities, you can do the statistics on the error, just as you can do the statistics on deviations, skew or anything else. This also means errors become more significant the smaller the event is relative to the statistic’s general population, while the sampling errors remain the same.
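A quick way to see this scaling is the standard error of a sample proportion: for a fixed sample size the absolute error barely moves, but relative to a rare event it balloons. The sample size and event rates below are invented purely to illustrate the effect.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p estimated from n observations."""
    return math.sqrt(p * (1 - p) / n)

# Same sample size throughout; only the rarity of the event changes.
for p in (0.5, 0.05, 0.005):
    se = standard_error(p, n=1000)
    print(f"event rate {p}: SE = {se:.4f}, relative error = {se / p:.0%}")
```

At a 50% event rate the error is a few percent of the estimate; at half a percent, the same sampling machinery produces an error that is a large fraction of the thing being measured.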
In the case of the Brexit vote, it had a narrow margin of victory, and a scientifically unsound one at that, in no small part due to the propensity for error when accepting outcomes that are either inconclusive or otherwise don’t disprove the null hypothesis. Setting a 50% threshold assumes zero type 1 and type 2 errors in the statistics, which we know for a fact is false, since recounts don’t always give the same result.
When you place the threshold exactly where chance sits, the smallest error can influence the result. This is where Cambridge Analytica’s work, which Chris rightly infers doesn’t influence huge swings in elections to any great degree, comes into its own, since you literally only need to swing one more person than the other side, given similarly sized segments on each side, to declare that [an inconclusive result] Leave had won. It could even be a byproduct of timing.
The net effect of this statistical segmentation, together with appropriate targeted marketing activity, is to produce a shift large enough to swing even one undecided voter in the event of a hung result. Arguably, something did just that, though the undecided voters split both ways on the day (and these last two points rest on a small number of data points, so treat them with care).
Indeed, Cambridge Analytica themselves, wild claims or not, only claim they can shift a result by 2 to 3 percentage points.
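Taking that claim at face value, the arithmetic at a 50% threshold is stark: every point of vote share moved is worth two points of margin, because one side loses exactly what the other gains. The turnout figure below is approximate, and the shift sizes are simply Cambridge Analytica’s own claimed range.

```python
turnout = 33_500_000                 # approximate 2016 EU referendum turnout
actual_margin = 0.038 * turnout      # Leave won by roughly 3.8 points (~1.27m votes)

# A shift of x points of vote share moves 2x points of margin:
# the swung voters are subtracted from one side and added to the other.
for shift in (0.02, 0.03):
    margin_moved = 2 * shift * turnout
    print(f"a {shift:.0%} shift moves {margin_moved:,.0f} votes of margin")
```

On those approximate numbers, even the lower end of the claimed 2 to 3 point shift moves a number of votes of the same order as the entire winning margin.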
Carole and the Guardian had to distil all of this for an audience who know next to nothing about writing services, writing Facebook apps, setting Facebook permissions for code or consuming REST APIs using OAuth. It’s far too complex for a general public unfamiliar with programming. Hence, personally, I think they’ve done a superb job! The Guardian and Carole both did an absolutely sterling job of bringing data privacy and the use of personal information in advertising and political campaigning to a mainstream audience, whether intentional or not. None of us in tech could do that, even though some of us in the tech world had been writing about Cambridge Analytica for a year to 18 months before the story broke.
The key takeaway is that while that 2 to 3% alone would not be sufficient in a well-run, robust, super-majority election, it would definitely be enough when the threshold is the same as flipping a coin. That, to me, is by far the most dangerous thing about the whole escapade, and why the result itself is fundamentally unsound. It makes the result susceptible even to the wind blowing the wrong way, people voting as a joke, or in protest, or [this doozy] because they didn’t think their vote counted. This is where both I and, I sense, Chris agree. At a threshold of 50%, the effect of confounding variables is magnified very significantly. Cambridge Analytica, together with the weather, the side of the bed and everything else, while theoretical co-defendants in this court, also existed for the Remain side, and the undecided swing between Leave and Remain on the day wasn’t sufficiently different.
The crucial analytical takeaway is that a threshold of 50% magnifies the effects of the techniques employed by Cambridge Analytica, or indeed by any competent digital marketing agency. It makes them very viable indeed, since they can, and only need to, sway small numbers, and at the narrow margin, those small numbers become very, very big indeed!