Back in June 2020, I did a pilot statistical analysis of defamation law cases in Australia, mostly to test if some of the popular intuitions about defamation law were true. For example, was it true (as the federal president (media) of the Media, Entertainment & Arts Alliance, Marcus Strom, claimed) that the ‘defamation system is stacked against Australian journalists’? Certainly, the cases he cited — Wilson v Bauer Media  VSC 521, Gayle v Fairfax Media (No 2)  NSWSC 1838, and Chau v Fairfax Publications  FCA 185 — did not seem like typical cases.
But I wanted to test this so that we could elevate the debate about defamation law reform from mere pub talk to something that was more evidence-based.
I looked at some international studies and, in doing so, firmed up my (now almost puritanical) belief that if we are going to do this kind of empirical legal research, the first obstacle is good doctrinal research to ensure that we really have got all the legal content right. The example I keep using is Naomi Wolf’s Outrages, in which Wolf claimed that the death penalty for homosexual acts continued long after historians had said that it ended… except her entire thesis was built on a gross misunderstanding of the legal terminology. Over lunch yesterday, following one of my crazed rants about the centrality of doctrinal analysis to empirical legal research, a colleague noted Andrew Lynch’s review of Sheehan, Wood, and Randazzo’s Judicialization of Politics:
[T]his book suffers from difficulties against which comparativists should remain ever vigilant. Not only is its attempt to give a potted history of the High Court and its judges unsatisfactorily sketchy, but the way in which its empirical data has been developed reflects usage of categories and assumptions that do not have much obvious purchase in Australia.
By getting the law right and then matching it with statistical tools, we avoid creating niche and bespoke tools in empirical legal research that are disconnected from other research traditions and research questions.
But, although the pilot worked fine, I was unhappy with it for two reasons. First, I had tried to get around New South Welsh Disease by using a sampling method designed to suppress its dominance over my data. This comes with a number of trade-offs about the conclusions you can form, but I thought it would at least allow me a better shot at doing some interjurisdictional analysis. This intuition, while noble, sucked: the sheer volume of NSW decisions simply eclipses the smaller jurisdictions, some of which are too tiny to say anything interesting. So I traded off some capacities without any benefit.
Second, this is the first time this kind of data was available and it opened up a number of other research questions that look like they could give better insights into what’s going on in defamation law.
But all the maths worked. I could show that the MEAA’s point was based on extreme outliers and I could apply some theories about case selection to show that some intuitions about defamation law weren’t based on evidence.
With the pilot working fine, I wanted to beef it up to something more robust and fancier.
I also wanted to do it in a way that I could adapt for other kinds of research questions, automating as much as possible. For the pilot project, I used Stata but I wanted to shift to either Python or R. I learnt Python when I was a kid and so suffered flashbacks to the dingy classroom where we did IT. And Dan Nolan informs us all that there’s no reason for Python anywhere. R it was, and thus set off a glorious adventure in learning to use R.
I won’t lie: had I done my data collection manually, I would be writing up the final version of my article by now. But I’ve learnt so much by fucking up my code and method over the past fortnight that I don’t actually mind all that much. The fact that Westlaw spits out cases in three different file formats, the fact that courts can have such a wide variety in how they present information, and the fact that so many ‘tokens’ in legal language are computationally difficult to describe… yeah, I’m having a lot of fun trying to work this one out.
Truth be told, I’m surprised about how much I have to do from scratch. Surely somebody else has got here first and worked out how to batch process judgements into a tidy format… but I can’t find them. It seems surprising that nobody else has created jader to do most of this already. I can download every book from Project Gutenberg or a complete library of Jane Austen, but Australian legal decisions are uncharted waters.