[Edit: 23:11 UTC – if you got this by email, this version is rather different, I edited and expanded it to make it clearer.]
This is another follow-on post from my criticism of the use of Bayes’s Theorem in Richard Carrier’s book Proving History (apologies if you’re bored of this topic). In the review, and in my follow-up introduction to Bayes’s Theorem, I did a bit of ‘vague handwaving’ about errors, and was asked to be more specific. This is an attempt, hopefully still accessible without a lot of mathematical knowledge.
The Effect of Error
So what do we mean by error? Any time we give a number, we have to recognize that the number we give is only an approximation. There is some underlying value, but whatever we do, we can only generate an approximate version of it. There are different ways to deal with the approximation, but one of the most intuitive is as an error range. If we estimate something as 0.2, for example, we could instead say “it is between 0.1 and 0.3, but 0.2 is the most likely”.
When you have an equation that involves inputting some approximate value, you can move the error range through the equation, and come out with a range of possible results. You can say, for example, that if our input X is between 0.1 and 0.3, and the formula we are working with is 2X, then the range of outputs is 0.2 to 0.6.
When we have multiple inputs, the situation is more complex. We might have two inputs: X is between 0.1 and 0.3, while Y is between 0.5 and 1.0, so X times Y is between 0.05 and 0.3. We have to try all combinations of low and high for both values (and in the case of some formulae, but not Bayes’s Theorem, intermediate values too), and find the minimum and maximum result.
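The "try all combinations" step can be sketched in a few lines of Python. This is only an illustration of the idea for the X times Y example above; the function name `product_range` is made up for this sketch.

```python
from itertools import product

def product_range(x_range, y_range):
    """Return (min, max) of X*Y, checking every low/high combination."""
    # For a simple product of positive numbers the extremes sit at the
    # corners of the input ranges, so four evaluations are enough.
    corners = [x * y for x, y in product(x_range, y_range)]
    return min(corners), max(corners)

low, high = product_range((0.1, 0.3), (0.5, 1.0))
print(f"X*Y lies between {low:.2f} and {high:.2f}")
```

For formulae where the extreme values can fall inside the ranges rather than at their ends, checking the corners alone would not be enough.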
So let’s think about the errors in Bayes’s Theorem. I’ll use Carrier’s preferred version:
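Written out in terms of the three inputs discussed next, the usual two-hypothesis form of the theorem is the following. This is a reconstruction from the symbols used in this post, so Carrier's own notation may differ in detail; ¬H here is what the text writes as ~H.

```latex
\[
P(H \mid E) \;=\; \frac{P(H)\,P(E \mid H)}{P(H)\,P(E \mid H) \;+\; P(\neg H)\,P(E \mid \neg H)},
\qquad P(\neg H) = 1 - P(H).
\]
```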
Let’s look at the errors in this. There are three inputs to this equation, P(H) [note that P(~H) is just 1-P(H)], P(E|H) and P(E|~H).
Here’s a graph of the equation:
Because there are three inputs and one output, the graph of the function would actually be four dimensional. So to draw it, we need to lock one value. This shows P(H) locked – so it shows how the result changes, when we change the P(E|X) terms. We can graph the same thing with a different P(H), let’s say it is 1%:
You can see that, with a much lower prior, the output probability is almost always very low, except near the extremely low values of P(E|~H) and the high values of P(E|H). While the center of the graph has got flatter, the back has got steeper. This steepness will cause problems for us below.
Any point on this surface consists of a single set of values moving through Bayes’s Theorem. To look at errors, we instead consider a patch of space on the surface, some range for P(E|H) and P(E|~H), and look at the range of outputs for that patch. In the second diagram, if P(E|H) and P(E|~H) are both about 1/5, then the range of output values is pretty small: the surface is nearly flat there, so the vertical range over the patch is small.
If, however, we say that P(E|~H) is near zero, or P(E|H) is near one, then suddenly the vertical range is huge, because the graph is steep at that point. Even small errors in the input (i.e. a small patch) can give a range of outputs that is from nearly zero, to nearly one (i.e. it could be anywhere from almost impossible to almost certain).
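To put rough numbers on the two kinds of patch, here is a small Python sketch with the prior locked at 1%. The patch boundaries are invented for illustration, and `posterior` and `patch_range` are names made up for this sketch.

```python
from itertools import product

def posterior(p_h, p_e_h, p_e_nh):
    """Bayes's Theorem: P(H|E) from the prior and the two likelihoods."""
    return p_h * p_e_h / (p_h * p_e_h + (1 - p_h) * p_e_nh)

def patch_range(p_h, e_h_range, e_nh_range):
    """(min, max) of the posterior over a patch, with P(H) held fixed."""
    # The posterior rises with P(E|H) and falls with P(E|~H), so the
    # extremes of a rectangular patch are at its corners.
    corners = [posterior(p_h, a, b) for a, b in product(e_h_range, e_nh_range)]
    return min(corners), max(corners)

PRIOR = 0.01  # the 1% prior used for the second graph

# A patch in the flat middle: both likelihoods somewhere around 1/5.
flat = patch_range(PRIOR, (0.15, 0.25), (0.15, 0.25))
# A patch on the steep back: P(E|H) near one, P(E|~H) near zero.
steep = patch_range(PRIOR, (0.90, 1.00), (0.0001, 0.05))

print("flat patch:  %.3f to %.3f" % flat)
print("steep patch: %.3f to %.3f" % steep)
```

The two patches are of similar size, but the outputs of the first stay within a couple of percentage points of each other, while the outputs of the second spread over most of the interval from zero to one.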
So in this case I’ve locked P(H), as if P(H) were certain, but of course that is also a range, and so you have to imagine two of these surfaces stacked, one for the minimum P(H), and one for the maximum. The output range then runs from the lowest point of the patch on the lower surface to the highest point of the patch on the upper surface.
We can draw a different graph, using a different locked value, and see how the output varies with the varying axes.
So here we can see that, if P(H) is small, we get this similar vertical range, and corresponding problems with errors.
Look back at the first or second figure: it shows that, if P(E|H) and P(E|~H) are both small, then the range will be very badly behaved. In other words, if the evidence is genuinely unusual, in both cases, then we’ve got a problem. So it isn’t as trivial as saying “let’s use a conservative value of X”, because behind that value may be a big change in the output.
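A worked example, with numbers invented purely for illustration: take an even prior, P(H) = 0.5, so that the prior cancels out of the theorem, and suppose both likelihoods are estimated at 0.002 but could each be anywhere between 0.001 and 0.003.

```latex
\[
P(H \mid E) = \frac{P(E \mid H)}{P(E \mid H) + P(E \mid \neg H)},
\qquad
\frac{0.001}{0.001 + 0.003} = 0.25
\;\le\; P(H \mid E) \;\le\;
\frac{0.003}{0.003 + 0.001} = 0.75 .
\]
```

The central estimate is 0.5, but an error of one part in a thousand in each input is enough to move the answer anywhere between 0.25 and 0.75.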
This is a problem because, almost by definition, when dealing with events such as the founding of major religions, or the possibility of a human being having a divine parent, or the likelihood of a resurrection, we’re dealing with insanely small probabilities. Exactly the times when Bayes’s Theorem isn’t well behaved.
If P(E|~H) is high, and P(E|H) is low, then things behave quite well. But if we want to be conservative (Carrier’s ‘a fortiori’ method) about P(E|~H), say, and allow the possibility that it is small, then the errors can swamp any useful conclusions.
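A sketch of that contrast in Python, this time letting all three inputs vary. The ranges are invented for illustration and `posterior_range` is a name made up for this sketch; it is not Carrier's procedure, only the corner-checking idea from earlier applied to the full formula.

```python
from itertools import product

def posterior(p_h, p_e_h, p_e_nh):
    """Bayes's Theorem: P(H|E) from the prior and the two likelihoods."""
    return p_h * p_e_h / (p_h * p_e_h + (1 - p_h) * p_e_nh)

def posterior_range(h_range, e_h_range, e_nh_range):
    """(min, max) of the posterior over all low/high input combinations."""
    corners = [
        posterior(h, a, b)
        for h, a, b in product(h_range, e_h_range, e_nh_range)
    ]
    return min(corners), max(corners)

prior = (0.4, 0.6)    # hypothetical range for P(H)
e_h = (0.05, 0.15)    # P(E|H) is low

# P(E|~H) confidently high: the output range stays fairly narrow.
well_behaved = posterior_range(prior, e_h, (0.7, 0.9))
# Allow that P(E|~H) might be small: the range opens right up.
a_fortiori = posterior_range(prior, e_h, (0.01, 0.9))

print("P(E|~H) high:         %.2f to %.2f" % well_behaved)
print("P(E|~H) maybe small:  %.2f to %.2f" % a_fortiori)
```

With P(E|~H) held high the conclusion is at least consistent: H comes out unlikely everywhere in the range. Once small values of P(E|~H) are allowed, the same inputs are compatible with H being anything from very unlikely to very likely.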
Sources of Error
It is important to think about the sources of error to be able to make reasonable estimates of how much error we have.
Here are some sources of error (there may be others):
Incomplete Data — Say we’re trying to figure out the P(H) for Julius Caesar being in Alexandria at a particular date. We decide that P(H) (the prior) should be the rough proportion of time that he spent in the city in the years of his reign. We go through the documents and come up with a figure, let’s say it is 2%. This figure will have some error. It is possible we don’t have some records of some of his visits. It is possible that some of the recorded visits are mistaken, misleading or falsified. There will be error in the value we give. With data from ancient history this error can be large. It is particularly important not to assume that a lack of evidence for something means it didn’t happen. Counts of events are almost always going to be wrong, when our documentary record is so fickle.
Choice of Reference Class — When we figure out values for probabilities, we take a set of similar events, and we compare how often something is true among them. For Caesar in Alexandria, we choose the set of days when Caesar was anywhere, and see how often he was in Alexandria. This set of events is called the reference class. But there are several reference classes we can choose. To determine the prior of Julius in Alexandria, we might note that, on the day in question, local rulers from around the Mediterranean were invited to Alexandria. We might say the prior, therefore, is given by the proportion of significant rulers we know attended. Perhaps 80% of rulers who were invited, came. So the prior is 80%. The choice of reference class affects values hugely. Now it might be obvious that one reference class is a better choice than another (we could calculate the value based on what proportion of the whole world lived in Alexandria on that day, for example, which would obviously be a poor choice). But no matter how well we choose, there will be some degree of error. Carrier’s approach (not unreasonably) is to try to pick the best reference class we can, and then assume it is correct. In reality the choice, no matter how good, is a source of additional error.
Choice of Definition — When we ask whether something is true, we are asking a black and white question. It is, or it is not? But most questions we could ask could have intermediate answers. Was Caesar in Alexandria on that date? Well, he left on that date, does that count? Even if he left at 1am? Does it count if he was at his camp just outside the walls? Or if he was 20 miles away, but he was communicating by messenger? Once we get into questions like “was there a worldwide darkness as reported in the gospels”, the vagueness is rather significant: how much land area do we need to qualify as worldwide? How dark is enough to qualify? Does the darkness have to be uniform over the affected area? How long must it last? Should we exclude rather obvious natural possibilities? So whatever figure we give, there are errors based on how we interpret the question. We can phrase the question more tightly to reduce the error, but we may be in danger of answering the wrong question. We might insist on a purely supernatural total darkness from sun and moon covering the whole globe, only to find we end up disproving a claim that nobody wants to make. Or else we might find that our tight definition gives us no obvious reference class, or a reference class whose data is hopelessly incomplete. Instead we might allow a wider definition, and allow that ‘world-wide darkness’ could refer to a huge storm complex over the Mediterranean (the whole of the Roman orbis terrarum), but either get a huge range of possible outputs, or else show something we all agree on. It is hard to give definitions that are tight enough to avoid error, while loose enough to be interesting. I’ve posted before about the fact that the definition of “Jesus was a myth” is so vague as to basically include both mainstream scholars and mythicists.
So each term in our Bayes’s formula acquires errors from all of these three factors. And each factor compounds the errors in the others. As a result, for questions that are potentially vague, with a range of possible reference classes, each with poor quality or incomplete data, we should expect to have large errors.
So far I’ve assumed that errors are just random: that our estimates are as likely to be too high as too low. But this isn’t true.
Carrier, for example, seems to recognize this, and decides to use ‘a fortiori’ reasoning. Which is a way of saying “I’m going to bias the error in a way that doesn’t support my case, so I avoid the criticism that I may have accidentally biased it towards my conclusions.” This is admirable, and (barring the caveats around small values above) reasonable. But that only looks at bias from one source: bias from the available data. In reality Carrier (and anyone else doing this) will also be choosing the definitions, and choosing the reference classes, and there is no similar a fortiori process for determining which are the least favourable definitions to one’s cause, and which reference classes are the most troubling, and adopting those.
So, what can we learn?
Well, for one, the inputs to Bayes’s Theorem matter. Particularly small inputs. When we’re dealing with rare evidence for rare events, then small errors in the inputs can end up giving a huge range of outputs, enough of a range that there is no usable information to be had.
And those errors come from many sources, and are difficult to quantify. It is tempting to think of errors only in terms of the data acquisition error, and to ignore errors of definition and errors of reference class.
These issues combine to make it very difficult to draw any sensible conclusions from Bayes’s Theorem in areas where probabilities are small, data is low quality, possible reference classes abound, and statements are vague. In areas like history, for example.
This is simple to calculate, but may not be true. When we estimate X and Y, we might rely on the same underlying data, which means errors in one value might be related to errors in the second. In that case, we say the errors are correlated, and the way they are correlated changes the way the errors flow through the equation. For the purpose of this post, I’ll assume that errors are independent. I don’t think this is a valid assumption for using Bayes’s Theorem in history, because the person doing the estimates is using their biases to do so: so errors will be quite correlated. The effect of this is to make the analysis of error even more complex.
I can hear my internal math tutor weeping at some of the generalizations I’m making here. The ranges are really just an approximation of something called the probability density function (p.d.f.): which is a way of saying how likely any possible value might be (the values close to our guess are hopefully the most likely). And when you put the p.d.f.s through a formula you use a process akin to convolution, which generates another p.d.f. telling you the likelihood that your output is any particular value. Doing this in general is hard, though, and involves calculus. So you might have to trust me that the approximations that I’m giving in terms of ranges do reflect what would happen if you ran the full math.
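For readers who would rather check than trust, one way to avoid the calculus is to simulate: draw many random values for each input, push each set through the theorem, and look at how the outputs are spread. The Python sketch below does that. It is a Monte Carlo stand-in, not the exact calculation, and it assumes every input is equally likely to be anywhere in its range and that the errors are independent. The ranges and the name `sample_posteriors` are invented for illustration.

```python
import random

def posterior(p_h, p_e_h, p_e_nh):
    """Bayes's Theorem: P(H|E) from the prior and the two likelihoods."""
    return p_h * p_e_h / (p_h * p_e_h + (1 - p_h) * p_e_nh)

def sample_posteriors(h_range, e_h_range, e_nh_range, n=10_000, seed=0):
    """Sorted posteriors from n independent uniform draws of the inputs."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return sorted(
        posterior(
            rng.uniform(*h_range),
            rng.uniform(*e_h_range),
            rng.uniform(*e_nh_range),
        )
        for _ in range(n)
    )

# A small prior with the evidence on the steep part of the surface.
samples = sample_posteriors((0.005, 0.015), (0.90, 1.00), (0.0001, 0.05))

n = len(samples)
low = samples[n // 40]           # roughly the 2.5th percentile
median = samples[n // 2]
high = samples[n - n // 40 - 1]  # roughly the 97.5th percentile
print(f"middle 95% of outputs: {low:.2f} to {high:.2f} (median {median:.2f})")
```

Swapping the uniform draws for a distribution peaked at the best guess would change the shape of the output, and that choice is itself another assumption to argue over.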
There is also another source of bias at this point that isn’t easily mapped onto errors. Choice of Evidence: the person considering Bayes’s Theorem on a historical event gets to choose what features of the evidence they consider. Are the gospels “stories of a Divine Human”? If one uses “stories of a Divine Human” as the definition of one’s evidence, then one will end up with a different conclusion than with “stories of a Jewish Messiah”, for example. Perhaps you say we should have both, “stories of a Divine Jewish Messiah”. That would be better, but we’ll hit reference class problems with that (to what other figures do you look for similar evidence?). One can inadvertently introduce bias by considering different pieces of evidence. You might object by saying that Bayesian probability allows us to accumulate any number of pieces of evidence. We can keep adding extra claims and updating our probabilities. This is true, but it is rarely done (never in Carrier’s book), and even if done, some of these ways of defining or separating out the evidence are not independent (the messiah and divine claims above, for example), so cannot be accumulated by Bayes’s Theorem in the normal way.