# The Great H2G2 Researcher Count

Created | Updated Jan 28, 2002

[I *THINK* IT'S DONE... SOMEONE TELL ME WHETHER THIS IS READABLE!]

Questions, comments, complaints, observations, cash offers - post below!

Don't like maths? Scroll down to the 'Data' section to see the latest results!

Welcome to the Great H2G2 Researcher Count, a statistical study the likes of which has never been attempted before on these servers!

This entry is intended as a place for explanations of the methods being used, a place to summarize the latest data analyses, a place to post questions and comments, and a place for me to try my best to be entertaining! <softshoe>^{1} Here's a link to the Data-Gathering Headquarters.

Project Summary

In order to estimate how many active h2g2 researchers there are, we are using a list of researchers that Archangel Galaxy Babe (AGB) has met, IRL. The list can be found on her page, and contains about 35 non-italic researchers. The goal of the project is to estimate what percentage of the total number of active researchers are represented by this list. To estimate this percentage, we are making many observations, at different times, of what percentage of researchers online at a given moment are on AGB's list. The average of these percentages should be the number we're shooting for.

Once we know what percentage of the total active researchers AGB has met, it will be easy to calculate the number of total active researchers. For example, suppose that we decide that AGB has met 1/20, or 5% of the total active researchers. Then, we will say that:

p = 0.05

At any rate, **p** will be a number between 0 and 1. Now, AGB has met 46 currently active researchers, so we can write an equation which says:

p * Total = 46

Therefore:

Total Researchers = 46 / p

If, like in our example, p=0.05, then we will calculate that the Total number of active researchers is equal to:

46 / 0.05 = 920
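The arithmetic above is simple enough to sketch in a couple of lines of Python (using the example's made-up 5% figure, not real data):

```python
# Worked example from the text: suppose AGB has met 5% (p = 0.05)
# of all active researchers, and 46 researchers on her list are active.
p = 0.05         # example value of p, not a measured one
met_active = 46  # active researchers on AGB's list

total = met_active / p
print(total)  # 920.0
```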

Make sense? Remember, if you have any questions, just post them at the bottom of this entry, and I'll answer you as well as I can.

Bonus Info

Since we're getting the number-of-researchers-online data *anyway*, we'll end up with a good estimate of how many researchers are usually online at once. Combining this with our main target - total number of active researchers - we'll be able to figure out how much time the average researcher spends logged in, and you can see just how far above average you are!

Statistical Details

The two numbers that we're estimating are **R**, the average number of people logged in at once, and **P**, the percentage of people logged in whom AGB has met. Those capital letters indicate the actual, true, real values, which we'll hopefully get very close to. We'll use lower case letters, **r** and **p** for individual observations.

Let's talk about R first.

R - The average number of researchers online

The easiest number to calculate is the average of all the **r**'s reported by counters. This average is the best estimate of the true value **R**. Introducing some notation, let's say:

R = avg(r) = sum(r) / n

...where **n** is the number of observations we have. The problem with the above equation is that it's very unlikely that **R** actually *equals* **avg(r)**. Chances are that it's fairly close, but not actually equal. How close? To say that, we need to introduce another measurement, which statisticians love, called the 'Standard Deviation'^{2}.
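For instance, here's a quick Python sketch computing **avg(r)** for the first five **r** observations from the Data section of this entry:

```python
# First five r observations from the Data section
r_values = [23, 39, 51, 43, 49]

n = len(r_values)          # number of observations
avg_r = sum(r_values) / n  # best estimate of R
print(avg_r)  # 41.0
```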

Standard Deviation is a measurement of how 'spread out' data are. Briefly, if a bunch of numbers have an average of, say, 42, but the numbers themselves range all over the map, from 10 to 110, then they have a large standard deviation. If, OTOH, the numbers with an average of 42 only range between 39 and 45, then they have a small standard deviation. The standard deviation of **r** - let's call it **sd(r)** - is calculated with the following formula:

sd(r) = sqrt[ sum[(r - avg(r))^{2}] / (n-1) ]

If that formula doesn't make sense, don't worry. The point is that **sd(r)** is a measure of how much the individual **r** values are spread out from their average. That's why, in the numerator of that fraction, you see **r - avg(r)**; that's how far away a specific **r** is from the average. If most of them are far away, the standard deviation will be large.
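Continuing the same five-observation sketch, the formula looks like this in Python (for real work, the standard library's `statistics.stdev` computes the same n-1 version):

```python
import math

# Same five r observations as before (from the Data section)
r_values = [23, 39, 51, 43, 49]
n = len(r_values)
avg_r = sum(r_values) / n  # 41.0

# sd(r) = sqrt( sum[(r - avg(r))^2] / (n - 1) )
sd_r = math.sqrt(sum((r - avg_r) ** 2 for r in r_values) / (n - 1))
print(round(sd_r, 2))  # 11.14
```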

Finally, and this is why we bother to calculate standard deviation, we can calculate what's called a 'Confidence Interval'.

A confidence interval will look something like this:

95% Confidence Interval for R: R = 42 ± 12

So the trick is how to find the right plus-or-minus number that you can be 95% confident about^{3}. Statisticians like to call that number **E**. Who are we to argue? Here's the formula:

E = t * sd(r) / sqrt(n)

...where **t** is a Magic Number that comes from a table in a book!
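Putting the pieces together for the five-observation sketch: with n = 5, the 95% Magic Number is t = 2.776 (an assumed value looked up in a standard t table, for n-1 = 4 degrees of freedom):

```python
import math

r_values = [23, 39, 51, 43, 49]  # from the Data section
n = len(r_values)
avg_r = sum(r_values) / n
sd_r = math.sqrt(sum((r - avg_r) ** 2 for r in r_values) / (n - 1))

t = 2.776                    # 95% t value for n-1 = 4, from a t table
E = t * sd_r / math.sqrt(n)  # the plus-or-minus

print(f"R = {avg_r:.1f} ± {E:.1f}")  # R = 41.0 ± 13.8
```

With only five observations the plus-or-minus is huge; that's exactly why the project keeps collecting more.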

P - The percentage of researchers on Galaxy Babe's list

We estimate **P** with exactly the same process we used to estimate **R**. Again, we start with the average. (Remember, lower case **p** represents the individual observations, and capital **P** represents the true number which we're estimating.)

Best estimate of P = avg(p) = sum(p) / n

Again, it's very unlikely that the number we get by averaging all of our **p**'s together actually *equals* the true **P**.

To construct a confidence interval for **P**, we need to calculate another standard deviation. This one, we'll call '**sd(p)**'.

sd(p) = sqrt[ sum[(p - avg(p))^{2}] / (n-1) ]

Again, just like with **R**, we can now construct a confidence interval for P. The confidence interval will look like:

P = avg(p) ± E

...where **E** is determined by:

E = t * sd(p) / sqrt(n)

Again, **t** is a Magic Number, and what it actually equals depends on what **n** is and also on how confident you want to be. We're using 95% confidence intervals, and you can look up the proper value of **t** in a 't table^{4}'.

The Total Number of Active Researchers at H2G2

Now, this is where we use the formula from waaaay back at the beginning of the entry. The unforgettable:

Total = 46 / p

One thing, though, is that there's a funny thing about the list. Of the 46 active researchers there, 4 of them have changed their names enough that someone who doesn't know them might not recognize them under their new name. In other words, someone who isn't name-change savvy might only be able to count 42 of the 46 active researchers on the list.

The way that we're handling this problem is by splitting the difference, and including the error in our considerations. We'll just say that the number of active researchers on the list is 44 ± 2. Neat, huh? (2 out of 44 is about a 4-and-a-half percent error, on top of whatever error we've got anyway. Good enough for government work, y'know?)

So, our formula for calculating the Total Number of Active Researchers will be this:

Total = 44 / p

...where we'll use **avg(p)** for **p**, since it's the best estimate. The plus-or-minus will be taken into account in determining the plus-or-minus for our estimate of the Total. I'll spare you *that* formula. Post if you wanna know.
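Plugging in the current **avg(p)** from the Data section (0.085) gives the central estimate (the plus-or-minuses are left out here; this is just the point value):

```python
listed_active = 44  # 44 ± 2 active researchers on AGB's list
avg_p = 0.085       # avg(p) from the Data section

total = listed_active / avg_p
print(round(total))  # 518
```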

Bonus!

Finally, here's how we calculate how many hours per week the average researcher spends online. First we need to know how many researcher-hours per week are spent at the site. The formula for this is:

Total Hours = avg(r) * 24 * 7

Where **24** is the number of hours in a day, and **7** is the number of days in a week.

Now take the **Total Hours**, and divide by the **Total Researchers**.

Hours/Researcher (weekly) = Total Hours / Total Researchers

...with plus-or-minuses handled in the standard way, whatever that is.
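As a sketch, using the central values from the Data section (avg(r) = 44.9 and a Total of 518 researchers):

```python
avg_r = 44.9             # avg(r) from the Data section
total_researchers = 518  # central estimate from the Data section

total_hours = avg_r * 24 * 7  # researcher-hours per week at the site
hours_each = total_hours / total_researchers
print(round(hours_each, 1))  # 14.6
```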

Data - last updated 14:00 GMT, 05/12/2001

General

**n** so far: **55**

Corresponding Magic **t**: **2.0049**

R - Average number of researchers logged in at once

Values for **r**: **23, 39, 51, 43, 49, 64, 49, 21, 19, 24, 51, 37, 60, 61, 60, 68, 65, 63, 37, 21, 21, 23, 65, 63, 44, 36, 19, 67, 18, 48, 61, 58, 71, 69, 59, 36, 16, 30, 56, 76, 50, 54, 26, 24, 14, 63, 60, 63, 55, 33, 26, 40, 47, 31, 43**

**avg(r)** so far: **44.9**

**sd(r)** so far: **17.6**

95% confidence interval for **R** so far: **R = 44.9 ± 4.8**

P - Proportion of researchers that AGB has met

Values for **p**: **0.130, 0.128, 0.039, 0.070, 0.041, 0.016, 0.041, 0.095, 0.000, 0.000, 0.118, 0.081, 0.117, 0.197, 0.083, 0.074, 0.046, 0.048, 0.054, 0.048, 0.000, 0.174, 0.092, 0.159, 0.068, 0.083, 0.053, 0.030, 0.056, 0.188, 0.098, 0.086, 0.085, 0.058, 0.085, 0.056, 0.063, 0.100, 0.196, 0.145, 0.100, 0.111, 0.077, 0.042, 0.071, 0.159, 0.117, 0.159, 0.036, 0.030, 0.115, 0.075, 0.085, 0.129, 0.093**

**avg(p)** so far: **0.085**

**sd(p)** so far: **0.049**

95% confidence interval for **P** so far: **P = 8.5% ± 1.3%**

Big, Exciting Numbers!

Estimate for Total number of active researchers: **518 ± 103**

Estimate for hours/week spent at h2g2 by average researcher: **14.6 ± 4.5**
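For the curious, here's a sketch that recomputes the **r** summary statistics from the raw values listed above, using nothing but Python's standard library:

```python
import statistics

# The 55 r observations from the Data section
r_values = [23, 39, 51, 43, 49, 64, 49, 21, 19, 24, 51, 37, 60, 61, 60,
            68, 65, 63, 37, 21, 21, 23, 65, 63, 44, 36, 19, 67, 18, 48,
            61, 58, 71, 69, 59, 36, 16, 30, 56, 76, 50, 54, 26, 24, 14,
            63, 60, 63, 55, 33, 26, 40, 47, 31, 43]

n = len(r_values)                  # 55 observations
avg_r = statistics.mean(r_values)  # best estimate of R
sd_r = statistics.stdev(r_values)  # n-1 standard deviation

t = 2.0049                 # the Magic t for n = 55 (given above)
E = t * sd_r / n ** 0.5    # the plus-or-minus

print(f"n = {n}, avg(r) = {avg_r:.1f}, 95% CI: avg(r) ± {E:.1f}")
```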

Discussion of Error

There are two general types of error that come up in experiments. One is called 'Random Error' and the other is called 'Systematic Error'.

Random Error

Random error isn't so bad. It's the type of error that comes out in the wash. Random error is why we have plus-or-minuses attached to all of our estimates. Basically, random error refers to the fact that, when you make an observation at a random time, you're probably not going to hit on the average. That's why we go to all the trouble of calculating 'standard deviations' and looking at 't tables' and such.

Systematic Error

Systematic error, on the other hand, can be a problem. While random error is just as likely to err in one direction as in the other (and therefore washes out), systematic error is the kind that systematically pushes the results always in one direction.

There are three main possible sources of systematic error in this particular study, two relating to how our data was gathered, and one relating to multiple accounts.

The main estimate in this study, that of Total Researchers, is based on the assumption that members of AGB's list are neither more nor less likely to be found online than the average active researcher.^{5} In short, if researchers AGB has met are *more* likely to be online than others, then our estimate will be *low*; contrariwise, if researchers AGB has met are *less* likely to be online than others, then our estimate will be *high*.

The other possible source of systematic error is that our observations aren't necessarily taken at times which are spread perfectly randomly throughout the week. Actually, our observers have been very good about this, but it's hard to avoid a slight imbalance, just because most of us are (roughly) diurnal, and most of us live either in Europe or in North America, which leaves a few time zones under-represented. The effect of this is similar to the effect of Possible Source of Systematic Error #1. Namely, if we are *more* likely to catch AGB's people online because of geography, then our estimate will tend to be *low*, and contrariwise, vice versa, et cetera, viva zaphoda.

Researchers holding multiple accounts are an issue. If a researcher who is *not* on AGB's list holds multiple accounts, and they don't keep more than one of them open at the same time, then there's no problem^{6}. The estimate will come out the same, whether they log on from one account all the time, or from a different one every day. If they keep more than one account open at once, however, it will appear that they are more than one person. I don't think many people do this, so we're not going to worry about it.

Researchers who *are* on AGB's list and who have multiple accounts are a more complicated issue. Assuming that only one of their accounts is listed with AGB, they could be logged on from a different account, and they would not necessarily be counted. This would make **P** come out low, which would make our estimate for Total Researchers come out high.

Finally, there's Phil^{7}. He has two active screennames, 'Other Person' and 'Solsbury', the second of which is listed separately on AGB's list. This tends to cancel out the effect of the last paragraph, as far as Phil's concerned, so he's not a problem. You're not a problem, Phil.

Regarding number 1, it is my personal hunch that people AGB has met are *more* likely to be online than the average researcher. Why? Well, they're interested enough in h2g2 that they've bothered to go to meet-ups, for example. This seems to demonstrate a level of commitment which probably manifests itself in longer-than-average hours online as well.

On point number 2, I would also guess that, by tending to make observations at the times of day that we inevitably tend to make them, we are *more* likely to catch AGB's people online. Why? Oh, the observations are made by people who have *some* contact with AGB, or they wouldn't have found their way here. Those people are more likely to keep similar hours to other people she's met, I guess.

So, both of these possible sources of systematic error are more likely to bias our estimate of Total Researchers *downward*. Meanwhile, the third effect is more likely to bias our estimate *upward*. Do they balance out? If not, which bias outweighs the other? Beats me. If I had to guess, I'd say that effects #1 and #2 are more significant than effect #3, so it's more likely that our estimate is too low than that it's too high. I'll bet we're pretty close, though.

Incidentally, these errors also carry through to the other numbers we're estimating. The number **R**, the average number of researchers online at once, is more likely to be over-estimated than under-estimated^{8}, and so is the Bonus info, about Average Hours per Week.

Acknowledgements

Many thanks to:

Archangel Galaxy Babe, for hosting the data-gathering, for meeting researchers and keeping track of their names, and for being overall archangelic and froody.

Everyone who has contributed to the data gathering - if not for you and your diligence in time-zones around the planet, this never could have happened.

The BBC and the Italics, without whom... no h2g2, no study, no fun.

Blaise Pascal, for inventing the science of Statistics all those years ago. Blaise, you the man.

Guinness Breweries, for inspiring the invention of the so-called 'student t' distribution^{9}.

Stevie Wonder, just because.

^{1}No <softshoe> smiley? Where's the <softshoe> smiley!?

^{2}A 'standard deviation' is NOT the term used to describe the average resident of a state-run correctional facility.

^{3}You can't get 100% confidence with statistics. You want 100% confidence, see a priest.

^{4}That's really what it's called! I don't *think* it's a pun...

^{5}I didn't realize this when we first started, and it occurred to me about a day and a half later.

^{6}Hi Willem!

^{7}Hi Phil!

^{8}It's really 42!

^{9}No joke! Someone was doing some analysis for Guinness and needed a tool better adapted to small-sample statistics than the 'standard normal' distribution. He invented the oh-so-leptokurtic^{10} t-distribution, and we're all really impressed, I can tell you.

^{10}Don't ask.