Friday, May 18, 2007

An estimate of the Probability of getting a 4 Letter Acronym Formed by Blog Subtitles

Kent at The Digression blog asks: What is the probability that the first letters of the words in a blog's subtitle produce a meaningful acronym?

Apparently the subtitle for my blog makes the acronym MPEG. MPEG (Moving Picture Experts Group) is the name behind a family of file formats for movies, or motion pictures.

Well, be careful what you ask for, 'cause here's my answer! :-)

First, let's simplify this by asking: what is the probability of producing a meaningful 4-letter acronym given that the blog has a 4-word subtitle?

The probability of getting a meaningful 4-letter acronym given that the blog has a 4-word subtitle is: P(4 letter meaningful Acronym | blog has 4 word subtitle).

Using Bayes' rule of conditional probability, we can say that P(4A | blog4subtitle) = P(blog4subtitle | 4A) x P(4A) / P(blog4subtitle).

The P(4A) = #meaningful 4 letter Acronyms / #of Possible 4 letter Acronyms

The P(blog4subtitle) = #of blogs with 4 word subtitles / #of blogs, the probability that out of all blogs, the chosen blog has a 4 word subtitle.

The conditional probability P(blog4subtitle | 4A) = (probability of getting a blog with 4 words in the subtitle given that it has a meaningful 4-letter acronym) = 1

So then P(4A | blog4subtitle) = 1 x P(4A) / P(blog4subtitle).

Let's proceed shall we?

A rough estimate of the # of meaningful 4-letter acronyms is... well, that's kind of hard. OK, so here's where we can get all statistical.

On Wikipedia we can find a list of all acronyms known to Wikipedia. Sampling the population of A acronyms, I can count that for each section of the A page (26 sections; the acronyms are broken down into the AA, AB, AC...AZ sections) there are about 10 4-letter acronyms. Let's assume that the actual number per section is distributed according to a normal distribution ~ N(10, 2), i.e. the number of 4-letter acronyms per section is 10 +/- about 2. We can do this because for large n, the binomial distribution is approximated by a normal distribution.

There are 26 sections per page and 26 pages, thus 676 sections. Taking this into account, we can say that the number of 4-letter acronyms in existence has a sampled distribution of ~ N(676*10, 676*2), which means that there will be on average an estimated 6760 4-letter acronyms, based on our small sample of sections from the A acronyms on Wikipedia.

On the other hand, computing the number of possible 4-letter acronyms is easy: 26 x 26 x 26 x 26 = 456976.
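As a quick check of the counting above, here is a small Python sketch. The variable names are mine, and the 6760 figure is just the post's rough estimate (676 sections times about 10 acronyms each), not a real count.

```python
import string

# Every 4-letter string over the 26-letter alphabet counts as a "possible" acronym.
alphabet_size = len(string.ascii_uppercase)  # 26
possible_acronyms = alphabet_size ** 4

# Rough estimate from the post: 26 pages x 26 sections x ~10 acronyms per section.
sections = 26 * 26
estimated_meaningful = sections * 10

# P(4A): the share of possible 4-letter strings that are meaningful acronyms.
p_4a = estimated_meaningful / possible_acronyms

print(possible_acronyms)
print(estimated_meaningful)
print(round(p_4a, 4))
```

So under these assumptions, only about 1.5% of all possible 4-letter strings would be meaningful acronyms.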

Estimating the number of blogs with 4-word subtitles is also difficult. But again, let's say that the number of words in the subtitle of a blog is distributed according to a binomial distribution with mean 6. Assuming that the max number of words in the subtitle is 20 (so p = 6/20 = .3), the probability of getting a 4-word subtitle can be approximated by 20!/[4! x 16!] x (.3)^4 x (.7)^16 = .13. This means that the number of blogs with 4-word subtitles would be the probability of getting a 4-word subtitle times the number of blogs available.

The available # of blogs is 66 million, according to BlogHerald.

So P(blog4subtitle) = .13 x 66 million / 66 million = .13
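The binomial arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function and variable names are made up for illustration, and the mean of 6 and max of 20 words are the post's assumptions, not measured values.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

max_words = 20                    # assumed maximum subtitle length
mean_words = 6                    # assumed average subtitle length
p_word = mean_words / max_words   # 0.3, since a binomial mean is n * p

p_four_word_subtitle = binomial_pmf(4, max_words, p_word)
print(round(p_four_word_subtitle, 2))
```

The 66 million blogs cancel out of the ratio, so only this probability matters for the final answer.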

Thus our final estimate of the probability of getting a meaningful acronym given a four-word subtitle is distributed according to a N(6760/456976/.13, 1352/456976/.13) distribution. (It is a distribution because I had to estimate the number of 4-letter acronyms in order not to have to count them.) Thus I can't be 100% sure what the real probability is. However, the mean probability from my estimation is about 11.4% with a variation of about 2%. And I can say with 99% confidence that the true probability of obtaining a meaningful 4-letter acronym given my 4-word subtitle lies between 6% and 16%.

That was fun, wasn't it? With statistics anything is possible to estimate!
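For anyone who wants to replay the final numbers, here is a rough Python sketch. The names are mine, and it simply carries over the post's own assumptions: the 6760 estimate, the spread scaled linearly with the number of sections (676*2), and a 4-word subtitle probability of .13. It is a replay of the arithmetic, not a more rigorous analysis.

```python
from statistics import NormalDist

possible_acronyms = 26 ** 4        # 456976
meaningful_mean = 676 * 10         # estimated count of meaningful 4-letter acronyms
meaningful_spread = 676 * 2        # spread, scaled the same way as in the post
p_subtitle = 0.13                  # assumed P(blog has a 4-word subtitle)

# Bayes' rule with P(blog4subtitle | 4A) taken to be 1, as in the post.
mean_p = meaningful_mean / possible_acronyms / p_subtitle
spread_p = meaningful_spread / possible_acronyms / p_subtitle

# Two-sided 99% interval under a normal approximation.
z_99 = NormalDist().inv_cdf(0.995)
low = mean_p - z_99 * spread_p
high = mean_p + z_99 * spread_p

print(f"mean = {mean_p:.3f}, spread = {spread_p:.3f}")
print(f"99% interval = ({low:.3f}, {high:.3f})")
```

With the unrounded spread the interval comes out a little wider than the one quoted, which uses the spread rounded to 2%.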

2 comments:

Kent said...

*jaw drops to floor*

What can I say? There I was just popping off, and now I read an actual worked out solution! Very cool!

Even better, I could actually sort of follow along.

Nice job - you put the "fun" in "MPEG".

Now, my next task is to verify your work. I shall now log on to each and every blog and check the subtitles. Could take a while. Better get started...

Nathan said...

My favorite acronym is TLA. It stands for Three Letter Acronym. The great thing is that TLA is a TLA. There are a lot of TLAs in the IT (another acronym) field.