Lies, Damn Lies, and Anti-Spam Vendor Press Releases

By J.D. Falk

There's a lot of chatter about a recent study purporting to show that 29.1% of internet users have bought something from spam. As ITWire reported, "Marshal were not only interested in how many people were purchasing from a spam source, but also what goods and services they were buying. Perhaps less surprisingly this revealed that sex and drugs sell well online." But at downloadsquad, Lee Mathews discovered the shocking truth: "the survey only involved 600 people."

Lee goes on to ask "is it worse that about 180 of those people bought products from spam, or that media outlets are willing to jump all over a statistic that comes from a sampling of less than .0001% of the roughly 360 million people currently using the internet?" I'd go with the latter. Vendors who make such outrageous claims only make it more difficult for the real facts to be revealed — and the facts are scary enough without any self-serving augmentation.

This article was originally posted on Box of Meat.

By J.D. Falk, Internet Standards and Governance.

Related topics: Email, Spam

Comments

Small samples and statistics validity
Valdis Kletnieks – Aug 22, 2008 6:10 AM PDT

Given that the media are willing to jump all over polls that show one candidate for president is 3 or 4 percent ahead of the other, when those polls are based on similarly small sample sizes, yes, it is OK.

In reality, the "sample mean" (the value measured in the sample) ends up being very close to the "population mean" even for relatively small numbers of samples - you can start doing some statistical tests with a sample size as small as 30. The number called the "margin of error" is an estimate of how far apart the sample and population means probably are (usually stated at a 95% confidence level). For a sample size of 600, the margin of error is about 4 percentage points - which means that we can be 95% sure that if the sample said 29%, the real value is between 29-4 and 29+4, or the range 25-33%. And if you think about it, it doesn't really matter - we have the same problem if one out of four, or one out of three, are buying from spam.
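As a rough check on that margin of error, here is a minimal sketch using the usual normal approximation for a proportion. It assumes a simple random sample, which this survey may well not have been. The function name `margin_of_error` and the choice of z = 1.96 for 95% confidence are my own, not anything taken from the study.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion.

    p_hat: proportion observed in the sample
    n:     number of respondents
    z:     critical value (1.96 corresponds to roughly 95% confidence)
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

n = 600          # respondents reported for the survey
p_hat = 0.291    # share who said they bought from spam

moe = margin_of_error(p_hat, n)
worst_case = margin_of_error(0.5, n)   # p = 0.5 gives the widest interval
low, high = p_hat - moe, p_hat + moe

print(f"margin of error at p = 0.291: {moe * 100:.1f} points")
print(f"worst-case margin (p = 0.5):  {worst_case * 100:.1f} points")
print(f"95% interval: {low * 100:.1f}% to {high * 100:.1f}%")
```

Under those assumptions the half-width comes out at roughly 4 percentage points or a little less, which is in the same ballpark as the estimate above. Either way the conclusion does not change: somewhere between one in four and one in three.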

A much *bigger* issue, and one that *cannot* be fixed by using a bigger sample, is "selection bias". For a survey to be statistically valid, you have to establish that the sample reflects the population. For some things like quality control on a production line, it's pretty easy - pick 1 of every 200 widgets at random for testing, and you're done. For things like political surveys, you have to pre-filter your sample for "likely voters" - you *don't* want to include children, non-citizens, felons, and others not able or likely to vote. If it's a survey on a website, it becomes more complicated - first, you have the fact that the people who visit the website may not be representative of Internet users in general (for instance, how did they *find* the survey to take it?). If the survey was advertised via a banner ad or mass mailing, then the results are even more likely to be biased - because the people who responded are the people who respond to the same techniques when used by spammers. So all you've proven is "the people who click on banner ads and reply to spam survey e-mails are likely to click on banner ads and reply to spam e-mails".

And even if they had 100,000 responses rather than 600, *THAT* flaw would still remain.
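To make that concrete, here is a small simulation sketch. Every number in it is hypothetical and chosen only for illustration: a "true" buyer rate of 5%, and survey response rates of 50% for people who buy from spam versus 5% for everyone else. Nothing here is taken from the actual study; the point is only to show what happens when the people who answer differ from the people who don't.

```python
import random

# All three rates are hypothetical, picked only to illustrate the effect.
TRUE_BUYER_RATE = 0.05     # share of all users who have bought from spam
RESPOND_IF_BUYER = 0.50    # chance a spam buyer answers the survey invite
RESPOND_IF_NOT = 0.05      # chance anyone else answers it

def run_survey(n_responses, rng):
    """Invite random users until n_responses have answered.

    Returns the share of respondents who are spam buyers.
    """
    buyers = 0
    answered = 0
    while answered < n_responses:
        is_buyer = rng.random() < TRUE_BUYER_RATE
        respond_prob = RESPOND_IF_BUYER if is_buyer else RESPOND_IF_NOT
        if rng.random() < respond_prob:
            answered += 1
            buyers += is_buyer   # True counts as 1
    return buyers / answered

rng = random.Random(2008)   # fixed seed so the run is repeatable
small = run_survey(600, rng)
large = run_survey(100_000, rng)

print(f"true rate in the population: {TRUE_BUYER_RATE:.1%}")
print(f"survey of 600 says:          {small:.1%}")
print(f"survey of 100,000 says:      {large:.1%}")
```

With these made-up rates, the expected share of buyers among respondents is 0.025 / 0.0725, or about 34%, no matter how many responses are collected. The bigger survey just gives a more precise estimate of the wrong number.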

(There are other survey design issues, such as "are people likely to have replied honestly?". How do you filter out pranksters who say "yeah, I've bought " even though they never actually have? However, the selection bias issue is big enough to overwhelm those other issues.)