Silly Bing

John Levine

Bing is Microsoft's newish search engine, whose name I am reliably informed stands for Bing Is Not Google.

A couple of months ago, as an experiment, I put up a one-page link farm at wild.web.sp.am. As should be apparent after about three seconds of clicking on the links there, each page has links to 12 other pages, and each page's host name is made of three first names, like http://aaron.louise.celia.web.sp.am. The pages are generated by a small Perl script and a database of a thousand first names. All the pages have the same IP address, although there are about a billion possible domains (1000 cubed, since there are three names in each page name). I forgot about it until earlier this week, when the disk with my web logs filled up.
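For illustration, here is a minimal Python sketch of how such a page generator could work. The real script is written in Perl and is not shown here, so the short name list, the function names, and the HTML layout are all assumptions; only the 12 links per page, the three-name host pattern, and the web.sp.am suffix come from the description above.

```python
import random

# Assumption: a tiny stand-in for the database of a thousand first names.
NAMES = ["aaron", "louise", "celia", "david", "maria", "peter"]
BASE_DOMAIN = "web.sp.am"   # suffix taken from the example URL above
LINKS_PER_PAGE = 12         # each page links to 12 other pages


def random_host(rng):
    """Build a host name from three randomly chosen first names."""
    return ".".join(rng.choice(NAMES) for _ in range(3)) + "." + BASE_DOMAIN


def render_page(host, rng):
    """Return minimal HTML for one page: 12 links to other generated hosts."""
    links = [f"http://{random_host(rng)}/" for _ in range(LINKS_PER_PAGE)]
    items = "\n".join(f'<li><a href="{url}">{url}</a></li>' for url in links)
    return f"<html><body><h1>{host}</h1>\n<ul>\n{items}\n</ul></body></html>"


def possible_hosts(name_count):
    """Three independent name slots give name_count cubed host names."""
    return name_count ** 3


if __name__ == "__main__":
    print(render_page("aaron.louise.celia.web.sp.am", random.Random()))
    print(possible_hosts(1000))
```

Because every page is generated from nothing but the requested host name and the name list, the whole farm needs no stored pages at all, which is why a billion host names can sit behind one IP address.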

My web logs are normally 10 to 15 megabytes a week, but all of a sudden the logs ballooned past a gigabyte. A quick look at the logs revealed that my web server was getting hammered by the bingbot.
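A "quick look at the logs" of this kind can be as simple as counting requests per user agent. The sketch below assumes the common "combined" access log format, where the user agent is the last quoted field on each line; the sample lines and the bot name in them are invented for the example and are not taken from the real logs.

```python
import re
from collections import Counter

# Assumption: "combined" log format, user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines):
    """Count how many log lines each user agent string accounts for."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines using documentation-range addresses.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2000:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [01/Jan/2000:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.99 - - [01/Jan/2000:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "SomeBrowser/5.0"',
]

if __name__ == "__main__":
    for agent, hits in count_user_agents(SAMPLE).most_common():
        print(hits, agent)
```

Sorting by count puts the heaviest client at the top, which is usually all it takes to spot a crawler that has run away.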

Every search engine has a "spider" or "bot" that visits web pages to collect data for its index. It's quite normal to see a fair number of log entries from bots as various search engines wander around your web pages looking to see what's changed.

But it was not normal to see the bingbot hammering on my link farm, ten queries a second, day after day. When I noticed it, the bingbot had already visited about 15 million times, fetching 15 million nearly identical pages. I added a robots.txt file, telling bingbot to go away. It didn't help, which wasn't that surprising. Since each page is in a different domain, each page could hypothetically have its own robots file, so while the robots file should stop future indexing, it won't affect any pages that Bing had queued up from previous visits. How many did it have queued up? A lot. Bing scooped up over a million copies of the robots file, at which point I adjusted the web server configuration to return an error page when the bingbot tried to fetch a link farm page, but to return the robots file normally. That still didn't help. It fetched a lot of robots files and a lot of error pages, each for a different domain, I think.
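As a rough sketch of that setup, the Python below pairs a robots file that excludes only bingbot with the server-side rule described above: robots.txt is always served, link farm pages get an error when the request comes from bingbot. The exact file contents, the 403 status, and the function names are assumptions, since the actual robots file and server configuration are not shown. The `allowed` helper uses the standard library's `urllib.robotparser` to check what a well-behaved crawler would conclude from the file.

```python
from urllib import robotparser

# Assumption: one standard way to exclude a single crawler from a whole host.
ROBOTS_TXT = "User-agent: bingbot\nDisallow: /\n"


def respond(path, user_agent):
    """Return (status, body) for a request to any link farm host."""
    if path == "/robots.txt":
        return 200, ROBOTS_TXT          # robots file is served to everyone
    if "bingbot" in user_agent.lower():
        return 403, "Forbidden\n"       # assumed error status for bingbot
    return 200, "<html>link farm page</html>\n"


def allowed(user_agent, url):
    """Ask the standard robots parser whether user_agent may fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://aaron.louise.celia.web.sp.am/"
    print(allowed("bingbot", url), allowed("Googlebot", url))
    print(respond("/", "Mozilla/5.0 (compatible; bingbot/2.0)")[0])
```

A robots file applies only to the host it was fetched from, so a crawler holding millions of queued URLs on distinct host names has to fetch the file once per host before it learns that each one is off limits.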

Since the link farm has its own IP address, it was easy to add low-level packet filters to reject all traffic to that address from the 12 addresses of the bingbot. I unfiltered for a few minutes today, and it's still hammering as hard as ever.
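The matching rule behind such a filter is simple enough to sketch in a few lines of Python. This only models the decision a packet filter would make; it is not firewall configuration, and every address below is a placeholder from the documentation ranges, not the link farm's real address or any real bingbot address.

```python
import ipaddress

# Assumption: placeholder addresses from the reserved documentation ranges.
FARM_ADDRESS = ipaddress.ip_address("192.0.2.80")
BLOCKED_SOURCES = {
    ipaddress.ip_address(f"198.51.100.{n}") for n in range(1, 13)  # 12 sources
}


def should_reject(src, dst):
    """Reject only traffic from a blocked source to the link farm address."""
    return (
        ipaddress.ip_address(dst) == FARM_ADDRESS
        and ipaddress.ip_address(src) in BLOCKED_SOURCES
    )


if __name__ == "__main__":
    print(should_reject("198.51.100.5", "192.0.2.80"))   # blocked source, farm
    print(should_reject("198.51.100.5", "192.0.2.81"))   # same source, other site
```

Matching on the destination as well as the source is what keeps the filter from touching the crawler's traffic to other sites on the same machine.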

While this isn't doing any great damage, if I didn't have the skills to look at logs and write suitable packet filters, or if I were paying by the byte for network traffic, it could have crashed my system or cost me a lot of money.

Bing is not the only search engine to have discovered my link farm. Google's Googlebot-Mobile/2.1 visits the link farm every few seconds, claiming to be various kinds of Japanese mobile phones. But Bing's traffic is orders of magnitude more than everyone else's put together. (This is just a problem for the link farm; the rest of my web sites get along with Bing just fine.)

My main question is how these highly sophisticated search engines have failed to notice that they have fetched several million almost identical pages from the same IP address, and failed to blacklist it. I have reason to believe that Bing management is aware of the issue, so maybe they'll stop it some time. Or maybe even let on what happened.
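To make the question concrete, here is one simple way a crawler could notice this pattern: group fetched pages by server IP address and count how many are near-duplicates of the first page seen from that address. This is only a sketch of a generic technique (Jaccard similarity over word shingles); it says nothing about how Bing or any other search engine actually works, and the threshold, the minimum page count, and the sample pages are all assumptions.

```python
from collections import defaultdict


def shingles(text, size=3):
    """Return the set of overlapping word runs of the given size."""
    words = text.split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}


def jaccard(a, b):
    """Share of shingles two pages have in common (1.0 means identical sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def suspicious_ips(pages, threshold=0.6, min_pages=3):
    """pages is an iterable of (ip, text) pairs.

    Return the IPs from which at least min_pages pages, counting the first
    one, are near-duplicates of the first page seen from that IP.
    """
    first_seen = {}
    similar = defaultdict(int)
    for ip, text in pages:
        current = shingles(text)
        if ip not in first_seen:
            first_seen[ip] = current
            similar[ip] = 1
        elif jaccard(first_seen[ip], current) >= threshold:
            similar[ip] += 1
    return {ip for ip, count in similar.items() if count >= min_pages}


# Invented sample data: one farm-like IP and one ordinary IP.
TEMPLATE = ("links for {name} : click any of the twelve links below "
            "to visit another page in this farm")
SAMPLE_PAGES = [
    ("192.0.2.80", TEMPLATE.format(name="aaron")),
    ("192.0.2.80", TEMPLATE.format(name="louise")),
    ("192.0.2.80", TEMPLATE.format(name="celia")),
    ("203.0.113.7", "weather report for monday morning"),
    ("203.0.113.7", "recipe for apple pie with cinnamon"),
    ("203.0.113.7", "train timetable update for the holidays"),
]

if __name__ == "__main__":
    print(suspicious_ips(SAMPLE_PAGES))
```

A real crawler would need something cheaper than pairwise set comparison at this scale, but even a crude per-address counter like this one would trip long before 15 million fetches.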

By John Levine, Author, Consultant & Speaker.

You've been BINGED! Phil Howard  –  Jul 16, 2012 1:04 PM PDT

You've been BINGED!
