There is a current and active way to knock a website out of Google's search engine
results. It's simple and effective. This information is already in the public
domain, and the more people who know about it, the more likely it is that
Google will do something about it. This article will tell you how it works, how
a website gets knocked out of the search engine rankings, and, most
importantly, how to defend your own website from having it happen to you.
To understand this exploit, you must first understand Google's
Duplicate Content filter. It's simply described thus: Google doesn't want you to
search for "blue widget" and have the top 10 search results all be copies of the
same article on how great blue widgets are. They want to give you ONE copy of
the Great Blue Widget article, and 9 other different results, just on the off
chance that you've already read that article and the other results are actually
what you wanted.
To handle this, every time Google spiders and indexes a page, it checks
whether it already has a page that is predominantly the same, a duplicate page
if you will. Exactly how Google works this out, nobody outside Google knows, but it is
likely to be a combination of some or all of: page text length, page title,
headings, keyword densities, checks for exactly copied sentence fragments and so on. As a
result of this duplicate content filter, a whole industry has grown up around
trying to get round the filter. Just search for "spin article".
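Nobody outside Google knows the real checks, so treat the following purely as an
illustration of the "exactly copied sentence fragments" idea, not as Google's
method. It is a minimal Python sketch that breaks each page's text into
overlapping five-word runs ("shingles") and measures how many of them two pages
share. The function names, the five-word window and the 0.8 threshold are all
assumptions made up for this example.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping word runs ("shingles") found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=5):
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty pages are trivially identical
    return len(a & b) / len(a | b)


def looks_duplicate(text_a, text_b, threshold=0.8):
    """Hypothetical rule: call two pages duplicates above the threshold."""
    return similarity(text_a, text_b) >= threshold
```

A straight copy of your page scores 1.0 against the original, which is exactly
why a proxy copy is so easy to flag as a duplicate, and why "spun" articles try
to change enough wording to push the score down.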
Getting back to the story here: Google indexes a page and, let's say, it fails
its duplicate content check. What does Google do? These days, it dumps that
duplicate page in Google's Supplemental Index. What, you didn't know that Google
has 2 indexes? Well, it does: the main one, and a supplemental one. Two things
are important here: Google will always return results from its main index if
it can, and it will only go to the supplemental index if it doesn't get
enough results from the main index. What this means is that if your page is in the
supplemental index, it's almost certain that you will never show up in the
Search Engine Results Pages, unless there is next to no competition for the
phrase that was searched for.
This all seems pretty reasonable to me, so what's the problem? Well, there's
another little step I haven't mentioned yet. What happens if someone copies your
page, let's say the homepage of your business website, and when Google indexes
that copy, it correctly determines that it's a duplicate? Now that Google knows about
2 pages that are duplicates, it has to decide which to dump in the
supplemental index, and which to keep in the main one. That's pretty obvious,
right? But how does Google know which is the original and which is the copy?
It doesn't. Sure, it has some clever algorithms to work it out, but even if
they are 99% accurate, that leaves a lot of problems for the 1% of times they
get it wrong!
And this is the heart of the exploit: if someone copies your website's
homepage, say, and manages to convince Google that *their* page is the original,
your homepage will get tossed into the supplemental index, never to see the
light of day in the Search Engine Results Pages again. In case I'm not being
clear enough, that's bad! But wait, it gets worse:
It's fair to say that in the case of a person physically copying your page
and hosting it, you can often get them to take it down through the use of
copyright lawyers and cease and desist letters to ISPs and the like, followed by a
quick "Reinclusion Request" to Google. But recently there's a new threat that's
a whole lot harder to stop: the use of publicly accessible Proxy websites. (If
you don't know what a Proxy is, it's basically a way of making the web run
faster by caching content closer to the people requesting it. In principle,
proxies are generally a good thing.)
There are many such web proxies out there, and I won't list any here. However,
I will describe the process: they send out spiders (much like Google's) and they
spider your page, take your content, then they host a copy of your website on
their proxy site, nominally so that when their users request your page, they can
serve up their local copy quickly rather than having to retrieve it from your
server. The big issue is that Google can sometimes decide that the proxy copy of
your web page is the original, and yours is not.
Worse again, there's some evidence that people are deliberately and
maliciously using proxy servers to cache copies of web pages, then using normal
(white and black hat) Search Engine Optimization (SEO) techniques to make those
proxy pages rank in the search engines, increasing the likelihood that your
legitimate page will be the one dumped by the search engines' duplicate content
filters. Danger, Will Robinson!
Even worse still, some of the proxy spiders actively spoof their origins so
that you don't realise that it's a spider from a proxy: they pretend to be
Googlebot, for example, or a spider from Yahoo. This is why the major search engines
publish guidelines on how to identify and validate their own spiders.
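The validation the search engines describe boils down to a two-way DNS check:
look up the hostname for the visiting IP address, confirm that hostname belongs
to the search engine's own domain, then look the hostname up again and confirm
it points back to the same IP. Here is a minimal sketch of that check in Python.
The function name, the injectable lookup parameters and the suffix list are
assumptions made for this example, so check each search engine's current
guidelines for the domains it actually uses.

```python
import socket

# Assumed hostname suffixes for genuine crawlers; verify against each
# search engine's own published guidance before relying on this list.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".crawl.yahoo.net")


def _reverse(ip):
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    """Forward DNS: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_genuine_spider(ip, reverse=_reverse, forward=_forward):
    """Return True only if ip passes the reverse-then-forward DNS check."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(TRUSTED_SUFFIXES):
            return False  # hostname is not in a trusted crawler domain
        # A spoofer can fake its reverse DNS, so confirm the round trip.
        return ip in forward(host)
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

Note that the user agent string never enters into it: anyone can claim to be
Googlebot in a header, but they can't make Google's DNS vouch for their IP
address. The lookups are passed in as parameters only so the check can be tried
out without touching the network.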
Now for the big question: how can you defend against this? There are several
possible solutions, depending on your web hosting technology and technical
competence:
Option 1
- If you are running Apache and PHP on your server, you can set the
webhost up to check for search engine spiders that purport to be from the main
search engines and, using PHP and the .htaccess file, block proxies from
other sources. However, this only works for proxies that are playing by the rules
and identifying themselves correctly.
Option 2
- If you are using MS Windows and IIS on your server, or if you are
on a shared hosting solution that doesn't give you the ability to do anything
clever, it's an awful lot harder and you should take the advice of a
professional on how to defend yourself from this kind of attack.
Option 3
- This is currently the best solution available, and applies if you
are running a PHP or ASP based website: you set the robots meta tag on ALL pages to
noindex and nofollow, then you implement a PHP or ASP script on each page that
checks for valid spiders from the major search engines and, if the visitor is one, resets the
robots meta tag to index and follow. That way, a copy taken by a proxy carries
the noindex tag and should stay out of the index, while the real spiders still
index your original. The important distinction here is that it's
easier to validate a real spider, and to discount a spider that's trying to
spoof you, because the major search engines publish processes and procedures to
do this, including IP lookups and the like.
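To make the logic of Option 3 concrete, here is a minimal sketch of the tag
switch, written in Python for illustration rather than the PHP or ASP you would
actually run on such a site. The function name and its single parameter are
hypothetical, and the sketch assumes your page script has already decided
whether the visitor is a validated search engine spider.

```python
def robots_meta_tag(is_verified_spider):
    """Build the robots meta tag for one page request.

    Everyone gets "noindex, nofollow" by default, so a copy scraped by a
    proxy carries that tag. Only a visitor already validated as a real
    search engine spider is served "index, follow".
    """
    content = "index, follow" if is_verified_spider else "noindex, nofollow"
    return '<meta name="robots" content="%s">' % content
```

The key design choice is the default: a visitor has to prove it is a real
spider to see the indexable version, so a spoofer that fails validation is
treated the same as any other proxy.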
So, stay aware, stay knowledgeable, and stay protected. And if you see that
you've suddenly been dumped from the Search Engine Results Pages, now you might
know why, how and what to do about it.
About The Author
Sophie White is an Internet Marketing and Website Promotion Consultant at Intrinsic
Marketing, an SEO and Pay-Per-Click firm dedicated to supplying Better Website ROI.