How Google’s Web Crawler Works
Have you ever wanted to know how Google’s web crawlers see your web pages? Curious about what happens when Google requests a web page? If so, let’s learn about the crawling process of the Google search engine.
Since Google is a search engine for many different media types, it also has different crawlers for different purposes.
For general web search, Google uses Googlebot, and Googlebot will honor the directives you place in your site’s robots.txt file. For example, a rule group addressed to Googlebot with an empty Disallow directive tells Googlebot that it can crawl your entire website.
But what if you want to tell Google that certain parts of your website shouldn’t be crawled? In that case, you would list those paths in Disallow directives.
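The two cases can be sketched with Python’s standard urllib.robotparser module. This is only an illustration: the robots.txt contents, the /private/ path, and the www.example.com URLs are assumptions made up for this sketch, and Python’s parser is a stand-in for, not a copy of, Googlebot’s own robots.txt handling.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt: an empty Disallow lets Googlebot crawl everything.
ALLOW_ALL = """\
User-agent: Googlebot
Disallow:
"""

# Hypothetical robots.txt: keep Googlebot out of one directory.
BLOCK_PRIVATE = """\
User-agent: Googlebot
Disallow: /private/
"""

allow_all = build_parser(ALLOW_ALL)
block_private = build_parser(BLOCK_PRIVATE)

# Example URLs (assumed for illustration only).
private_url = "https://www.example.com/private/page.html"
public_url = "https://www.example.com/index.html"

print(allow_all.can_fetch("Googlebot", private_url))      # crawl allowed
print(block_private.can_fetch("Googlebot", private_url))  # crawl disallowed
print(block_private.can_fetch("Googlebot", public_url))   # crawl allowed
```

Checking a draft robots.txt this way before publishing it can catch a rule that blocks more (or less) of the site than intended.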
Names of Google’s Crawlers
|Crawler||User Agent Token||Full user agent string (as seen in website log files)|
|Googlebot (Google Web search)||Googlebot||Mozilla/5.0 (compatible; Googlebot/2.1;) or (rarely used) Googlebot/2.1|
|Google Mobile (feature phone)||Googlebot-Mobile|| |
|Google Mobile AdSense||Mediapartners-Google||[various mobile device types] (compatible; Mediapartners-Google/2.1;)|
|Google AdsBot landing page quality check||AdsBot-Google||AdsBot-Google|
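To see which of these crawlers appear in your own log files, you can look for the user agent tokens from the table inside each logged user agent string. The following is a minimal sketch: the function name identify_google_crawler and the sample strings are hypothetical, and the match relies only on the tokens listed above. A user agent string can be set to anything by the client, so treat a match as a hint rather than proof.

```python
from typing import Optional

# Tokens from the table above. More specific tokens come first, because
# "Googlebot-Mobile" also contains the shorter token "Googlebot".
CRAWLER_TOKENS = [
    "Googlebot-Mobile",
    "Mediapartners-Google",
    "AdsBot-Google",
    "Googlebot",
]


def identify_google_crawler(user_agent: str) -> Optional[str]:
    """Return the first matching crawler token, or None if none matches."""
    lowered = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


# Sample user agent strings as they might appear in a log file.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1;)",
    "Googlebot/2.1",
    "AdsBot-Google",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
for ua in samples:
    print(ua, "->", identify_google_crawler(ua))
```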
Understanding the Difference Between Google Crawling and Indexing Web Pages
You can use robots.txt directives to disallow Google’s access to certain parts of your website. However, if Google finds those URLs some other way (perhaps through your internal linking structure or through external backlinks), it can still index them, even though you disallowed crawling in robots.txt.
If this has already happened to some of your web pages, visit the Google Webmaster Tools tutorial, which explains how to remove URLs from Google’s search engine results pages.
Knowing that, if you want to keep Google from indexing certain web pages on your site, use this meta tag in the <head> of each such page:
<meta name="Googlebot" content="noindex">
IMPORTANT: Use the noindex directive only on web pages that you don’t want Google to index. For example, if the web page you want kept out of the index is named samplewebpage.html, place the above code only on that page and not on others. Googlebot has to crawl a page to see the tag, so do not also block that page in robots.txt.
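Before publishing, you may want to confirm that the tag is on the pages you intended and nowhere else. The sketch below uses Python’s standard html.parser module to look for a noindex directive in a page’s meta tags. The class and function names are hypothetical, and the check also accepts the generic name="robots" tag, which is an assumption of this sketch rather than something the text above covers.

```python
from html.parser import HTMLParser


class NoindexFinder(HTMLParser):
    """Records whether a meta tag asks Google not to index the page."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        # "robots" is included as an assumption: it addresses all crawlers.
        if name in ("googlebot", "robots") and "noindex" in directives:
            self.noindex = True


def blocks_google_indexing(html: str) -> bool:
    """Return True if the HTML contains a noindex meta tag aimed at Google."""
    finder = NoindexFinder()
    finder.feed(html)
    finder.close()
    return finder.noindex


# Hypothetical page content, using the tag from the text above.
page = '<html><head><meta name="Googlebot" content="noindex"></head><body></body></html>'
print(blocks_google_indexing(page))
```

Running such a check over every page of a site would flag any page that carries the tag by mistake.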