How Google Web Crawler Works

Have you ever wanted to know how Google’s web crawlers see your web pages? Curious about what happens when Google requests a web page? If so, let’s learn about the crawling process of the Google search engine.

Since Google is a search engine for many different media types, it has different crawlers for different purposes.

For general web search, the crawler that visits your website is Googlebot, and Googlebot honors the directives you place in your robots.txt file. For example:

User-agent: Googlebot
Disallow:

The above tells Googlebot that it can crawl your entire website (think of the empty Disallow as saying "I disallow nothing"). But what if you want to tell Google that certain parts of your website shouldn’t be crawled? Then you simply disallow the file or folder path like this:

User-agent: Googlebot
Disallow: /foldernametonotcrawl/
Disallow: /filenametonotcrawlthankyoupage.html
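One way to sanity-check rules like these before publishing them is to feed them to a robots.txt parser and ask whether a given URL may be fetched. The sketch below uses Python's standard urllib.robotparser module. The helper name googlebot_can_crawl and the example.com URLs are placeholders chosen for illustration, and Python's parser may not match Googlebot's own behavior in every edge case, so treat the result as a rough check only.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /foldernametonotcrawl/
Disallow: /filenametonotcrawlthankyoupage.html
"""


def googlebot_can_crawl(robots_txt: str, url: str) -> bool:
    """Return True if the given rules let Googlebot fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    # example.com is a placeholder domain used only for illustration.
    for url in (
        "https://example.com/",
        "https://example.com/foldernametonotcrawl/page.html",
        "https://example.com/filenametonotcrawlthankyoupage.html",
    ):
        verdict = "allowed" if googlebot_can_crawl(ROBOTS_TXT, url) else "blocked"
        print(url, "->", verdict)
```

Passing the first example (a Disallow line with no path) to the same helper reports every URL as allowed, which matches the "disallow nothing" reading.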

Names of Google’s Crawlers

Crawler	User Agent Token	Full user agent string (as seen in website log files)
Googlebot (Google Web search)	Googlebot	Mozilla/5.0 (compatible; Googlebot/2.1;) or (rarely used): Googlebot/2.1
Googlebot Images	Googlebot-Image	Googlebot-Image/1.0
Googlebot Video	Googlebot-Video	Googlebot-Video/1.0
Googlebot News	Googlebot-News	Googlebot-News
Google Mobile (feature phone)	Googlebot-Mobile	SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1;) or DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;)
Google Smartphone	Googlebot	Currently: Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1;) Beginning mid-April, 2016: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1;)
Google AdSense	Mediapartners-Google	Mediapartners-Google
Google Mobile AdSense	Mediapartners-Google	[various mobile device types] (compatible; Mediapartners-Google/2.1;)
Google AdsBot landing page quality check	AdsBot-Google	AdsBot-Google
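When reading website log files, the tokens in the table can be used to label which Google crawler a request claims to come from. Here is a minimal sketch of that idea. The function name google_crawler_token is made up for this example, and the matching is a plain case-insensitive substring search. User agent strings are easy to fake, so a match only tells you what the visitor claims to be.

```python
# User agent tokens from the table above. More specific tokens come first,
# because "Googlebot" is a prefix of tokens such as "Googlebot-Image".
GOOGLE_CRAWLER_TOKENS = (
    "Googlebot-Image",
    "Googlebot-Video",
    "Googlebot-News",
    "Googlebot-Mobile",
    "Mediapartners-Google",
    "AdsBot-Google",
    "Googlebot",
)


def google_crawler_token(user_agent: str):
    """Return the first matching crawler token, or None if none matches."""
    lowered = user_agent.lower()
    for token in GOOGLE_CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


if __name__ == "__main__":
    # Sample strings taken from the table, plus one ordinary browser-like string.
    samples = (
        "Mozilla/5.0 (compatible; Googlebot/2.1;)",
        "Googlebot-Image/1.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    )
    for sample in samples:
        print(google_crawler_token(sample), "<-", sample)
```

Note that Google Smartphone shares the Googlebot token with desktop web search, so this sketch cannot tell the two apart.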

Understanding the Difference Between Google Crawling and Indexing Web Pages

You can use robots.txt directives to disallow Google from accessing certain parts of your website. However, if Google finds those URLs some other way (perhaps through your internal linking structure or through external backlinks), it can still index them even though you disallowed them in robots.txt.

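If this has already happened to some of your web pages, first remove the robots.txt directives that block them, because robots.txt only controls crawling and NOT indexing. Googlebot has to be able to crawl a page before it can see any indexing instruction placed on that page.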
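The conflict described here can be spotted automatically: any page that you want kept out of the index but that is also disallowed in robots.txt is a page whose indexing instructions Googlebot cannot read. The sketch below lists such pages using Python's standard urllib.robotparser. The function name blocked_noindex_urls, the /private/ folder and the example.com URLs are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_noindex_urls(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the URLs that robots.txt stops the crawler from fetching.

    noindex_urls is the list of pages you intend to keep out of the index.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical rules and pages, for illustration only.
    robots_txt = "User-agent: Googlebot\nDisallow: /private/\n"
    pages = [
        "https://example.com/private/thankyou.html",
        "https://example.com/samplewebpage.html",
    ]
    for url in blocked_noindex_urls(robots_txt, pages):
        print("Remove the Disallow rule so Googlebot can crawl:", url)
```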

Knowing that, if you want to stop Google from indexing certain web pages on your site, use this meta tag:

<head>
<meta name="googlebot" content="noindex">
</head>

For WordPress CMS, use this format to control indexing for certain pages:

<head>
<?php if ( is_page('PageName') ) : ?>
<meta name="googlebot" content="noindex">
<?php endif; ?>
</head>

IMPORTANT: use the noindex directive only on web pages that you do NOT want Google to index. For example, if the web page that I don’t want Google to index is named samplewebpage.html, then I would place the above code only on that page and not on others. If you get this wrong by setting noindex on all your web pages, your entire website can be de-indexed by Google.
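To guard against placing noindex on the wrong pages, you can scan each page's HTML and list the ones that carry the directive. The sketch below uses Python's standard html.parser module and assumes the tag takes the form of a meta element named "googlebot" or "robots" whose content includes "noindex". The class and function names are made up for this example, and it does not look at HTTP headers, which can also carry indexing directives.

```python
from html.parser import HTMLParser


class NoindexFinder(HTMLParser):
    """Records whether a page carries a noindex meta tag that Googlebot obeys."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        # "robots" addresses all crawlers, "googlebot" only Google's.
        if name in ("robots", "googlebot"):
            directives = [part.strip() for part in content.split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a noindex meta tag."""
    finder = NoindexFinder()
    finder.feed(html)
    finder.close()
    return finder.noindex


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    pages = {
        "samplewebpage.html": '<head><meta name="googlebot" content="noindex"></head>',
        "index.html": "<head><title>Home</title></head>",
    }
    for name, html in pages.items():
        print(name, "-> noindex" if has_noindex(html) else "-> indexable")
```

Running a check like this across a whole site makes an accidental site-wide noindex easy to notice before Google does.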


At the end of the day, whether your website has a small number of pages or is a medium to large sized website, using robots.txt directives together with XML sitemaps and meta tags for indexation control will give you a better optimized website.


By RankYa


2 comments

Jeff says:

March 9, 2017 at 7:38 am

Love this course. A little suggestion here: maybe it would be better to write a CSS overflow table for the Name of Crawlers heading.
As I am using my phone to browse, the table above does not overflow,
meaning it’s out of your blog width on mobile devices.
Thank you for being generous enough to provide such a wonderful resource.
Rankya will be my favourite hangout Seo blog from now.
Thank you once again.

1. RankYa says:
  
  March 11, 2017 at 7:17 am
  
  Thank you Jeff, you are indeed experienced, as your CSS overflow suggestion for the table is great. Much appreciated.
  
