How Google Web Crawler Works

Have you ever wanted to know how Google’s web crawlers see your web pages? Curious about what happens when Google requests a web page? If so, let’s learn about the crawling process of the Google search engine.

Since Google is a search engine for many different media types, it also has different crawlers for different purposes.

For general web search, the crawler is Googlebot, and it honors the directives you place in your website’s robots.txt file. For example:

User-agent: Googlebot
Disallow:

This tells Googlebot that it can crawl your entire website. But what if you want to tell Google that certain parts of your website shouldn’t be crawled? Then you would use these directives:

User-agent: Googlebot
Disallow: /foldernametonotcrawl/
Disallow: /filenametonotcrawlthankyoupage.html
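
If you want to check what rules like these permit before you publish them, here is a minimal sketch using Python’s standard urllib.robotparser module. The helper name googlebot_may_crawl and the sample paths are made up for illustration, and Python’s parser uses simple prefix matching, so it may not behave exactly like Googlebot’s own parser in every case.

```python
from urllib.robotparser import RobotFileParser

# The directives from the example above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /foldernametonotcrawl/
Disallow: /filenametonotcrawlthankyoupage.html
"""


def googlebot_may_crawl(path, robots_txt=ROBOTS_TXT):
    """Return True if the given rules allow Googlebot to crawl the path.

    Hypothetical helper for illustration; not part of any Google tool.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", path)


# Print the verdict for a few sample paths (the paths are assumptions).
for sample_path in (
    "/",
    "/foldernametonotcrawl/page.html",
    "/filenametonotcrawlthankyoupage.html",
):
    print(sample_path, googlebot_may_crawl(sample_path))
```

Anything under the disallowed folder and the disallowed file should come back as not crawlable, while the rest of the site stays open to Googlebot.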

Names of Google’s Crawlers

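Each entry lists the crawler, its user agent token, and its full user agent string (as seen in website log files).

Googlebot (Google Web search)
  • User agent token: Googlebot
  • Full user agent string: Mozilla/5.0 (compatible; Googlebot/2.1;)
  • Full user agent string (rarely used): Googlebot/2.1

Googlebot Images
  • User agent token: Googlebot-Image
  • Full user agent string: Googlebot-Image/1.0

Googlebot Video
  • User agent token: Googlebot-Video
  • Full user agent string: Googlebot-Video/1.0

Googlebot News
  • User agent token: Googlebot-News
  • Full user agent string: Googlebot-News

Google Mobile (feature phone)
  • User agent token: Googlebot-Mobile
  • Full user agent string: SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/ (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1;)
  • Full user agent string: DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;)

Google Smartphone
  • User agent token: Googlebot
  • Full user agent string (currently): Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1;)
  • Full user agent string (beginning mid-April 2016): Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1;)

Google AdSense
  • User agent token: Mediapartners-Google
  • Full user agent string: Mediapartners-Google

Google Mobile AdSense
  • User agent token: Mediapartners-Google
  • Full user agent string: [various mobile device types] (compatible; Mediapartners-Google/2.1;)

Google AdsBot (landing page quality check)
  • User agent token: AdsBot-Google
  • Full user agent string: AdsBot-Google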
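
To see which of these crawlers is visiting your site, you can match the user agent tokens above against the user agent strings in your log files. The sketch below is one simple way to do that; the function name identify_google_crawler is hypothetical, and it only reports which crawler a request claims to be, since a user agent string can be faked by anyone.

```python
# User agent tokens from the list above, most specific first, so that
# "Googlebot-Image" is not reported as plain "Googlebot".
CRAWLER_TOKENS = [
    "Googlebot-Image",
    "Googlebot-Video",
    "Googlebot-News",
    "Googlebot-Mobile",
    "Mediapartners-Google",
    "AdsBot-Google",
    "Googlebot",
]


def identify_google_crawler(user_agent):
    """Return the first token found in the user agent string, else None.

    Hypothetical helper for illustration; matching is case-insensitive.
    """
    lowered = user_agent.lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


# Sample strings copied from the list above.
print(identify_google_crawler("Mozilla/5.0 (compatible; Googlebot/2.1;)"))
print(identify_google_crawler("Googlebot-Image/1.0"))
```

Note that the smartphone crawler and the regular web search crawler share the Googlebot token, so this check alone cannot tell them apart.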

Understanding the Difference Between Google Crawling and Indexing Web Pages

You can use robots.txt directives to disallow Google from crawling certain parts of your website. However, if Google can still find those URLs (perhaps through your internal linking structure or through external backlinks), it may still index them even though you disallowed them in robots.txt.

If this has already occurred for some of your web pages, you will need to remove those URLs from Google’s search engine results pages.

Knowing that, if you want to prevent Google from indexing certain web pages on your site, use this meta tag:

<meta name="Googlebot" content="noindex">

IMPORTANT: use the noindex directive only on web pages that you don’t want Google to index. For example, if the web page that I don’t want Google to index is named samplewebpage.html, then I would place the above code only on that page and not on others. Googlebot also has to crawl a page to see this tag, so don’t disallow the same page in robots.txt.
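
Because the tag should appear only on the pages you mean to keep out of the index, it can help to check your HTML for it. Here is a rough sketch using Python’s standard html.parser module. The names NoindexFinder and has_noindex are invented for this example, and the check also accepts the generic name="robots" form of the tag, which is an addition to the Googlebot-specific tag shown above.

```python
from html.parser import HTMLParser


class NoindexFinder(HTMLParser):
    """Records whether a page carries a noindex meta tag for Googlebot."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        # "googlebot" targets Google only; "robots" (an assumption added
        # here) is the generic form that applies to all crawlers.
        if name in ("googlebot", "robots") and "noindex" in directives:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML contains a noindex meta tag (hypothetical helper)."""
    finder = NoindexFinder()
    finder.feed(html)
    return finder.noindex


page = '<html><head><meta name="Googlebot" content="noindex"></head></html>'
print(has_noindex(page))
```

Running this over every page of a site is a quick way to spot a noindex tag that was copied onto a page by mistake.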

Here’s a Video Lesson That Explains the Google Crawling Process

At the end of the day, whether your website has a small number of pages or is a medium to large sized website, using robots.txt directives together with XML sitemaps and meta tags for indexation control will give you a better optimized website.

Author: RankYa

Online Entrepreneur, Qualified Web Developer, Google AdWords and Google Analytics Professional. Specialist in: Google SEO, Website Optimization, WordPress, Structured Data, JSON-LD, Microdata, Microformats, RDF, Vocabulary, HTML5, Advanced Image Optimization, Google Search Console, Google Webmaster Guidelines, Social Media Marketing, Facebook marketing and YouTube video ranking.
