Ultimate Guide to robots.txt: How to Optimise Your Site’s Crawl and Indexing

In SEO, managing how search engines interact with your site is crucial, and the robots.txt file is a key tool in this process. This simple text file guides search engine bots on which pages to crawl and index. Properly configuring robots.txt can enhance your site’s visibility and prevent indexing of unnecessary sections. In this guide, we’ll cover the basics of robots.txt, how to set it up correctly, and best practices to optimise your site’s crawl efficiency and SEO performance.

By Oscar Mărginean, Co-founder & Tech Mage at Shelf Wizard. He spends his time figuring out how to make software do things so you won't have to.

August 9, 2024

What is the robots.txt file and how does it work?

The robots.txt file is a simple text file placed in the root directory of your website, typically found at https://www.example.org/robots.txt. It serves as a set of instructions for search engine bots, guiding them on which parts of your site they should or should not crawl.

When a crawler visits your site, it first checks the robots.txt file to understand which areas it is allowed or restricted from accessing. Proper configuration of this file helps manage your site’s crawl efficiency and can enhance your SEO by ensuring that search engines index only the most relevant pages.
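
At its simplest, a robots.txt file is just a short list of plain-text rules grouped by crawler. A minimal sketch (the /private/ path is purely illustrative) looks like this:

# Hypothetical example - one group that applies to every crawler
User-agent: *
Disallow: /private/
Sitemap: https://www.example.org/sitemap.xml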

The anatomy of a robots.txt file

The key elements of a robots.txt file

The robots.txt file is structured with specific components that guide how web crawlers interact with your website. Here’s a breakdown of its key elements:

User-agent

Specifies which search engine bot or crawler the rules that follow apply to. This lets you give different bots different instructions, for example restricting certain directories for one crawler or adjusting the crawl speed for another, so you can manage how each bot interacts with your site (a short sketch follows the list of common crawlers below).

Example: 

User-agent: Googlebot

The most common User-Agent strings in robots.txt files correspond to popular web crawlers and search engine bots. Here are some of the most frequently encountered ones:

  • Googlebot - Google’s web crawler for indexing pages on Google Search.

  • Storebot-Google - Google’s web crawler for Google Shopping ads.

  • Googlebot-Image - Google’s web crawler specifically for indexing images.

  • Googlebot-News - Google’s web crawler for indexing news articles.

  • Bingbot - Bing’s web crawler for indexing pages on Bing Search.

  • AdIdxBot - Bing’s crawler for verifying links which are advertised on Bing Ads.

  • Slurp - Yahoo’s web crawler (though Yahoo Search now uses Bing’s technology).

  • Baiduspider - Baidu’s web crawler for indexing pages on Baidu Search.

  • YandexBot - Yandex’s web crawler for indexing pages on Yandex Search.

  • DuckDuckBot - DuckDuckGo’s web crawler used for indexing pages for DuckDuckGo Search.
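
Rules are grouped per crawler: each User-agent line opens a new group, and a bot follows the most specific group that names it, falling back to the * group otherwise. A minimal sketch, using hypothetical paths, that gives Googlebot-Image its own rules:

# General rule for all crawlers (hypothetical /private/ path)
User-agent: *
Disallow: /private/

# Googlebot-Image follows this group instead of the one above
User-agent: Googlebot-Image
Disallow: /images/drafts/
Allow: /images/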

Disallow

The Disallow directive instructs crawlers not to access certain pages or directories. It keeps them away from duplicate or low-quality content and stops crawl budget being wasted on it, both of which can hurt SEO performance. Keep in mind that robots.txt controls crawling rather than indexing: a disallowed URL can still appear in search results if other sites link to it, so use a noindex directive when a page must stay out of the index entirely.

Example:

Disallow: /private/
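
Major crawlers such as Googlebot and Bingbot also understand simple wildcards in Disallow paths: * matches any sequence of characters and $ anchors the end of a URL. A sketch using hypothetical faceted-navigation parameters and a file-type pattern:

# Hypothetical patterns - adjust to your own URL structure
Disallow: /*?sort=
Disallow: /*&filter=
Disallow: /*.pdf$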

Allow

Overrides a Disallow directive to permit access to specific pages within a disallowed directory. This is particularly useful when combined with the User-Agent directive to only allow some crawlers to access certain parts of your website.

Example:

Allow: /private/allowed-page.html
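
In context, the two directives work as a pair inside a group; for Google, the most specific (longest) matching rule wins, so the single page stays crawlable while the rest of the directory remains blocked. A sketch using the same hypothetical paths:

User-agent: *
# Block the directory, but leave one page inside it crawlable
Disallow: /private/
Allow: /private/allowed-page.html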

Sitemap

Specifies the URL of your XML sitemap, helping crawlers find and index all key pages on your site. Even if your sitemap sits at the conventional /sitemap.xml location, including this directive is considered good practice, and it is the simplest way to point crawlers at sitemaps hosted elsewhere.

Example:

Sitemap: https://www.example.org/sitemap.xml
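
The directive can appear more than once if your site exposes several sitemaps, for example a sitemap index alongside a dedicated product sitemap (both URLs below are hypothetical):

Sitemap: https://www.example.org/sitemap_index.xml
Sitemap: https://www.example.org/product-sitemap.xml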

Crawl-delay

Specifies the time interval (in seconds) a crawler should wait between successive requests to your server. Spacing out requests keeps server resources available and the site responsive for human visitors, which is particularly useful for sites with limited server capacity or high traffic: it balances the load while still allowing search engines to crawl and index the content effectively. Note that support varies between crawlers; Bingbot and YandexBot honour Crawl-delay, while Google’s crawlers ignore it.

Example:

Crawl-delay: 30
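
Because the directive is read per User-agent group, you can give different crawlers different delays. A sketch with two crawlers that honour Crawl-delay:

# Bing may fetch at most one page every 10 seconds
User-agent: Bingbot
Crawl-delay: 10

# Yandex waits 30 seconds between requests
User-agent: YandexBot
Crawl-delay: 30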

Comments

Allows you to include notes or explanations within the file, which are ignored by crawlers. This is primarily useful for the maintainers of the robots.txt file.

Example:

# This is a comment
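
Comments start with # and run to the end of the line, so they can stand alone or follow a directive. A short sketch (the /search/ path is hypothetical):

# Keep crawlers out of internal search results
User-agent: *
Disallow: /search/  # end-of-line comments are ignored by crawlers too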

An example robots.txt file for Google Shopping Ads

Now that we have seen the building blocks of the robots.txt file, let’s explore a practical example of a robots.txt file specifically tailored for Google Shopping Ads. This example will illustrate how to strategically configure your robots.txt to ensure that relevant product data is properly crawled and indexed, while sensitive or non-essential areas are appropriately restricted. Understanding and implementing these directives can enhance your site's visibility in Google Shopping results and optimise the overall performance of your ad campaigns.

User-agent: *
Disallow: /checkout/
Disallow: /cart/
Disallow: /my-account/
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /temporary/
Allow: /products/
Allow: /category/
Allow: /images/
Allow: /css/
Allow: /js/
Crawl-delay: 30
Sitemap: https://www.example.org/sitemap.xml

# Google
User-agent: Storebot-Google
Crawl-delay: 10

User-agent: Nutch
Disallow: /

The line-by-line breakdown of our robots.txt file

User-agent: * - Targets all crawlers:

  • Disallow Directives:

    • /checkout/: Blocks access to the checkout page to prevent indexing of sensitive transaction-related content.

    • /cart/: Blocks the shopping cart page.

    • /my-account/: Prevents access to user account areas.

    • /admin/: Blocks access to administrative sections of the site.

    • /cgi-bin/: Prevents access to any CGI scripts.

    • /private/ and /temporary/: Blocks directories that contain temporary or private files.

  • Allow Directives:

    • /products/: Ensures product pages are accessible for indexing.

    • /category/: Allows crawling of category pages where products are listed.

    • /images/: Allows access to image files, which are important for product listings.

    • /css/ and /js/: Ensures CSS and JavaScript files are accessible so the pages are rendered correctly.

  • Crawl-delay Directive:

    • Crawl-delay: 30: Ensures that the website is not crawled too aggressively; depending on your hosting provider, this can prevent server load issues or excessive costs.

  • Sitemap Directive:

    • Sitemap: https://www.example.org/sitemap.xml: Provides the URL of the XML sitemap, helping crawlers discover all important product pages and other relevant content.

User-agent: Storebot-Google - Targets the Google Shopping crawler:

  • Crawl-delay: 10: Defines a separate group for the Google Shopping crawler with a shorter delay than the general rule. For larger inventories, it is recommended to keep the delay for this crawler as short as possible so that any errors are identified and corrected promptly.

User-agent: Nutch - Targets the Apache Nutch crawler:

  • Disallow: /: Instructs the crawler not to crawl the website at all. Note that compliance with robots.txt is voluntary: well-behaved crawlers such as Googlebot, Bingbot, Slurp and Apache Nutch respect these instructions, but not all crawlers do.

Configuring the robots.txt file for your website

How to set up the robots.txt file for Magento 2

  1. Access the Admin Panel:

    • Log in to your Magento admin panel.

  2. Navigate to the Configuration:

    • Go to Stores > Configuration.

    • Under General, select Design.

  3. Edit the robots.txt Settings:

    • Expand the Search Engine Robots section.

    • Here you can customise the robots.txt file content.

  4. Example Configuration:

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /customer/
Disallow: /catalogsearch/
Disallow: /wishlist/
Allow: /pub/static/
Allow: /pub/media/
Sitemap: https://www.example.org/sitemap.xml

  5. Save the Configuration:

    • Click Save Config to apply the changes.

How to set up the robots.txt file for PrestaShop

  1. Access the Admin Panel:

    • Log in to your PrestaShop admin panel.

  2. Generate robots.txt:

    • Go to Preferences > SEO & URLs.

    • Scroll down to the robots.txt file generation section.

  3. Generate the File:

    • Click Generate robots.txt file.

  4. Example Configuration: PrestaShop automatically generates a robots.txt file, but you can manually edit it if needed.

User-agent: *
Disallow: /admin/
Disallow: /classes/
Disallow: /config/
Disallow: /download/
Disallow: /mails/
Disallow: /modules/
Disallow: /translations/
Disallow: /tools/
Sitemap: https://www.example.org/1_index_sitemap.xml

  5. Save and Verify:

    • Ensure the robots.txt file is saved in your site’s root directory and verify it’s accessible at https://www.example.org/robots.txt.

How to set up the robots.txt file for Shopify

  1. Access the Theme Editor:

    • Log in to your Shopify admin panel.

    • Go to Online Store > Themes > Actions > Edit Code.

  2. Create/Edit robots.txt:

    • Shopify generates a robots.txt file automatically. To customise it, scroll to the Templates section of the code editor, click Add a new template, and select the robots.txt template type. Shopify creates a robots.txt.liquid template that controls what is served at https://www.example.org/robots.txt.

  3. Example Configuration:

    • The template renders Shopify’s default rules via Liquid; add or adjust directives in it as needed. The defaults already cover rules along these lines:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /orders/
Sitemap: https://www.example.org/sitemap.xml

  4. Save and Verify:

    • Save your changes and verify the robots.txt file by visiting https://www.example.org/robots.txt.

How to set up the robots.txt file for WordPress / WooCommerce 

  1. Install and Activate an SEO Plugin:

    • Install an SEO plugin like Yoast SEO or All in One SEO Pack to easily manage your robots.txt.

  2. Access the Plugin Settings:

    • Navigate to the SEO plugin settings in your WordPress dashboard.

  3. Edit robots.txt:

    • In Yoast SEO, go to SEO > Tools > File editor.

    • In All in One SEO Pack, go to All in One SEO > Feature Manager > Robots.txt.

  4. Example Configuration:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Sitemap: https://www.example.org/sitemap.xml

  5. Save Changes:

    • Save your changes and verify the robots.txt file by visiting https://www.example.org/robots.txt.

Conclusion

Setting up a robots.txt file is a critical step for managing how search engines interact with your e-commerce site, whether you use Magento, PrestaShop, Shopify, or WordPress / WooCommerce. Proper configuration helps ensure that important pages are indexed while keeping sensitive or irrelevant sections hidden from crawlers. By following the steps outlined for each platform, you can optimise your site’s crawl efficiency and enhance your SEO performance.
