

Changing Robots.txt in WordPress
The robots.txt file is a simple text file used to guide search engine crawlers on which pages or directories they can and cannot access on a website. It is part of the Robots Exclusion Protocol (REP) and is usually placed in the root directory of a website (e.g., https://little-fire.com/robots.txt). While it does not enforce rules (since crawlers can choose to ignore it), major search engines like Google, Bing, and Yahoo typically respect its directives.
Webmasters use robots.txt to prevent search engines from indexing certain pages, such as login areas, admin dashboards, or duplicate content pages. It is also useful for ensuring that search engines focus on important content instead of unnecessary pages.
Example 1: Blocking All Crawlers from a Specific Directory
User-agent: *
Disallow: /admin/
Disallow: /accounts/
In this example, the User-agent: * line applies the rules to all crawlers, while Disallow: /admin/ and Disallow: /accounts/ prevent them from accessing anything inside the /admin/ and /accounts/ directories.
This Does Not Secure These Directories!
This is not a security measure: it is simply asking crawlers not to enter. Any malicious bot will waltz right in and hoover up anything it can find!
Sensitive information must be protected with authentication rather than relying solely on robots.txt to hide pages from search engines.
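To make that concrete, real protection for a directory means authentication at the server or application level. A minimal sketch using Apache basic auth – the paths and realm name here are purely illustrative:
# Example only: dropped into the protected directory's .htaccess
AuthType Basic
AuthName "Restricted area"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
With that in place, nothing is served from the directory without valid credentials, whatever robots.txt may or may not say.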
This is Incredibly Useful!
Not everyone wants everything to appear in Google. Some files are temporary and some don’t contain human-readable content.
Google has what is known as a crawl budget – an allowance of pages it will crawl in a given time. If you don’t want it to waste that budget, you can exclude sections of your site entirely, leaving Google to focus on your portfolio, case studies or cat pictures.
Example 2: Allowing Googlebot but Blocking Others
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Here, Googlebot is allowed to crawl the entire site, while all other crawlers are blocked.
So What has This Got to Do With WordPress?
In WordPress, robots.txt used to be exactly that: a little text file sitting in the root directory of your website.
If you call: https://little-fire.com/robots.txt
It still looks like it’s there: you’ll find some instructions for the bots and crawlers.
But, mysteriously, if you go looking in the file structure of a WordPress site, you won’t find a file with that name.

So Where is robots.txt and What Happened to It?
Like many Content Management Systems, WordPress allows developers to add functionality – e.g. WooCommerce or Yoast. Each has its own requirements and so, rather than allowing multiple authors to overwrite a single file (too many cooks does not even begin to cover it), WordPress leverages .htaccess so that requests for robots.txt are handed to PHP and the file is generated on the fly.
In brief, at the end of the default WordPress .htaccess file is the directive:
RewriteRule . /index.php [L]
Which translates roughly as:
“If we haven’t dealt with it yet, bung whatever the user has asked to see at /index.php and see what it makes of it”
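For context, the rewrite block WordPress writes into .htaccess by default looks something like this (your host or plugins may add lines around it):
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
The two RewriteCond lines mean “only if the request is not an existing file or directory” – and since no physical robots.txt exists, the request falls through to index.php, where WordPress works out what was asked for and builds the response itself.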
Inside WordPress there is a hook and filter system which allows developers to run their own code (a filter) when WordPress reaches a particular point (a hook). One of those hooks is robots_txt.
More or less any bit of code can watch for that hook and apply a filter. That way, any developer or coder can contribute to the completed robots.txt. If they don’t know what they are doing, they can also overwrite some or all of it.
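Under the bonnet, when WordPress answers a request for robots.txt it builds a default set of rules and then hands the text to anything listening on that hook. A simplified sketch of what core’s do_robots() does (not the exact source):
function do_robots_simplified() {
	header( 'Content-Type: text/plain; charset=utf-8' );

	// Core builds a sensible default...
	$output  = "User-agent: *\n";
	$output .= "Disallow: /wp-admin/\n";
	$output .= "Allow: /wp-admin/admin-ajax.php\n";

	// ...then lets every attached filter amend it before it is sent to the crawler.
	$public = get_option( 'blog_public' );
	echo apply_filters( 'robots_txt', $output, $public );
}
Our job, then, is simply to get our own rules into that $output before the final echo.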
So Why Does This Matter?
As we said before, not every bit of code wants to be accessible to crawlers. We have server-side security (proper security) measures in place to stop crawlers from stealing contact details – a crawler visiting those protected pages will be served an error which, in turn, damages perceived site health (“this crawler found a site full of errors”) and, consequently, potential SEO rankings.
As a result we need to add a specific directive to our client’s robots file.
How To Do It
Caveat: this is technical and you can really, really break your site. If in doubt, call Little Fire.
This can readily be done by editing the functions.php file in the active theme.
The functions.php file is run every time the page is rendered, so it is an excellent place to add ad-hoc functionality to a single site. A more portable method would be to use a plugin built for this purpose, but that is beyond the scope of this post.
To begin with, add the filter function. This is a PHP function with quite a specific structure.
/**
 * Filter callback: append our own rules to the robots.txt content WordPress has built so far.
 *
 * @param string $output The robots.txt content generated so far.
 * @param bool   $public Whether the site is set to be visible to search engines.
 */
function add_some_robots_commands( $output, $public ) {
	// Add some rules, padded with blank lines so other filters' output is left undisturbed
	$output .= "\n\nDisallow: /cgi/\n\n";
	// Return what you have cooked up so the next filter in the chain can see it
	return $output;
}
The function accepts two variables (or arguments):
- $output – this is all the content generated so far by the process of building robots.txt. Unless you want to undo other developers’ efforts, you want to append to this content, not overwrite it. Including \n\n in your code will put empty lines around your new rule so that, again, other people’s code is left undisturbed.
- $public – this is a boolean, true or false, depending upon whether the site is set to be visible to search engines.
Then add some code to invoke the filter function when a hook is called:
add_filter('robots_txt', 'add_some_robots_commands', 44, 2); // filter to add robots
The four arguments in this function call are:
- robots_txt – the name of the hook
- add_some_robots_commands – the name of the filter function
- 44 – the priority: it determines the order in which our filter is applied relative to all the others when robots_txt is being built; you may wish to experiment
- 2 – the number of arguments this particular filter function is expecting (in this case $output and $public)
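The example above ignores $public, but it is there if you want it. A sketch of a variant (swap it in for the earlier function rather than defining both) that only appends the rule when the site is actually visible to search engines – the /cgi/ path is still just the illustrative rule from before:
function add_some_robots_commands( $output, $public ) {
	// If the site is set to discourage search engines, leave the output untouched
	if ( ! $public ) {
		return $output;
	}
	// Otherwise append our rule, padded so other filters' rules stay intact
	$output .= "\n\nDisallow: /cgi/\n\n";
	return $output;
}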
Save your functions.php file and your robots.txt file should look something like this:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /cgi/
Easy when you know how (or call Little Fire Digital who really do know how).