Fixing robots.txt

What is robots.txt, and why did I want to fix it?

The robots.txt file is a set of directives that tells automated processes (“robots”) visiting your website where you want them to look. Lots of websites have one, and they’re pretty interesting to read. Lots of robots respect them, though of course not all do.

Here are some excerpts from Wikipedia’s robots.txt:

#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /

I’ll let the comments in there speak for themselves. I just think it’s neat to be able to go to a website and read its opinions on which robots are allowed to visit which parts. But until recently, my own website’s robots.txt was returning errors, and I didn’t even know it!
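
If you want to skim a site’s robots.txt yourself, you don’t even need a browser; a quick curl (any HTTP client works, this is just the one I reach for) does the job:

$ curl -s https://en.wikipedia.org/robots.txt | head -20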

How did I notice this problem?

I noticed this problem while messing around on the website. I got what looked like a generic AWS error when trying to go to https://willmurphy.me/robots.txt:

<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>6BFSPA0N0SVSCB0V</RequestId>
<HostId>Vyz80l9JWUs3D95uBnUkw5BDp23MRPkV5xbC8bSvV5TpsMj5vwUrx/f166u4y5Gz0CXyJYf1cvU=</HostId>
</Error>
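
An Access Denied like this from S3 comes back with a 403 status (as I understand it, S3 returns 403 rather than 404 when the object is missing and the caller isn’t allowed to list the bucket). If you want to see the status code for yourself, asking curl to print the response headers does the trick; at the time, this showed the 403 along with the XML above:

$ curl -i https://willmurphy.me/robots.txt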

How did I approach it?

I first read about Hugo’s robots.txt config, and learned that it should be enabled by default. Then I went down a blind alley where I tried to see where the actual assets were by poking around my AWS account looking for S3 buckets and things, but that turned out to be completely the wrong approach.

After this, I went to the site’s settings on the AWS Amplify page of the AWS web console and looked at the build settings:

version: 1
frontend:
  phases:
    build:
      commands:
        - hugo
  artifacts:
    baseDirectory: public
    files:
      - '**/*'
  cache:
    paths: []

This is telling Amplify: for every new commit to main on the git repo that holds the blog, check out the commit, run hugo at the root of the repo, and then make everything that shows up under ./public visible on the internet.

Then I went into my Hugo directory locally and, instead of running hugo server -D to serve the site with drafts (which is how I normally work on the blog), I just ran hugo followed by find . -name robots.txt. Sure enough, Hugo wasn’t generating a robots.txt file.
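
In case a transcript is clearer, the check looked roughly like this (the path is just illustrative):

$ cd ~/src/willmurphy.me    # or wherever the Hugo site lives
$ hugo
$ find . -name robots.txt   # no matches: the file was never generated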

How did I solve it?

Hugo’s docs seem to say that robots.txt will be generated by default. However, my observation was that it’s only generated when explicitly enabled in the config.

The diff to get it working turned out to be super simple:

diff --git a/config.toml b/config.toml
index 725c824..05eee5a 100644
--- a/config.toml
+++ b/config.toml
@@ -1,4 +1,5 @@
 baseURL = 'http://willmurphy.me/'
 languageCode = 'en-us'
 title = "Will Murphy's personal home page"
 theme = "firstTheme"
+enableRobotsTXT = true
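
Before pushing a change like this, it’s easy to sanity-check it locally by rebuilding and looking in the output directory (assuming, as on my setup, that the site builds into ./public):

$ hugo
$ cat public/robots.txt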

What’s it look like now?

$ curl https://willmurphy.me/robots.txt
User-agent: *

The website is not currently being abused by any bots, so I’m not going to tell any of them to go away. If I start getting spam comments in the comment engine, I might have to change this at some point. But for now, robots welcome!
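
If the bots ever do get rude, my understanding is that Hugo will use a layouts/robots.txt template from the site (or its theme) instead of the built-in one, so blocking a particular crawler would look something like this (the bot name here is made up):

# layouts/robots.txt
User-agent: SomeAbusiveBot
Disallow: /

User-agent: *
Disallow: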

Till next week, happy learning!
– Will
