What is robots.txt, and why did I want to fix it?
robots.txt is a plain-text file that tells automated processes (“robots”) visiting your website
where you want them to look. Lots of websites have one. They’re pretty
interesting. Most robots respect them, though of course not all do.
Here are some excerpts from Wikipedia’s robots.txt:
```
#
# Sorry, wget in its recursive mode is a frequent problem.
# Please read the man page and use it properly; there is a
# --wait option you can use to set the delay between hits,
# for instance.
#
User-agent: wget
Disallow: /

# A capture bot, downloads gazillions of pages with no public benefit
# http://www.webreaper.net/
User-agent: WebReaper
Disallow: /
```
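As an aside (my own illustration, not part of the original troubleshooting), Python’s standard-library `urllib.robotparser` shows how a polite crawler interprets rules like the ones above. The rules are inlined here rather than fetched from Wikipedia:

```python
from urllib import robotparser

# A few of the rules from Wikipedia's robots.txt, inlined for the example.
rules = [
    "User-agent: wget",
    "Disallow: /",
    "",
    "User-agent: WebReaper",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# wget and WebReaper are told to stay away entirely...
print(rp.can_fetch("wget", "/wiki/Main_Page"))        # False
print(rp.can_fetch("WebReaper", "/"))                 # False
# ...while agents with no matching rule are allowed by default.
print(rp.can_fetch("SomeOtherBot", "/wiki/Main_Page"))  # True
```

Note that this is purely advisory: the parser tells a crawler what the site owner asked for, and it’s up to the crawler to comply.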
I’ll let the comments in there speak for themselves. I just think it’s neat to be able to go to a website and read its opinions about which robots are allowed to visit which parts. But until recently, my own website’s robots.txt was returning errors, and I didn’t even know it!
How did I notice this problem?
I noticed this problem while messing around with the website. I got what looked like a generic AWS error when trying to go to https://willmurphy.me/robots.txt:
```
<Error>
  <Code>AccessDenied</Code>
  <Message>Access Denied</Message>
  <RequestId>6BFSPA0N0SVSCB0V</RequestId>
  <HostId>Vyz80l9JWUs3D95uBnUkw5BDp23MRPkV5xbC8bSvV5TpsMj5vwUrx/f166u4y5Gz0CXyJYf1cvU=</HostId>
</Error>
```
How did I approach it?
I first read about Hugo’s robots.txt config, and learned that it should be enabled by default. Then I went down a blind alley where I tried to see where the actual assets were by poking around my AWS account looking for S3 buckets and things, but that turned out to be completely the wrong approach.
After this, I went to the settings for the site in the AWS Amplify page in the AWS web console, and looked at what the build command was:
```
version: 1
frontend:
  phases:
    build:
      commands:
        - hugo
  artifacts:
    baseDirectory: public
    files:
      - '**/*'
  cache:
    paths:
```
This tells Amplify: “For every new commit to `main` on the git repo that
holds the blog, check out the commit, run `hugo` in the base directory of
the package, then make everything that shows up under `./public` visible on
the site.”
Then I went into my Hugo directory locally. Instead of running
`hugo server -D` to run a Hugo server showing drafts, which is how I normally work on the
blog, I just ran `hugo` and then
`find . -name robots.txt`. Sure enough, no robots.txt file was being generated.
How did I solve it?
Hugo’s docs seem to say robots.txt will be generated by default. In my experience, however, it was only generated once I explicitly enabled it in the config.
The diff to get it working turns out to be super simple:
```
diff --git a/config.toml b/config.toml
index 725c824..05eee5a 100644
--- a/config.toml
+++ b/config.toml
@@ -1,4 +1,5 @@
 baseURL = 'http://willmurphy.me/'
 languageCode = 'en-us'
 title = "Will Murphy's personal home page"
 theme = "firstTheme"
+enableRobotsTXT = true
```
What’s it look like now?
```
$ curl https://willmurphy.me/robots.txt
User-agent: *
```
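A catch-all `User-agent: *` entry with no `Disallow` rules means everything is allowed. As a quick sanity check (my own sketch, not from the original post), Python’s `urllib.robotparser` reads it the same way:

```python
from urllib import robotparser

# The site's current robots.txt, inlined: one catch-all agent, no Disallow rules.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *"])

# With no Disallow lines, every robot may fetch every path.
print(rp.can_fetch("AnyBot", "/robots.txt"))  # True
print(rp.can_fetch("wget", "/some/page"))     # True
```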
The website is not currently being abused by any bots, so I’m not going to tell any of them to go away. If I start getting spam comments in the comment engine I might have to change this at some point. But for now, robots welcome!
Till next week, happy learning!