Will Murphy's personal home page

fly.io feels pretty magical

In a previous post, I talked about migrating an app from free-tier Heroku to AWS AppRunner. At the time, I was working for AWS and seeing CDK TypeScript in my sleep, so it made sense to try. In this post, I’ll discuss why I am migrating again, this time to fly.io.

I want to emphasize at the outset that this is a hobby project. This is not a how-to for putting real, automatic

First, an overview of the steps, then some thoughts on motivation and a bit of opinion-piece material.

Process overview

Step 0 was to get the app running in a container and reading its configuration from the environment. CloudFoundry, Heroku, and the like have done a good job of making that a standard best practice (see the Twelve-Factor App website). I had already done this to get the app working on AWS AppRunner, so I won’t discuss it here; more details are in my original post.

Step 1 basically consisted of installing the fly CLI tool and running fly launch in the directory where the app was cloned.

Step 2 was to migrate the database from RDS to fly.io. I was pretty nervous about this step, because fly.io spends a lot of time telling you that its offering is not managed postgres (example). I was afraid I wouldn’t be able to get this done, wouldn’t be able to automate backups, or would lose sleep over failover or something. It turned out that I had more work on the RDS side than on the fly.io side, and that setting up backups wasn’t too bad.

Step 3 was configuring auto-pause for the database. This was pretty easy, and was a main reason for the switch. Amazon RDS supposedly offers auto-pause on Aurora Serverless v1 clusters, but these have a number of other goofy limitations that keep them from being useful, and they are no longer the latest-and-greatest offering, so I don’t think they’re getting new versions of PostgreSQL.

Step 1: fly launch

This part was really magical, and reminded me a lot of my time working on CloudFoundry. (“Here is some code. Run it in the cloud for me. I do not care how.”)

  1. Install the fly CLI. Sign up at fly.io if you haven’t already, and then run fly auth login.
  2. Change the app (settings.py, since this is Django) to configure the postgres connection from the environment variables that fly.io sets (see fly.io’s instructions).
  3. Go into the app directory and run fly launch.
  4. Say yes when it asks if you want to change configurations, and then in the web app that comes up, ask for the development postgres.
  5. The CLI will print out a connection string for the new postgres. Copy and paste that stuff into your password manager.
  6. Watch the app come up and crash.
  7. (Optional) Go back and fix the connection string to use the app-specific user and password you created when you were fighting with AppRunner and Aurora on RDS.
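Items 2 and 7 both come down to giving the app a connection string. Below is a minimal shell sketch. It assumes the Django settings read a single `DATABASE_URL` variable, which is the name fly.io uses when it attaches a Postgres app. The app, user, password, host, and database names are hypothetical placeholders. The sketch builds the URL and stores it as an app secret with `fly secrets set`. Setting `DRY_RUN=1` prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: give the web app an app-specific DATABASE_URL on fly.io.
# All names passed in are hypothetical placeholders.

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build a postgres:// URL from its parts. Assumes the user and password
# contain only URL-safe characters; others would need percent-encoding.
make_database_url() {
  local user="$1" password="$2" host="$3" port="$4" dbname="$5"
  printf 'postgres://%s:%s@%s:%s/%s\n' "$user" "$password" "$host" "$port" "$dbname"
}

# Store the URL as a secret on the web app. By default, fly redeploys
# the app's machines so the new secret takes effect.
set_database_url() {
  local app="$1" url="$2"
  run fly secrets set "DATABASE_URL=$url" -a "$app"
}
```

For example, `DRY_RUN=1 set_database_url my-recipe-app "$(make_database_url recipe_app s3cret my-recipe-db.flycast 5432 recipes)"` prints the `fly secrets set` command without touching the app. The host should be whatever the CLI printed in item 5; `my-recipe-db.flycast` is only a placeholder.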

Step 2: Rescue the Data from RDS

Once I decided to migrate from AppRunner+RDS to fly.io + fly.io (running postgres), I had to get the actual data out of RDS and into fly.io. This had a lot of false starts and frustration. There are two big limitations I had to work around, both due to my mistaken belief that Amazon RDS Aurora Serverless v1 was equivalent to regular RDS in 2 ways:

  1. Aurora Serverless cannot export snapshots to S3 (link)
  2. Aurora Serverless does not have the “publicly accessible” boolean setting that lets you set ingress rules on your VPC and connect to the cluster from a local machine.

I am pretty mad about both of these, because the UI doesn’t really report these limitations, so this was a blind alley that took the majority of the project’s actual wall clock time. (For example, I set up a role that had sts:AssumeRole from the RDS snapshots service principal and had permissions to write to my S3 bucket before I learned that this wouldn’t work.)
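Both limitations apply to clusters whose engine mode is `serverless` (Aurora Serverless v1). It may therefore be worth checking a cluster's mode before planning an export. Below is a minimal sketch that calls the AWS CLI's `describe-db-clusters` with a JMESPath `--query` filter on the `EngineMode` field. It assumes the CLI is already configured for the right account and region. `DRY_RUN=1` prints the command instead of calling AWS.

```shell
#!/usr/bin/env bash
# Sketch: find Aurora Serverless v1 clusters (engine mode "serverless"),
# which can't export snapshots to S3 or be made publicly accessible.

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List identifiers of clusters running in serverless (v1) engine mode.
list_serverless_v1_clusters() {
  run aws rds describe-db-clusters \
    --query "DBClusters[?EngineMode=='serverless'].DBClusterIdentifier" \
    --output text
}
```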

Here’s what worked:

  1. Take a manual snapshot of the Aurora Serverless v1 cluster
  2. Restore that snapshot to a regular RDS Postgres cluster
  3. Configure the new RDS cluster with the “publicly accessible” flag
  4. Add an inbound rule to the new RDS cluster’s VPC security group that allows my laptop’s IP address to connect on PostgreSQL’s default port (5432)
  5. Run the pg_dump utility with -F c (for “format: custom”) to get a backup
  6. Try to restore the backup with pg_restore
  7. Create some missing users that RDS had created and that my app needed
  8. Try pg_restore again, and it worked.
  9. Snapshot the new, capable RDS instance and export that snapshot to S3 too, just out of paranoia, and then delete the instance, because we’re doing this to save money, not to leave RDS instances running all day.
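Steps 4 through 8 can be sketched as shell functions. This is a minimal sketch, not the exact commands used here. The security group ID, IP address, host names, user names, role names, and file names are all hypothetical placeholders. The RDS-side snapshot and restore (steps 1 to 3) are left to the console. `DRY_RUN=1` prints each command instead of running it. The roles to pre-create are whichever ones the first `pg_restore` attempt reports as missing.

```shell
#!/usr/bin/env bash
# Sketch of steps 4-8. Every identifier passed in is a hypothetical
# placeholder; set DRY_RUN=1 to print commands instead of running them.

# Print the command instead of running it when DRY_RUN=1.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 4: let only this machine's IP reach Postgres on port 5432.
allow_my_ip() {
  local sg_id="$1" my_ip="$2"
  run aws ec2 authorize-security-group-ingress \
    --group-id "$sg_id" --protocol tcp --port 5432 --cidr "$my_ip/32"
}

# Step 5: custom-format dump (-F c), which pg_restore can read.
dump_rds() {
  local host="$1" user="$2" dbname="$3" outfile="$4"
  run pg_dump -h "$host" -p 5432 -U "$user" -F c -f "$outfile" "$dbname"
}

# Step 7: create a role that the dump references but the target lacks.
# Options default to NOLOGIN; pass LOGIN for a role the app connects as.
create_role() {
  local target_url="$1" role="$2" opts="${3:-NOLOGIN}"
  run psql "$target_url" -c "CREATE ROLE \"$role\" $opts"
}

# Steps 6 and 8: restore the archive into an existing target database.
restore_dump() {
  local target_url="$1" infile="$2"
  run pg_restore -d "$target_url" "$infile"
}
```

A role the app logs in as needs `LOGIN`, with a password set separately: for example, `create_role "$TARGET_URL" recipe_app LOGIN`. A different approach, not the one used above, is `pg_restore --no-owner --no-privileges`. It skips ownership and grants, so missing roles don't cause errors, but the restored objects end up owned by the user running the restore.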

Step 3: Auto-pause the Database

I’m using a development postgres instance to save money, because the project being migrated is very much a hobby project. To do that, I want to scale the postgres app to zero when I’m not using it. (This desire is what started the whole stupid misadventure with Aurora Serverless.)

fly.io at least has official docs on scaling postgres down to zero.

Here’s what worked:

  1. fly config save -a $APP_NAME_OF_DB
  2. Edit fly.toml and add FLY_SCALE_TO_ZERO = '1h' to the [env] section
  3. Run fly deploy . --image flyio/postgres-flex:15.6

This made me sort of nervous, because I couldn’t figure out how to get the postgres app on fly.io to tell me what image it was running, so I wasn’t confident that I wasn’t performing an accidental upgrade. I went with it because I’d just backed everything up every which way. In retrospect, I could probably have connected to the postgres server and asked which version it was running.
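That version check could be sketched as below. The sketch assumes a `fly proxy 5454:5432` to the database app is already running, as in the backup script later in this post, and that `PGPASSWORD` is set. `SHOW server_version` is standard PostgreSQL. The helper names, and the idea of comparing only the major version against the image tag, are illustrative assumptions, not something fly.io provides.

```shell
#!/usr/bin/env bash
# Sketch: compare the running server's major version with the image tag
# about to be deployed, so a deploy can't silently change major versions.

# Major version from `SHOW server_version` output, e.g. "15.6 (Debian ...)".
server_major() {
  local v="${1%% *}"   # drop everything after the first space
  printf '%s\n' "${v%%.*}"
}

# Major version from an image reference such as flyio/postgres-flex:15.6.
image_major() {
  local tag="${1##*:}"   # keep only the text after the last colon
  printf '%s\n' "${tag%%.*}"
}

# Succeeds only when both major versions match.
same_major() {
  [[ "$(server_major "$1")" == "$(image_major "$2")" ]]
}

# Ask the server for its version through the local proxy on port 5454.
query_server_version() {
  psql -h localhost -p 5454 -U postgres -tAc 'SHOW server_version'
}
```

With these helpers, `same_major "$(query_server_version)" flyio/postgres-flex:15.6 && fly deploy . --image flyio/postgres-flex:15.6` only deploys when the major versions agree.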

This worked: now my database sleeps if it’s been idle for more than an hour.
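The hand edit in item 2 of the list above can also be scripted. The sketch below assumes that the fly.toml written by fly config save has its `[env]` table header at the start of a line. The function name is hypothetical. It uses grep and awk rather than a real TOML parser, so unusual layouts may still need editing by hand.

```shell
#!/usr/bin/env bash
# Sketch: add FLY_SCALE_TO_ZERO to the [env] table of a saved fly.toml.

add_scale_to_zero() {
  local toml="$1" idle="${2:-1h}"
  # Leave the file alone if the setting is already present.
  if grep -q '^[[:space:]]*FLY_SCALE_TO_ZERO' "$toml"; then
    return 0
  fi
  if grep -q '^\[env\]' "$toml"; then
    # Print every line, and right after the [env] header add the new key.
    awk -v line="  FLY_SCALE_TO_ZERO = '$idle'" \
      '{ print } /^\[env\]/ { print line }' "$toml" > "$toml.tmp" &&
      mv "$toml.tmp" "$toml"
  else
    # No [env] table yet: append one at the end of the file.
    printf "\n[env]\n  FLY_SCALE_TO_ZERO = '%s'\n" "$idle" >> "$toml"
  fi
}
```

Chained with the other two steps, this becomes `fly config save -a "$APP_NAME_OF_DB" && add_scale_to_zero fly.toml 1h && fly deploy . --image flyio/postgres-flex:15.6`.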

Bonus Step 4: easy backups

Fly.io reminds users that, essentially, they are deploying a postgres server as one more app in their fleet, and that the service isn’t fully managed. I think they’ve built a bit more than that (for example, there’s a fly postgres subcommand that can be used to list the age and retention time of DB snapshots, which seems pretty managed to me). Still, the official docs made me nervous, so I wanted to back up everything locally and regularly, so that even something dumb like losing my fly.io account wouldn’t cost me my data.

Here’s an overview of how the backup script works:

  1. curl the app to wake it up
  2. Get a password from op, the 1Password CLI
  3. Use the fly CLI to proxy localhost to the fly instance running the db
  4. pg_dump to a local file

And here’s the script, using some variables instead of real values:

#!/usr/bin/env bash
set -euo pipefail

# Placeholders for real values: the web app's URL, the name of the
# fly.io app running postgres, and the database to dump.
: "${APP_URL:?}" "${APP_NAME_OF_DB:?}" "${DB_NAME:?}"

# get credentials
PGPASSWORD="$(op item get "fly.io - postgres creds" --fields label=password --format json | jq -r .value)"
export PGPASSWORD
# wake up the recipe website or database will be asleep when we try to connect
curl -fsS -o /dev/null "$APP_URL" || true

# proxy the database app so I can talk to it on localhost
fly proxy 5454:5432 -a "$APP_NAME_OF_DB" &

# Capture the PID of the last background process so I can kill the proxy later
FLY_PROXY_PID=$!
# stop the proxy on exit, even if pg_dump fails
trap 'kill "$FLY_PROXY_PID" 2>/dev/null || true' EXIT

echo "Proxy at PID $FLY_PROXY_PID"

sleep 30 # let the proxy come up

pg_dump -h localhost -p 5454 -U postgres -F c -b -v -f "$(date -I).backup" "$DB_NAME"
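The fixed `sleep 30` in the script above works, but it either wastes time or isn't long enough. A minimal sketch of polling instead, using PostgreSQL's `pg_isready` against the proxied port; `wait_until` is a hypothetical helper name, and the retry count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds, instead of a fixed sleep.

# Usage: wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 after MAX_TRIES failures.
wait_until() {
  local max_tries="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= max_tries; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Don't sleep after the final failed attempt.
    if ((i < max_tries)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

In the backup script, `sleep 30` could become `wait_until 30 1 pg_isready -h localhost -p 5454`. That returns as soon as the proxy accepts connections and fails after roughly 30 seconds otherwise.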

Why Migrate

This is a separate section of the post, because it might turn into a bit of a rant. Basically, I am migrating away from RDS more than away from AppRunner. I was using Aurora Serverless v1 because it scales to zero when idle, and I was hoping it would keep this hobby project from having very much IaaS spend. Sometime this spring, the cluster started spinning up to 2 capacity units for 47 hours at a time, each episode costing slightly less than $3. This was super frustrating, since it wasn’t legit traffic causing the spin-ups.

I feel like AWS marketing uses “Serverless” the way 90s and 2000s Microsoft marketing used the word “Visual”: there was “Visual Basic”, and “Visual Studio”, and “Visual C#.Net”, and whatever. I think the origin is that, when Visual Studio came out, it had a GUI editor for WinForms UIs, which was really cool, but then the marketing folks decided that products with “Visual” in the name sell, and stuck the word on everything. AWS seems to be doing this with serverless now. Lambda is pretty serverless in that (1) I never noticeably wait for servers and (2) I don’t pay any actual money for functions that aren’t being invoked. Aurora Serverless doesn’t approach that serverless magic of scale-to-zero cost and rapid start times, so the word “serverless” feels like an afterthought.

Conclusion

Switching to fly.io let me capture some of the magic that CloudFoundry and Heroku had back in the day. I can just write code. DevOps in a box, just add credit card. It was really nice.

Another nice benefit is that cold starts are much faster on fly.io: visiting my app after it has been idle for a while results in a quick startup.

The main thing I don’t have with fly.io is a good sense that this is a battle-tested thing that will Stay Up. Since I’ve only moved a hobby project at this point, it’s hard to say whether fighting a prod outage on a Friday night is better or worse at fly.io. I will say, the control surface I’ve seen is simpler, but it’s also a much newer platform, and it might have gaps and rough edges that I haven’t found yet.

Anyway, for now, fly.io has my vote for hobby projects, and I’d recommend you check it out.

Till next time, happy learning!
– Will
