Upgrades and Migrator Design
This doc is based on a preface for [RFC 850 WIP: Improving The Upgrade Experience](https://docs.google.com/document/d/1ne8ai60iQnZfaYuB7QLDMIWgU5188Vn4_HBeUQ3GASY/edit) and acts as a summary of the current design of our upgrade and database schema management. Additions have been made since. The doc's initial aim was to provide general information about the relationship of our migrator service to our deployment types and migrator's dependencies during the release process.
Migrator Overview
The migrator service is a short-lived container responsible for managing Sourcegraph's databases (pgsql (also referred to as frontend), codeintel-db, and codeinsights-db), and running schema migrations during startup and upgrades.
Its design accounts for various unique characteristics of versioning and database management at Sourcegraph, specifically graph-based schema migrations, out-of-band migrations, and periodic schema migration squashing.
Sourcegraph utilizes a directed acyclic graph of migration definitions, rather than a linear chain. In Sourcegraph's early days, when schema migrations were applied linearly, schema changes were frequent enough that they generally conflicted with the master branch by the time a PR passed CI. Moving to a graph of migrations means devs don't need to worry about teammates' concurrent schema changes unless they are working on the same table.
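To make the graph idea concrete, here is a minimal sketch in Python (not the language migrator is written in) of how an apply order can be derived from parent metadata. The migration IDs and the MIGRATION_PARENTS mapping are made up for illustration; only the notion of parent/child metadata comes from this doc.

```python
from graphlib import TopologicalSorter

# Hypothetical migration IDs mapped to the IDs of their parents.
# Two teammates branched from 1001 independently; 1004 depends on both branches.
MIGRATION_PARENTS = {
    1001: [],
    1002: [1001],
    1003: [1001],
    1004: [1002, 1003],
}


def apply_order(parents):
    """Return migration IDs so that every parent comes before its children."""
    return list(TopologicalSorter(parents).static_order())


order = apply_order(MIGRATION_PARENTS)
print(order)
```

Migrations 1002 and 1003 have no ordering constraint between them, which is why two concurrent changes do not conflict in this model unless one is declared a parent of the other.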
Similarly, squashing schema migrations into a root definition reduced the number of migrations run on startup, alleviating a common issue in which frequent transaction locks caused migrations to fail on Sourcegraph startup. You can learn more in our migrations overview docs. Information on out-of-band migrations can also be found there.
Migrator, together with its relevant artifacts in the sourcegraph/sourcegraph repo, can be viewed as an orchestrator with two special functions:
- Migrator constructs migration plans given version ranges and a table of migrations which have been successfully applied (each schema has a table to track applied migrations within that schema). This logic is supported by a variety of files generated during releases and depends on the parent/child metadata of migrations generated via the sg migration tool.
- Migrator manages out-of-band (OOB) migrations. These are data migrations that must be run within specific schema boundaries. Running OOB migrations at or after the deprecated version is unsupported. Migrator ensures that the necessary OOB migrations are run at stopping points in a multiversion upgrade; learn more here.
CLI design
Migrator is designed with a CLI tool interface, taking various commands to alter the state of the database and apply necessary schema changes. This design was initially implemented as a tool for TS team members to assist in multiversion upgrades, and because it could easily be included across multiple deployment methods as a containerized process run separately from the frontend during startup. It replaced earlier go/migrate-based strategies, which ran in the frontend on startup and originally caused issues. While the migrator can operate as a CLI tool, it's containerized as if it were another application, which allows it to be run as an initContainer (in Kubernetes deployments) or as a dependent service (for Docker Compose deployments) to ensure that the necessary migrations are applied before startup of the application proper. Check out RFC 469 to learn more.
Important Commands
The most important migrator commands are up and upgrade, with drift as a notable mention:
- Up: The default command of migrator. up ensures that, for a given version of the migrator, every migration defined at that build has been successfully applied in the connected database. This is especially important for ensuring that patch version migrations are run. It must be run with the migrator image/build of the target version. If migrator and frontend pods are deployed in version lockstep, up ensures that ALL migrations required by the frontend will be successfully applied prior to boot. It is syntactic sugar over a more internal upto command.
- Upgrade: upgrade runs all migrations defined between two minor versions of Sourcegraph and requires that other services which may access the database are brought down. Before running, the database is checked for any schema drift in order to prevent a failure while attempting a migration. upgrade relies on stitched-migration-graph.json.
- Drift: This command runs a diff between the current database schema and an expected definition packaged in migrator during the release process. Many migrator operations run this check before proceeding to ensure the database is in the expected state.
In general, the up command can be thought of as a standard upgrade (rolling upgrades with no downtime), while the upgrade command is what enables multiversion upgrades. In part, up was designed to maintain our previous upgrade policy and is thus run as an initContainer (or initContainer-like mechanism) of the frontend, i.e. between two versions of Sourcegraph, the antecedent Sourcegraph services will continue to work after the consequent version's migrations have been applied.
Current Startup Dependency
All of our deployment types currently utilize migrator during startup. The frontend service won't start until migrator has been run with the default up command. The frontend service will also validate the expected schema (and OOB migration progress), and die on startup if this validation pass fails. This ensures that the expected migrations for the version in question have been run.
In docker-compose (see diagram), this is accomplished via a chain of depends_on clauses in the docker-compose.yaml (link).
For our Kubernetes-based deployments (including the AMIs), migrator is run as an initContainer within the frontend, utilizing the up command on the given pod's startup.
Auto-upgrade
Migrator has been incrementally improved over the last year in an attempt to get closer and closer to auto-upgrades. After migrator v5.0.0, logic was added to the pgsql database and the frontend/frontend-internal service to attempt an automatic upgrade to the latest version of Sourcegraph on the startup of the frontend.
For more information about how this works see the docs. Some notable points:
- The upgrade operations in this case are triggered and run by the frontend container.
- Migrator looks for the existence of the env var SRC_AUTOUPGRADE=true on services sourcegraph-frontend, sourcegraph-frontend-internal, and migrator. Otherwise it looks in the frontend db for the value of the autoupgrade column. These checks are performed with either the up or upgrade commands defined on the migrator.
- The internal connections package to the DB now uses a special sentinel value to make connection attempts sleep if migrations are in progress.
- A limited frontend is served by the frontend during an autoupgrade, displaying progress of the upgrade and any drift encountered.
- All autoupgrades hit the multiversion upgrade endpoint and assume downtime for all Sourcegraph services besides the migrator and dbs.
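The precedence between the environment variable and the database column described above can be sketched as follows. This is an illustrative Python model, not migrator's actual code; the function name and its arguments are hypothetical, and only SRC_AUTOUPGRADE and the autoupgrade column come from this doc.

```python
def autoupgrade_enabled(env, db_autoupgrade):
    """Decide whether an autoupgrade should be attempted.

    env: mapping of environment variables visible to the service.
    db_autoupgrade: value of the autoupgrade column read from the frontend db.
    """
    # The env var wins when it is explicitly set to "true".
    if env.get("SRC_AUTOUPGRADE") == "true":
        return True
    # Otherwise fall back to the value stored in the database.
    return bool(db_autoupgrade)


print(autoupgrade_enabled({"SRC_AUTOUPGRADE": "true"}, False))
print(autoupgrade_enabled({}, True))
print(autoupgrade_enabled({}, False))
```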
Migrator Release Artifacts
During the release of migrator we construct and build some artifacts used by migrator to support its operations. Different artifacts must be generated depending on the release type:
- Major
  - lastMinorVersionInMajorRelease: Used to evaluate which oobmigrations must run; it must be updated every major release. This essentially tells us when a minor version becomes a major version. It may be useful elsewhere at some point.
- Minor
  - maxVersionString: Defined in consts.go, this string is used to tell migrator the latest minor version targetable for MVU and oobmigrations. If not updated, multiversion upgrades cannot target the latest release. Note this is used to determine how many versions should be included in the stitched-migration-graph.json file.
  - stitched-migration-graph.json: Used by multiversion upgrades to unsquash migrations. Generated during release here. Learn more below.
- Patch
  - Git_versions: Defined in generate.sh, this string array contains versions of Sourcegraph whose schemas should be embedded in migrator during a migrator build to enable drift detection without having to pull them directly from GitHub or, for older versions, from a pre-prepared GCS bucket (this is necessary in air-gapped environments). This should be kept up to date with maxVersionString. Learn more.
  - Squashed.sql: For each database we generate a new squashed.sql file. It is used to help suggest fixes for certain types of database drift. For example, for a missing database column, this search is used to suggest a definition.
  - Schema descriptions: Schema descriptions are embedded in migrator on each new build as a reference for the expected schema state during drift checking.
Stitched Migration JSON and Squashing Migrations
(Figure: migration graph generated with sg migrations visualize --db frontend on v5.2.0)
Squashing Migrations
Periodically we squash all migrations from two versions behind the current version into a single migration using sg migration squash. This reduces the time required to initialize a new database. It also means that the migrator image is built with a set of embedded definitions that doesn't reflect the definition set in older versions. For multiversion upgrades this presents a problem. To get around this, on minor releases we generate a stitched-migration-graph.json file. Reference links: Bazel, Embed, Generator
Stitched Migration Graph
stitched-migration-graph.json stitches (you can think of this as unsquashing) historic migrations using git magic, giving the migrator a reference of older migrations. This serves a few purposes:
- When jumping across multiple versions we do not have access to a full record of migration definitions on migrator disk, because some migrations will likely have been squashed. Therefore we need a way to ensure we don't miss migrations on a squash boundary. Note we can't just re-apply the root migration after a squash, because some of the schema state it creates may already exist in the database. In other words, the squashed root migration isn't always idempotent.
- During a multiversion upgrade migrator must schedule out-of-band migrations to be started at some version and completed before upgrading past some later version. Migrator needs access to the unsquashed migration definitions to know which migrations must have run at the time the OOB migration is triggered.
In standard/up upgrades stitched-migration-graph.json isn't necessary. This is because up determines which migrations to run by comparing the migrations listed as already run in the relevant db's migration_logs table directly against the migration definitions embedded in the migrator disk at build time for the current version, and running any which haven't been run. We never squash away the previous minor version of Sourcegraph, so we can guarantee that the migration_logs table always has migrations in common with the migration definitions on disk.
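The comparison that up performs can be pictured with a small Python sketch. It is a simplified model under stated assumptions: migrations are represented as plain integer IDs, the embedded list is already in a valid apply order, and both function names are hypothetical rather than taken from migrator.

```python
def pending_migrations(embedded_ids, applied_ids):
    """Return embedded migration IDs not yet recorded as applied, in definition order."""
    applied = set(applied_ids)
    return [m for m in embedded_ids if m not in applied]


def shares_history(embedded_ids, applied_ids):
    """True when the applied log and the embedded definitions overlap at all."""
    return bool(set(embedded_ids) & set(applied_ids))


# Hypothetical IDs: definitions embedded at build time vs. rows in migration_logs.
embedded = [1000, 1001, 1002, 1003]
applied = [1000, 1001]

print(pending_migrations(embedded, applied))
print(shares_history(embedded, applied))
```

When the two sets have nothing in common, a simple set difference like this cannot tell which definitions are genuinely missing, which is the situation the stitched graph exists to handle.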
Database Drifts
Database drift is any difference between the expected state of the database at a given version and the actual state of the database. How does it come to be? We've observed drift in customer databases for a number of reasons, both due to admin tampering and problems in our own release process:
- A backup/restore process that didn't include all objects: This notably happened in a customer production instance causing their database to have no indexes, primary keys, or unique constraints defined.
- Modifying the database explicitly: These are untracked manual changes to the database schema, observed occasionally in the wild, and in our dotcom deployment.
- Migration failures: Failures that occur during a multiversion upgrade or the up command leave the set of successfully applied migrations somewhere between two versions, where drift is well-defined.
- Site admin error: Errors in git-ops, like deploying to production with the wrong version of the Sourcegraph manifests, have introduced drift. Another source is following an incorrect downgrade procedure.
- Historic bugs: We, at one point, too eagerly backfilled records that we should've instead applied. This bug was the result of changes being backported to the metadata definition of a migration's parent migrations, violating assumptions made during the generation of the stitched-migration-graph.json.
Drift that exists at the time of a migration can cause the migration to fail when it references a table property that is not in the expected state. The application may also not behave as expected while drift is present. Migrator includes a drift command intended to help admins and CS team members diagnose and resolve drift in customer instances. Multiversion upgrades in particular check for drift before starting, unless the --skip-drift-check argument is supplied.
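As a rough illustration of what a drift check compares, the Python sketch below diffs an expected schema against an actual one. The schema shape used here (a mapping of table name to column name to type) and the find_drift helper are assumptions for illustration; migrator's real schema descriptions are richer and also cover objects such as indexes and constraints.

```python
def find_drift(expected, actual):
    """Return human-readable differences between two {table: {column: type}} schemas."""
    problems = []
    for table, columns in expected.items():
        if table not in actual:
            problems.append(f"missing table {table}")
            continue
        for column, col_type in columns.items():
            found = actual[table].get(column)
            if found is None:
                problems.append(f"missing column {table}.{column}")
            elif found != col_type:
                problems.append(
                    f"unexpected type for {table}.{column}: {found} (expected {col_type})"
                )
        for column in actual[table]:
            if column not in columns:
                problems.append(f"unexpected column {table}.{column}")
    for table in actual:
        if table not in expected:
            problems.append(f"unexpected table {table}")
    return problems


expected_schema = {"users": {"id": "integer", "name": "text"}}
actual_schema = {"users": {"id": "integer"}, "tmp": {}}
drift = find_drift(expected_schema, actual_schema)
print(drift)
```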
Implementation Details
Versions & Runner
On startup the migrator service creates a runner. The runner is responsible for connecting to the given databases and running any schema migrations defined in the embedded migrations directory via the up entry command.
A runner infers the expected state of the database from the schema definitions embedded in migrator when a migrator image is compiled. This means that migrator's concept of a version is the set of migrations defined in the migrations directory at compile time. In this way the up command easily facilitates dev versions of database schemas.
Defining "version" at compile time also tightly binds migrator's concept of "version" to a given tag of Sourcegraph.
The up command will only initialize a version of Sourcegraph when the migrator used to run up is the tagged version associated with the desired Sourcegraph version. For this reason a later version of migrator cannot be used to initialize an earlier version of Sourcegraph.
For example, suppose you use the latest migrator release, v5.6.9, to run the upgrade command, bringing your databases from v4.2.0 to v5.6.3 rather than v5.6.9 because your security team hasn't approved images past that point. The upgrade command will have applied the OOB migrations and schema migrations defined up to v5.6.0, the last minor release. To start your instance you'll need to run migrator up using the v5.6.3 image. This applies any schema migrations which were defined in the patch releases up to v5.6.3, and which are therefore present in the embedded migrations directory at the time of that migrator's compilation.
Migration Plan
While the up command's concept of version is a set of embedded definitions, the upgrade command does have a concept of schema migrations associated with a version. This is the stitched-migration-graph.json. This file is generated on minor releases of Sourcegraph, and defines the migrations expected to have been run at each minor version. This is necessary for two reasons:
- The root migration defined in the migration directory is a squashed migration, meaning it represents many migrations composed into a single SQL statement.
- Out-of-band migrations are triggered at a given version, and must complete before the schema is changed in some subsequent version.
This means that when applying migrations defined across multiple versions, migrator must stop and wait for OOB migrations to complete. To do this it needs to know which migrations should have run at a given stopping point, which may have been obscured by a subsequent squashing operation. This is where the stitched-migration-graph.json file comes into play. It defines the set of migrations that should have been run at a given minor version, which helps construct a "migration plan", or path, for the runner to traverse.
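One way to picture the stopping points is the Python sketch below. It is a deliberately simplified model, not migrator's planning logic: versions are (major, minor) tuples, each hypothetical OOB migration carries an introduced version and an optional deprecated version, and the sketch only works out which OOB migrations must finish inside a given upgrade range and by when.

```python
def schedule_oob(from_version, to_version, oob_migrations):
    """Return (finish_before, start_at, name) tuples for OOB migrations whose
    deprecation falls inside the upgrade range, sorted by deadline."""
    plan = []
    for m in oob_migrations:
        deprecated = m["deprecated"]
        if deprecated is None:
            continue  # never deprecated, so no deadline to plan around
        if not (from_version < deprecated <= to_version):
            continue  # deadline is outside this upgrade range
        # The migration cannot start before it exists or before the upgrade begins.
        start_at = max(m["introduced"], from_version)
        plan.append((deprecated, start_at, m["name"]))
    return sorted(plan)


# Hypothetical OOB migrations, named a through d for illustration only.
OOB = [
    {"name": "a", "introduced": (3, 40), "deprecated": (4, 5)},
    {"name": "b", "introduced": (5, 0), "deprecated": (5, 3)},
    {"name": "c", "introduced": (5, 5), "deprecated": None},
    {"name": "d", "introduced": (3, 30), "deprecated": (4, 1)},
]

stops = schedule_oob((4, 2), (5, 6), OOB)
print(stops)
```

Each entry marks a point where schema migrations would have to pause: the schema may advance to start_at, the named OOB migration runs to completion, and only then may the schema move to finish_before.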
The stitched-migration-graph.json file is generated on every minor release, and is informed by the state of the acyclic graph of migrations defined in the migration directory.