Continuous integration

SOC2/GN-105 SOC2/GN-106

Sourcegraph uses a continuous integration and delivery tool, Buildkite, to help ensure a consistent build, test and deploy process. Software changes are systematically required to complete all steps within the continuous integration tool workflow prior to production deployment, in addition to being peer reviewed.

Sourcegraph also maintains a variety of tooling on GitHub Actions for continuous integration and repository maintenance purposes.

Buildkite pipelines

Tests are automatically run in our various Buildkite pipelines when you push your changes to GitHub. Pipeline steps are generated using the pipeline generator.

To see what checks will get run against your current branch, use sg:

sg ci preview

A complete reference of all available pipeline types and steps is available in the generated Pipeline reference. You can also see these docs locally with sg ci docs.

You can also request builds for your changes using sg ci build.

To learn about making changes to our Buildkite pipelines, see Pipeline development.

Pipeline steps

Soft failures

SOC2/GN-106

Many steps in Sourcegraph's Buildkite pipelines allow for soft failures, which means that even if they fail, they do not cause the entire build to fail.

In the Buildkite UI, soft failures currently look like the following, with a triangular warning sign (not to be mistaken for a hard failure!):

soft fail in Buildkite UI

We use soft failures for the following reasons only:

  • Steps that determine whether a subsequent step should run, where a soft failure is the only mechanism Buildkite provides to signal that a later step should be skipped.
  • Regular analysis tasks, where soft failures serve as a monitoring indicator to warn the team responsible for fixing issues.
  • Temporary exceptions to accommodate experimental or in-progress work.

You can find all usages of soft failures by searching the codebase for the soft fail step option.

All other failures are hard failures.
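For illustration, opting a step into a soft failure in the pipeline generator looks roughly like the following sketch, which assumes the bk.SoftFail step option and a hypothetical analysis script:

  pipeline.AddStep(":mag: Analysis task",
    bk.Cmd("./dev/ci/analysis.sh"), // hypothetical script
    // Soft fail: report the failure, but don't fail the build.
    bk.SoftFail(),
  )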

Image vulnerability scanning

Our CI pipeline uses Trivy to scan our Docker images for security vulnerabilities. Refer to our Pipeline reference to see which pipelines Trivy checks run in.

If there are any HIGH or CRITICAL severities in a Docker image that have a known fix:

  1. The CI pipeline will create an annotation that contains links to reports describing the vulnerabilities.
  2. The Trivy scanning step will soft fail. Note that soft failures do not fail builds or block deployments. They simply highlight the failing step for further analysis.

We also run separate vulnerability scans for our infrastructure.

Pipeline health

Maintaining Buildkite pipeline health is a critical part of ensuring we ship a stable product - changes that make it to the main branch may be deployed to various Sourcegraph instances, and having a reliable and predictable pipeline is crucial to ensuring bugs do not make it to production environments.

To enable this, we address flakes as they arise and mitigate the impacts of pipeline instability with branch locks.

Branch locks

buildchecker is a tool that responds to periods of consecutive build failures on the main branch of the Sourcegraph Buildkite pipeline. If it detects a series of failures on the main branch, merges to main will be restricted to members of the Sourcegraph team who authored the failing commits until the issue is resolved - this is referred to as a "branch lock". When a build passes on main again, buildchecker will automatically unlock the branch.

Authors of the most recent failed builds are responsible for investigating failures. Please refer to the Continuous integration playbook for step-by-step guides on what to do in various scenarios.

Flakes

A flake is defined as a test or script that is unreliable or non-deterministic, i.e. it exhibits both a passing and a failing result with the same code. In other words: something that sometimes fails, but if you retry it enough times, it passes, eventually.

Tests are not the only thing that can be flaky - flakes can also encompass sporadic infrastructure issues and unreliable steps.

Flaky tests

Typical reasons why a test may be flaky:

  • Race conditions or timing issues
  • Caching or inconsistent state between tests
  • Unreliable test infrastructure (such as CI)
  • Reliance on third-party services that are inconsistent
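As an example, the following sketch of a Go test is flaky because it races against real time (newCache is a hypothetical type under test):

  func TestCacheExpiry(t *testing.T) {
    c := newCache(10 * time.Millisecond) // hypothetical cache with a TTL
    c.Set("k", "v")
    // Flaky: we sleep for exactly the TTL, so on a slow CI machine the
    // entry may or may not have expired by the time we check.
    time.Sleep(10 * time.Millisecond)
    if _, ok := c.Get("k"); ok {
      t.Fatal("expected entry to have expired")
    }
  }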

If a flaky test is discovered, immediately use language-specific functionality to skip the test and open a PR to disable it.

If the language or framework allows for a skip reason, include a link to the issue tracking the re-enabling of the test, or leave a docstring with a link.
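In Go, for example, the skip looks like this (the test name is illustrative; link the real tracking issue):

  func TestFlakyFeature(t *testing.T) {
    // Skipped while the flake is investigated - see the tracking issue.
    t.Skip("flaky, see https://github.com/sourcegraph/sourcegraph/issues/<issue-number>")
    // ...
  }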

Then open an issue to investigate the flaky test (use the flaky test issue template), and assign it to the most likely owner.

Flaky steps

If a step is flaky, we need to get the build back to reliable as soon as possible. If there is not already a discussion in #buildkite-main, create one and link to the steps you take. Here are the recommended approaches, in order:

  1. Revert the PR if a recent change introduced the instability. Ping the author.
  2. Use the Skip StepOpt when creating the step. Include a reason and a link to context. This will still show the step on builds so we don't forget about it.

An example use of Skip:

--- a/enterprise/dev/ci/internal/ci/operations.go
+++ b/enterprise/dev/ci/internal/ci/operations.go
@@ -260,7 +260,9 @@ func addGoBuild(pipeline *bk.Pipeline) {
 func addDockerfileLint(pipeline *bk.Pipeline) {
        pipeline.AddStep(":docker: Lint",
                bk.Cmd("./dev/ci/docker-lint.sh"),
+               bk.Skip("2021-09-29 example message https://github.com/sourcegraph/sourcegraph/issues/123"),
        )
 }

Flaky infrastructure

If the build or test infrastructure itself is flaky, then open an issue with the team/devx label and notify the Developer Experience team.

Also see Buildkite infrastructure.

Pipeline development

The source code of the pipeline generator is in /enterprise/dev/ci. Internally, the pipeline generator determines what gets run over contributions based on:

  1. Run types, determined by branch naming conventions, tags, and environment variables
  2. Diff types, determined by what files have been changed in a given branch

The above factors are then used to determine the appropriate operations, composed of step options, that translate into steps in the resulting pipeline.
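At a high level, the generator can be thought of as mapping those two inputs to a list of operations. The following is a simplified sketch with hypothetical names (RunType, Diff, DiffGo, and the operations), not the actual generator code:

  // An operation appends steps to the pipeline being generated.
  type operation func(*bk.Pipeline)

  func generate(runType RunType, diff Diff) *bk.Pipeline {
    var ops []operation
    // Diff types gate checks on the files a branch touches.
    if diff.Has(DiffGo) {
      ops = append(ops, addGoTests)
    }
    // Run types shape the overall pipeline, e.g. main-branch builds.
    if runType == MainBranch {
      ops = append(ops, addDeployTriggers)
    }
    pipeline := &bk.Pipeline{}
    for _, op := range ops {
      op(pipeline)
    }
    return pipeline
  }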

Run types

Diff types

Operations

Developing PR checks

To create a new check that can run on pull requests on relevant files, refer to how diff types work to get started.

Then, you can add a new check to CoreTestOperations. Make sure to follow the best practices outlined in the docstring.
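A new check is typically just another operation that adds a step, following the shape of existing operations such as addDockerfileLint above. A minimal sketch with hypothetical names:

  func addDocsLint(pipeline *bk.Pipeline) {
    pipeline.AddStep(":memo: Lint docs",
      bk.Cmd("./dev/ci/docs-lint.sh"), // hypothetical script
    )
  }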

For more advanced pipelines, see Run types.

Step options

Creating annotations

Annotations get rendered in the Buildkite UI to present notices about the build to the viewer. The pipeline generator provides an API for this that, at a high level, works like this:

  1. In your script, leave a file in ./annotations:
if [ $EXIT_CODE -ne 0 ]; then
  echo -e "$OUT" >./annotations/docsite
fi
  2. In your pipeline operation, replace the usual bk.Cmd with bk.AnnotatedCmd:
  pipeline.AddStep(":memo: Check and build docsite",
    bk.AnnotatedCmd("./dev/check/docsite.sh", bk.AnnotatedCmdOpts{
      Annotations: &bk.AnnotationOpts{},
    }))
  3. That's it!

For more details about best practices and additional features and capabilities, please refer to the bk.AnnotatedCmd docstring.

Caching build artefacts

To cache artefacts and speed up steps, see How to cache CI artefacts.

Observability

Pipeline command tracing

Every successful build of the sourcegraph/sourcegraph repository comes with an annotation pointing at the full trace of the build on Honeycomb.io. See the Buildkite board on Honeycomb for an overview.

Individual commands are tracked from the perspective of a given step:

  pipeline.AddStep(":memo: Check and build docsite", /* ... */)

This will result in a single trace span for the ./dev/check/docsite.sh script, but the following will have individual trace spans for each yarn command:

  pipeline.AddStep(fmt.Sprintf(":%s: Puppeteer tests for %s extension", browser, browser),
    // ...
    bk.Cmd("yarn --frozen-lockfile --network-timeout 60000"),
    bk.Cmd("yarn workspace @sourcegraph/browser -s run build"),
    bk.Cmd("yarn run cover-browser-integration"),
    bk.Cmd("yarn nyc report -r json"),
    bk.Cmd("dev/ci/codecov.sh -c -F typescript -F integration"),

Therefore, it's beneficial for tracing purposes to split a step into multiple commands, if possible.

Test analytics

Our test analytics is currently powered by a Buildkite beta feature for analysing individual tests across builds called Buildkite Analytics. This tool enables us to observe the evolution of each individual test on the following metrics: duration and flakiness.

Browse the dashboard to explore the metrics and optionally set monitors that will alert if a given test or a test suite is deviating from its historical duration or flakiness.

In order to track a new test suite, test results must be converted to JUnit XML reports and uploaded to Buildkite. The pipeline generator provides an API for this that, at a high level, works like this:

  1. In your script, leave your JUnit XML test report in ./test-reports
  2. Create a new Test Suite in the Buildkite Analytics UI.
  3. In your pipeline operation, replace the usual bk.Cmd with bk.AnnotatedCmd:
pipeline.AddStep(":jest::globe_with_meridians: Test",
  withYarnCache(),
  bk.AnnotatedCmd("dev/ci/yarn-test.sh client/web", bk.AnnotatedCmdOpts{
    TestReports: &bk.TestReportOpts{/* ... */},
  }),
)
  4. That's it!

For more details about best practices and additional features and capabilities, please refer to the bk.AnnotatedCmd docstring.

Buildkite infrastructure

Our continuous integration system is composed of two parts: a central server controlled by Buildkite, and agents operated by Sourcegraph within our own infrastructure. In order to provide strong isolation across builds, and to prevent a previous build from affecting the next one, our agents are stateless jobs.

When a build is dispatched by Buildkite, each individual job will be assigned to an agent in a pristine state. Each agent will execute its assigned job, automatically report back to Buildkite, and finally shut itself down. A fresh agent will then be created and will stand in line for the next job.

This means that our agents are totally stateless, exactly like the runners used in GitHub Actions.

Also see Flaky infrastructure, Continuous integration infrastructure, and the Continuous integration changelog.

Pipeline setup

To set up Buildkite to use the rendered pipeline, add the following step in the pipeline settings:

go run ./enterprise/dev/ci/gen-pipeline.go | buildkite-agent pipeline upload

Managing secrets

The term secret refers to authentication credentials like passwords, API keys, tokens, etc. which are used to access a particular service. Our CI pipeline must never leak secrets:

  • to add a secret, use the Secret Manager on Google Cloud and then inject it at deployment time as an environment variable in the CI agents, which will make it available to every step.
  • use an environment variable name with one of the following suffixes to ensure it gets redacted in the logs: *_PASSWORD, *_SECRET, *_TOKEN, *_ACCESS_KEY, *_SECRET_KEY, *_CREDENTIALS
  • while environment variables can be assigned when declaring steps, they should never be used for secrets, because they won't get redacted, even if they match one of the above patterns (see the sketch below).
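For example, a step-level environment variable is acceptable for non-secret configuration only; anything sensitive must instead reach the agents via the Secret Manager. A sketch, assuming the bk.Env step option and hypothetical names:

  pipeline.AddStep(":package: Publish",
    // OK: non-secret configuration on the step itself.
    bk.Env("PUBLISH_CHANNEL", "beta"),
    // NOT OK: a secret set here would appear unredacted in logs, even
    // with a *_TOKEN suffix. The script should instead read NPM_TOKEN
    // from the agent environment, injected from the Secret Manager.
    bk.Cmd("./dev/ci/publish.sh"),
  )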

GitHub Actions

buildchecker

buildchecker, our branch lock management tool, runs in GitHub Actions - see the workflow specification.

To learn more about buildchecker, refer to the buildchecker source code and documentation.

pr-auditor

pr-auditor, our PR audit tool, runs in GitHub Actions - see the workflow specification.

To learn more about pr-auditor, refer to the pr-auditor source code and documentation.

Third-party licenses

We use the license_finder tool to check third-party dependencies for their licenses. It runs as a GitHub Action on pull requests, which will fail if one of the following occurs:

  • If the license for a dependency cannot be inferred. To resolve:
    • Use license_finder licenses add <dep> <license> to set the license manually
  • If the license for a new or updated dependency is not on the list of approved licenses. To resolve, either:
    • Remove the dependency
    • Use license_finder ignored_dependencies add <dep> --why="Some reason" to ignore it
    • Use license_finder permitted_licenses add <license> --why="Some reason" to allow the offending license

The license_finder tool can be installed using gem install license_finder. You can run the script locally using:

# updates ThirdPartyLicenses.csv
./dev/licenses.sh

# runs the same check as the one used in CI, returning status 1
# if there are any unapproved dependencies ('action items')
LICENSE_CHECK=true ./dev/licenses.sh

The ./dev/licenses.sh script will also output some license_finder configuration for debugging purposes - this configuration is based on the doc/dependency_decisions.yml file, which tracks decisions made about licenses and dependencies.

For more details, refer to the license_finder documentation.