Life of a repository

This document describes how our backend systems clone and update repositories from a code host.

High level

A site administrator adds a code host configuration.
repo-updater periodically syncs all repository metadata from configured code hosts.
We poll the code host's API based on the configuration.
We add/update/remove entries in our repo table.
All repositories in our repo table are in a scheduler on repo-updater, which ensures they are cloned and kept up to date on gitserver.

Our guiding principle is to ensure all repositories configured by a site administrator are cloned and up to date. However, we need to avoid overloading a code host with API and Git requests.

Services

repo-updater is a singleton service responsible for communicating with code host APIs and coordinating the state we synchronise from them. It maintains the repo table, which other services read, and it schedules clones and fetches on gitserver. Because anything that communicates with a code host API belongs in repo-updater, our campaigns and background permissions syncers also live there.

gitserver is a scalable, stateful service which clones git repositories and can run git commands against them. All data maintained on this service comes from cloning an upstream repository. We shard the set of repositories across the gitserver replicas. The main RPC gitserver supports is exec, which returns the output of the specified git command.
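
To make the sharding idea concrete, here is a minimal sketch of how a repository name could be mapped to one gitserver replica. The helper addrForRepo, the FNV hash, and the replica addresses are assumptions for illustration only; this document does not describe the actual assignment scheme.

```go
package main

import "hash/fnv"

// addrForRepo picks one replica address for a repository by hashing its name.
// Hypothetical helper: the hash function and modulo scheme are assumptions,
// not the scheme gitserver clients necessarily use.
func addrForRepo(repo string, addrs []string) string {
	if len(addrs) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(repo)) // Write on a hash.Hash never returns an error
	return addrs[int(h.Sum32()%uint32(len(addrs)))]
}
```

The useful property is that every caller with the same replica list sends requests for a given repository to the same replica, so each repository only needs to be cloned once.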

Discovery

Before we can clone a repository, we must first discover that it exists. A site administrator controls this by adding a code host configuration. Typically a code host will have an API as well as git endpoints. A code host configuration typically specifies how to communicate with the API and which repositories to ask the API for. For example:

{
  "url": "https://github.com",
  "token": "deadbeaf",
  "repositoryQuery": ["affiliated"]
}

This is a GitHub code host configuration for github.com using the personal access token deadbeaf. It will ask GitHub for all affiliated repositories. Follow GithubSource.listRepositoryQuery to find the actual API call we make.

Discovering the repositories for each code host configuration is abstracted behind the Source interface.

// A Source yields repositories to be stored and analysed by Sourcegraph.
// Successive calls to its ListRepos method may yield different results.
type Source interface {
	// ListRepos sends all the repos a source yields over the passed in channel
	// as SourceResults
	ListRepos(context.Context, chan SourceResult)
	// ExternalServices returns the ExternalServices for the Source.
	ExternalServices() ExternalServices
}

Syncing

We keep a list of all repositories on Sourcegraph in the repo table. This gives us a code-host-independent list of repositories that we can query quickly. repo-updater periodically lists all repositories from all sources and updates the table. We need to list everything so that we can detect which repositories to delete. See Syncer.Sync for details.
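
The reason a full listing is needed can be seen in a small diffing sketch: a repository is only known to be deleted when it is present in the table but absent from the complete sourced set. The Diff type and diffRepos function are hypothetical, and repositories are reduced to a name mapped to an opaque metadata string; see Syncer.Sync for what the real code does.

```go
package main

import "sort"

// Diff groups repository names by the action a sync would take.
type Diff struct {
	Added, Modified, Deleted []string
}

// diffRepos compares the stored table against a complete listing from all
// sources. Both maps go from repository name to a metadata string.
// Hypothetical helper for illustration only.
func diffRepos(stored, sourced map[string]string) Diff {
	var d Diff
	for name, meta := range sourced {
		old, ok := stored[name]
		switch {
		case !ok:
			d.Added = append(d.Added, name)
		case old != meta:
			d.Modified = append(d.Modified, name)
		}
	}
	// Anything stored but no longer sourced must be removed. This step is
	// only safe when sourced is a complete listing.
	for name := range stored {
		if _, ok := sourced[name]; !ok {
			d.Deleted = append(d.Deleted, name)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(d.Added)
	sort.Strings(d.Modified)
	sort.Strings(d.Deleted)
	return d
}
```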

Git Update Scheduler

We can't clone all repositories concurrently due to resource constraints in Sourcegraph and on the code host. So repo-updater has an update scheduler. Cloning and fetching are treated in the same way, but priority is given to newly discovered repositories.

The scheduler is divided into two parts:

updateQueue is a priority queue of repositories to clone/fetch on gitserver.
schedule places a repository onto the updateQueue when it thinks the repository should be updated. This is what paces out updates for a repository. It contains heuristics so that recently updated repositories are checked more frequently.

Repositories can also be placed onto the updateQueue if we receive a webhook indicating the repository has changed. (By default, we don't set up webhooks when integrating with a code host.) When a user directly visits a repository on Sourcegraph, we also enqueue it for update.
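
A priority queue like updateQueue can be sketched with the standard container/heap package. The type and method names, the two priority levels, and the FIFO tie-break are assumptions for illustration, not the real implementation; the sketch also does not deduplicate repositories that are already queued.

```go
package main

import "container/heap"

// Assumed priority levels: high for newly discovered repositories, webhooks,
// and user visits; low for routine scheduled updates.
const (
	priorityLow = iota
	priorityHigh
)

type repoUpdate struct {
	Name     string
	Priority int
	Seq      uint64 // insertion order, used to break ties first-in first-out
}

// updateQueue implements heap.Interface over pending updates.
type updateQueue struct {
	items   []*repoUpdate
	nextSeq uint64
}

func (q *updateQueue) Len() int { return len(q.items) }

func (q *updateQueue) Less(i, j int) bool {
	a, b := q.items[i], q.items[j]
	if a.Priority != b.Priority {
		return a.Priority > b.Priority // higher priority first
	}
	return a.Seq < b.Seq // then oldest first
}

func (q *updateQueue) Swap(i, j int) { q.items[i], q.items[j] = q.items[j], q.items[i] }

func (q *updateQueue) Push(x interface{}) { q.items = append(q.items, x.(*repoUpdate)) }

func (q *updateQueue) Pop() interface{} {
	n := len(q.items)
	item := q.items[n-1]
	q.items[n-1] = nil // drop the reference so it can be collected
	q.items = q.items[:n-1]
	return item
}

// enqueue adds a repository with the given priority.
func (q *updateQueue) enqueue(name string, priority int) {
	q.nextSeq++
	heap.Push(q, &repoUpdate{Name: name, Priority: priority, Seq: q.nextSeq})
}

// acquireNext removes and returns the next repository to update.
func (q *updateQueue) acquireNext() (string, bool) {
	if q.Len() == 0 {
		return "", false
	}
	return heap.Pop(q).(*repoUpdate).Name, true
}
```

With this shape, a webhook or user visit would call enqueue with priorityHigh, so that repository is fetched ahead of routine scheduled updates.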

The update scheduler has conf.GitMaxConcurrentClones workers processing the updateQueue and issuing git clone/fetch commands.