Life of a repository
This document describes how our backend systems clone and update repositories from a code host.
High level
- An admin configures a code host configuration.
- repo-updater periodically syncs all repository metadata from configured code hosts.
  - We poll the code host's API based on the configuration.
  - We add/update/remove entries in our repo table.
- All repositories in our repo table are in a scheduler on repo-updater which ensures they are cloned and updated on gitserver.
Our guiding principle is to ensure all repositories configured by a site administrator are cloned and up to date. However, we need to avoid overloading a code host with API and Git requests.
Services
repo-updater is responsible for communicating with code host APIs and co-ordinating the state we synchronise from them. It is a singleton service. It maintains the repo table which other services read, and it schedules clones/fetches on gitserver. Anything which communicates with a code host API belongs here, so our campaigns and background permissions syncers also live in repo-updater.
gitserver is a scalable stateful service which clones git repositories and can run git commands against them. All data maintained on this service is from cloning an upstream repository. We shard the set of repositories across the gitserver replicas. The main RPC gitserver supports is exec, which returns the output of the specified git command.
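To make the sharding idea concrete, here is a minimal sketch of picking a replica for a repository by hashing its name. The helper addrForRepo, the replica addresses and the choice of FNV hashing are all assumptions made for illustration; the scheme gitserver clients actually use may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// addrForRepo picks a replica for a repository by hashing its name.
// Hypothetical helper: the hash function and address list are illustrative.
// addrs must be non-empty.
func addrForRepo(repo string, addrs []string) string {
	h := fnv.New32a()
	h.Write([]byte(repo)) // Write on a hash.Hash never returns an error
	return addrs[int(h.Sum32()%uint32(len(addrs)))]
}

func main() {
	// Illustrative replica addresses, not taken from a real deployment.
	addrs := []string{"gitserver-0:3178", "gitserver-1:3178", "gitserver-2:3178"}
	for _, repo := range []string{"github.com/foo/bar", "github.com/foo/baz"} {
		fmt.Printf("%s -> %s\n", repo, addrForRepo(repo, addrs))
	}
}
```

Because the mapping depends only on the repository name and the replica list, every caller that shares the same list routes a given repository to the same replica.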
Discovery
Before we can clone a repository, we must first discover that it exists. This is configured by a site administrator setting a code host configuration. Typically a code host will have an API as well as git endpoints. A code host configuration typically specifies how to communicate with the API and which repositories to ask the API for. For example:
    {
      "url": "https://github.com",
      "token": "deadbeaf",
      "repositoryQuery": ["affiliated"],
    }
This is a GitHub code host configuration for github.com using the personal access token deadbeaf. It will ask GitHub for all affiliated repositories. Follow GithubSource.listRepositoryQuery to find the actual API call we make.
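As a rough illustration of how such a configuration might be read in Go, the sketch below decodes the three fields from the example into a struct. The type gitHubConnection and the function parseConnection are hypothetical names, and real code host configurations carry more fields than shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gitHubConnection mirrors the three fields shown in the example above.
// Hypothetical type: real code host configurations have more fields.
type gitHubConnection struct {
	URL             string   `json:"url"`
	Token           string   `json:"token"`
	RepositoryQuery []string `json:"repositoryQuery"`
}

// parseConnection decodes a strict-JSON code host configuration.
func parseConnection(raw string) (gitHubConnection, error) {
	var c gitHubConnection
	err := json.Unmarshal([]byte(raw), &c)
	return c, err
}

func main() {
	// encoding/json rejects trailing commas, so this input omits the one
	// that appears in the example configuration.
	raw := `{"url": "https://github.com", "token": "deadbeaf", "repositoryQuery": ["affiliated"]}`
	c, err := parseConnection(raw)
	if err != nil {
		fmt.Println("invalid configuration:", err)
		return
	}
	fmt.Println(c.URL, c.RepositoryQuery)
}
```

A parser that accepts comments and trailing commas would be needed to read the example verbatim; the standard library decoder used here is stricter.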
Discovering the repositories for each code host configuration is abstracted in the Source interface.
    // A Source yields repositories to be stored and analysed by Sourcegraph.
    // Successive calls to its ListRepos method may yield different results.
    type Source interface {
        // ListRepos sends all the repos a source yields over the passed in channel
        // as SourceResults
        ListRepos(context.Context, chan SourceResult)
        // ExternalServices returns the ExternalServices for the Source.
        ExternalServices() ExternalServices
    }
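The following sketch shows one way the interface could be implemented and consumed. SourceResult and ExternalServices are simplified stand-ins for the real types, and staticSource and listAll are hypothetical names introduced only for this example.

```go
package main

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the real types, which carry more fields.
type SourceResult struct {
	Repo string // repository name; the real result holds a richer repo type
	Err  error
}

type ExternalServices []string

type Source interface {
	ListRepos(context.Context, chan SourceResult)
	ExternalServices() ExternalServices
}

// staticSource is a hypothetical Source that yields a fixed list of repositories.
type staticSource struct {
	svc   string
	repos []string
}

func (s staticSource) ListRepos(ctx context.Context, results chan SourceResult) {
	for _, r := range s.repos {
		select {
		case results <- SourceResult{Repo: r}:
		case <-ctx.Done():
			return // stop early if the consumer gave up
		}
	}
}

func (s staticSource) ExternalServices() ExternalServices {
	return ExternalServices{s.svc}
}

// listAll drains a Source into a slice and stops at the first error.
func listAll(src Source) ([]string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // unblocks the producer if we return early

	results := make(chan SourceResult)
	go func() {
		defer close(results) // lets the range loop below terminate
		src.ListRepos(ctx, results)
	}()

	var names []string
	for res := range results {
		if res.Err != nil {
			return nil, res.Err
		}
		names = append(names, res.Repo)
	}
	return names, nil
}

func main() {
	src := staticSource{svc: "github", repos: []string{"github.com/foo/bar", "github.com/foo/baz"}}
	names, err := listAll(src)
	fmt.Println(names, err)
}
```

Streaming results over a channel lets the consumer start work before a large code host has finished listing, at the cost of the caller having to close the channel and handle cancellation.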
Syncing
We keep a list of all repositories on Sourcegraph in the repo table. This provides a code host independent list of repositories on Sourcegraph that we can query quickly. repo-updater periodically lists all repositories from all sources and updates the table. We need to list everything so that we can detect which repositories to delete. See Syncer.Sync for details.
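The reason a full listing is needed can be seen in a small diffing sketch: a repository is only known to be deleted if it is absent from the complete sourced set. The types Repo and Diff and the function diffRepos are hypothetical simplifications, not the code behind Syncer.Sync.

```go
package main

import "fmt"

// Repo is a simplified stand-in for a row in the repo table.
type Repo struct {
	Name        string
	Description string
}

// Diff describes the changes needed to make the stored set match the sourced set.
type Diff struct {
	Added, Modified, Deleted []string
}

// diffRepos compares what the code hosts reported against what is stored.
func diffRepos(stored, sourced []Repo) Diff {
	byName := make(map[string]Repo, len(stored))
	for _, r := range stored {
		byName[r.Name] = r
	}

	var d Diff
	seen := make(map[string]bool, len(sourced))
	for _, r := range sourced {
		if seen[r.Name] {
			continue // ignore duplicates yielded by overlapping sources
		}
		seen[r.Name] = true
		old, ok := byName[r.Name]
		switch {
		case !ok:
			d.Added = append(d.Added, r.Name)
		case old != r:
			d.Modified = append(d.Modified, r.Name)
		}
	}

	// Anything stored but not listed by any source is a candidate for deletion.
	for _, r := range stored {
		if !seen[r.Name] {
			d.Deleted = append(d.Deleted, r.Name)
		}
	}
	return d
}

func main() {
	stored := []Repo{{Name: "a"}, {Name: "b", Description: "old"}, {Name: "c"}}
	sourced := []Repo{{Name: "b", Description: "new"}, {Name: "c"}, {Name: "d"}}
	fmt.Printf("%+v\n", diffRepos(stored, sourced))
}
```

If a source returned only a partial listing, the final loop would wrongly mark the missing repositories as deleted, which is why a sync has to list everything.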
Git Update Scheduler
We can't clone all repositories concurrently due to resource constraints in Sourcegraph and on the code host. So repo-updater has an update scheduler. Cloning and fetching are treated in the same way, but priority is given to newly discovered repositories.
The scheduler is divided into two parts:
- updateQueue is a priority queue of repositories to clone/fetch on gitserver.
- schedule places repositories onto the updateQueue when it thinks they should be updated. This is what paces out updates for a repository. It contains heuristics such that recently updated repositories are checked more frequently.
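One way a pacing heuristic of this kind could work, purely as an illustration: shorten the wait before the next check when a fetch found changes, and lengthen it when nothing changed. The function nextInterval, the halving and doubling rule, and both bounds are assumptions; the document does not state the values the scheduler actually uses.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical bounds; the real scheduler's values are not given in this document.
const (
	minInterval = 45 * time.Second
	maxInterval = 8 * time.Hour
)

// nextInterval halves the wait after a fetch that found changes and doubles it
// otherwise, clamping the result to [minInterval, maxInterval].
func nextInterval(current time.Duration, changed bool) time.Duration {
	next := current * 2
	if changed {
		next = current / 2
	}
	if next < minInterval {
		return minInterval
	}
	if next > maxInterval {
		return maxInterval
	}
	return next
}

func main() {
	interval := time.Hour
	// Two fetches that found changes, followed by one that did not.
	for _, changed := range []bool{true, true, false} {
		interval = nextInterval(interval, changed)
		fmt.Println(interval)
	}
}
```

Under this rule an active repository converges towards the lower bound while a dormant one drifts towards the upper bound, which keeps load on the code host roughly proportional to activity.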
Repositories can also be placed onto the updateQueue if we receive a webhook indicating the repository has changed. (By default, we don't set up webhooks when integrating with a code host.) When a user directly visits a repository on Sourcegraph, we also enqueue it for update.
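The queue behaviour described above can be sketched with container/heap. This is a simplified model with two hypothetical priority levels, where a webhook or user visit either inserts a repository or raises the priority of one that is already queued; the field names and priority levels are assumptions.

```go
package main

import (
	"container/heap"
	"fmt"
)

type priority int

const (
	priorityLow  priority = iota // routine scheduled update
	priorityHigh                 // new repository, webhook, or user visit
)

type item struct {
	repo     string
	priority priority
	seq      uint64 // insertion order; makes equal priorities first in, first out
	index    int    // position in the heap, maintained by Swap and Push
}

// updateQueue is a hypothetical priority queue keyed by repository name.
type updateQueue struct {
	items  []*item
	byRepo map[string]*item
	seq    uint64
}

func newUpdateQueue() *updateQueue {
	return &updateQueue{byRepo: make(map[string]*item)}
}

// The five methods below implement heap.Interface.
func (q *updateQueue) Len() int { return len(q.items) }
func (q *updateQueue) Less(i, j int) bool {
	a, b := q.items[i], q.items[j]
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.seq < b.seq
}
func (q *updateQueue) Swap(i, j int) {
	q.items[i], q.items[j] = q.items[j], q.items[i]
	q.items[i].index, q.items[j].index = i, j
}
func (q *updateQueue) Push(x interface{}) {
	it := x.(*item)
	it.index = len(q.items)
	q.items = append(q.items, it)
}
func (q *updateQueue) Pop() interface{} {
	n := len(q.items)
	it := q.items[n-1]
	q.items = q.items[:n-1]
	return it
}

// enqueue adds a repository, or raises its priority if it is already queued.
func (q *updateQueue) enqueue(repo string, p priority) {
	if it, ok := q.byRepo[repo]; ok {
		if p > it.priority {
			it.priority = p
			heap.Fix(q, it.index)
		}
		return
	}
	q.seq++
	it := &item{repo: repo, priority: p, seq: q.seq}
	q.byRepo[repo] = it
	heap.Push(q, it)
}

// dequeue removes and returns the most urgent repository, or "" if the queue is empty.
func (q *updateQueue) dequeue() string {
	if q.Len() == 0 {
		return ""
	}
	it := heap.Pop(q).(*item)
	delete(q.byRepo, it.repo)
	return it.repo
}

func main() {
	q := newUpdateQueue()
	q.enqueue("github.com/foo/old", priorityLow)
	q.enqueue("github.com/foo/visited", priorityLow)
	q.enqueue("github.com/foo/visited", priorityHigh) // e.g. a user opened the repository
	for repo := q.dequeue(); repo != ""; repo = q.dequeue() {
		fmt.Println(repo)
	}
}
```

Keeping a map alongside the heap means a repository is queued at most once, so repeated webhooks for the same repository do not trigger repeated fetches.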
The update scheduler has conf.GitMaxConcurrentClones workers processing the updateQueue and issuing git clone/fetch commands.
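A bounded worker pool of this shape might look like the sketch below. The function processQueue and its update callback are hypothetical; update stands in for asking gitserver to clone or fetch, and maxConcurrent plays the role of conf.GitMaxConcurrentClones. A slice fed through a channel replaces the real priority queue to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// processQueue runs at most maxConcurrent updates at a time and returns how
// many succeeded. update is a hypothetical stand-in for asking gitserver to
// clone or fetch a repository.
func processQueue(repos []string, maxConcurrent int, update func(repo string) error) int {
	if maxConcurrent < 1 {
		maxConcurrent = 1 // with zero workers the sends below would block forever
	}

	queue := make(chan string)
	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		succeeded int
	)

	for i := 0; i < maxConcurrent; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for repo := range queue {
				if err := update(repo); err != nil {
					continue // a real scheduler would log the failure and reschedule
				}
				mu.Lock()
				succeeded++
				mu.Unlock()
			}
		}()
	}

	for _, r := range repos {
		queue <- r // blocks until a worker is free, which bounds concurrency
	}
	close(queue)
	wg.Wait()
	return succeeded
}

func main() {
	repos := []string{"github.com/foo/bar", "github.com/foo/baz", "github.com/foo/qux"}
	n := processQueue(repos, 2, func(repo string) error {
		fmt.Println("updating", repo)
		return nil
	})
	fmt.Println("updated", n, "repositories")
}
```

The fixed number of workers is what caps the load placed on both gitserver and the code host, regardless of how many repositories are waiting.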