How LSIF indexes are processed

An LSIF indexer produces a file containing the definition, reference, hover, and diagnostic data for a project. Users upload this index file to a Sourcegraph instance, which converts it into an internal format that can support code intelligence queries.

The sequence of actions required to upload and convert this data is shown below.

Uploading

The API used to upload an LSIF index is modeled after the S3 multipart upload API. Many LSIF uploads can be fairly large, and the network is generally not reliable. To cope with the frequent failure of large uploads (and to stay within upload size limits in Cloudflare), the upload is broken into multiple, independently gzipped chunks. Each chunk is uploaded in sequence to the instance, where it is concatenated into a single file on the remote end. This allows us to retry an individual chunk after an upload failure without redoing the entire operation.
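
As a rough illustration, the client side of this scheme might look like the Go sketch below. The endpoint path, query parameter names, and retry policy are assumptions made for the example, not the actual API surface used by the src CLI.

    package lsifupload

    import (
        "bytes"
        "compress/gzip"
        "fmt"
        "net/http"
        "time"
    )

    // uploadChunk gzips one chunk of the index and uploads it, retrying with
    // backoff on failure. The endpoint and query parameters are illustrative.
    func uploadChunk(baseURL, uploadID string, index int, chunk []byte) error {
        // Each chunk is compressed independently so it can be retried on its own.
        var buf bytes.Buffer
        gz := gzip.NewWriter(&buf)
        if _, err := gz.Write(chunk); err != nil {
            return err
        }
        if err := gz.Close(); err != nil {
            return err
        }

        url := fmt.Sprintf("%s/upload?uploadId=%s&index=%d", baseURL, uploadID, index)
        for attempt := 0; attempt < 3; attempt++ {
            resp, err := http.Post(url, "application/gzip", bytes.NewReader(buf.Bytes()))
            if err == nil && resp.StatusCode == http.StatusOK {
                resp.Body.Close()
                return nil // this chunk is durable; earlier chunks never need resending
            }
            if resp != nil {
                resp.Body.Close()
            }
            time.Sleep(time.Second << attempt) // simple exponential backoff
        }
        return fmt.Errorf("chunk %d failed after retries", index)
    }

Because the server acknowledges each chunk as it lands on disk, a failure near the end of a large upload only costs one chunk's worth of retransmission.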

An initial request adds an upload to the database in the uploading state and records the number of upload chunks it expects to receive. Each subsequent request specifies the upload identifier (returned by the initial request) and the index of the chunk being uploaded. If the chunk successfully makes it to disk, it is marked as received in the upload record. A final request from the client marks the upload as complete. At this point, the frontend ensures that all of the expected chunks have been received and reside on disk, instructs the blob storage server to concatenate them into a single file, and moves the upload record from the uploading state to the queued state, where it becomes visible to the worker process.
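
A minimal sketch of how the frontend might handle that final request, assuming hypothetical Store and BlobStore interfaces in place of the real database and blob storage clients:

    package lsifupload

    import (
        "context"
        "fmt"
    )

    // Upload states as stored in the database (names taken from the prose above).
    const (
        StateUploading = "uploading"
        StateQueued    = "queued"
    )

    // Store and BlobStore are hypothetical interfaces standing in for the
    // frontend's database layer and the blob storage client.
    type Store interface {
        NumReceivedParts(ctx context.Context, uploadID int) (received, expected int, err error)
        SetState(ctx context.Context, uploadID int, state string) error
    }

    type BlobStore interface {
        ConcatenateParts(ctx context.Context, uploadID int) error
    }

    // CompleteUpload handles the client's final request: it verifies that every
    // expected chunk is on disk, asks the blob storage server to stitch the parts
    // together, and makes the upload visible to the worker by moving it from the
    // uploading state to the queued state.
    func CompleteUpload(ctx context.Context, db Store, blobs BlobStore, uploadID int) error {
        received, expected, err := db.NumReceivedParts(ctx, uploadID)
        if err != nil {
            return err
        }
        if received != expected {
            return fmt.Errorf("upload %d: only %d of %d parts received", uploadID, received, expected)
        }
        if err := blobs.ConcatenateParts(ctx, uploadID); err != nil {
            return err
        }
        return db.SetState(ctx, uploadID, StateQueued)
    }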

Processing

The worker process polls Postgres for upload records in the queued state. When such a record is available, it is marked as processing and locked in a transaction to ensure that it is not double-processed by another worker instance. The worker then asks the blob storage server for the raw LSIF upload data. Because this data is generally large, it is streamed to the worker while it is being processed (and, on transient failures, retry logic inside the client resumes the request from the last byte it received).
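
The claim-and-lock step can be pictured as a single transaction, sketched below with an illustrative table name. SKIP LOCKED keeps concurrent workers from blocking on a row another worker has already claimed.

    package lsifworker

    import (
        "context"
        "database/sql"
    )

    // dequeue attempts to claim a single queued upload. The row stays locked for
    // the lifetime of the returned transaction so another worker cannot claim it.
    // Table and column names are illustrative.
    func dequeue(ctx context.Context, db *sql.DB) (*sql.Tx, int, error) {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return nil, 0, err
        }

        var id int
        err = tx.QueryRowContext(ctx, `
            UPDATE lsif_uploads
            SET state = 'processing', started_at = now()
            WHERE id = (
                SELECT id FROM lsif_uploads
                WHERE state = 'queued'
                ORDER BY uploaded_at
                LIMIT 1
                FOR UPDATE SKIP LOCKED
            )
            RETURNING id
        `).Scan(&id)
        if err != nil {
            tx.Rollback()
            return nil, 0, err // sql.ErrNoRows simply means nothing is queued right now
        }
        return tx, id, nil
    }

The transaction is held open for the duration of processing, so the lock persists until the record reaches a terminal state (see the final step below).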

The worker then converts the raw LSIF data into a re-indexed internal representation, which is inserted into the codeintel database. The conversion proceeds in the following steps (a simplified sketch follows the list):

  • The correlateFromReader step streams raw LSIF data from the blob storage server and produces a stream of JSON objects. Each object in the stream is interpreted as an LSIF vertex or edge. Objects are validated, then inserted into an in-memory representation of the graph.
  • The canonicalize step collapses the in-memory representation of the graph produced by the previous step. Most notably, it ensures that data attached to a range vertex only transitively is now attached to the range vertex directly.
  • The prune step determines the set of documents that are present in the index but do not exist in git (via an efficient batch of calls to gitserver) and removes references to them from the in-memory representation of the graph. This prevents us from attempting to navigate to locations that are not visible within the instance (generated or vendored paths that are not committed).
  • The groupBundleData step converts the canonicalized and pruned in-memory representation of the graph into the shape that will reside in the database. This rotates the data so that it can be read efficiently based on our query access patterns.
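
The Go sketch below is a heavily reduced stand-in for this pipeline: the function names mirror the step names above, the Element type shows the minimal shape shared by every line of an LSIF dump, and the interesting bodies are stubbed out.

    package lsifworker

    import (
        "bufio"
        "encoding/json"
        "io"
    )

    // Element is the minimal shape shared by every line of an LSIF dump, e.g.
    //   {"id":4,"type":"vertex","label":"range","start":{...},"end":{...}}
    //   {"id":9,"type":"edge","label":"next","outV":4,"inV":7}
    type Element struct {
        ID    int    `json:"id"`
        Type  string `json:"type"` // "vertex" or "edge"
        Label string `json:"label"`
    }

    // State stands in for the in-memory representation of the correlated graph;
    // BundleData stands in for the rotated, query-oriented shape that is written
    // to the codeintel database.
    type State struct {
        Vertices map[int]Element
        Edges    map[int]Element
    }

    type BundleData struct{} // documents, result chunks, definitions, references, ...

    // correlateFromReader streams newline-delimited JSON objects from the blob
    // storage server and folds each validated vertex or edge into the graph.
    func correlateFromReader(r io.Reader) (*State, error) {
        state := &State{Vertices: map[int]Element{}, Edges: map[int]Element{}}
        scanner := bufio.NewScanner(r)
        scanner.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024) // LSIF lines can be long
        for scanner.Scan() {
            var el Element
            if err := json.Unmarshal(scanner.Bytes(), &el); err != nil {
                return nil, err
            }
            switch el.Type {
            case "vertex":
                state.Vertices[el.ID] = el
            case "edge":
                state.Edges[el.ID] = el
            }
        }
        return state, scanner.Err()
    }

    // The remaining stages are stubbed here; their real implementations carry the
    // logic described in the list above.
    func canonicalize(s *State)                         {} // collapse transitive data onto ranges
    func prune(s *State, existsInGit func(string) bool) {} // drop documents missing from git
    func groupBundleData(s *State) *BundleData          { return &BundleData{} }

    // convert wires the stages together in the order the worker runs them.
    func convert(r io.Reader, existsInGit func(string) bool) (*BundleData, error) {
        state, err := correlateFromReader(r)
        if err != nil {
            return nil, err
        }
        canonicalize(state)
        prune(state, existsInGit)
        return groupBundleData(state), nil
    }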

This process also produces a set of packages that the indexed source code defines and a set of packages that the indexed source code depends on. Both sets are inserted into the frontend (metadata) database to enable cross-repository definition and reference queries, and they are constructed by reading the package information attached to export and import monikers, respectively, in the correlated data.
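
For illustration, the two package sets could be derived from the correlated monikers roughly as follows; the Package and Moniker types here are simplified stand-ins for the real data structures.

    package lsifworker

    // Package identifies a package by manager scheme (e.g. "npm", "gomod"),
    // name, and version, as carried by LSIF packageInformation vertices.
    type Package struct {
        Scheme  string
        Name    string
        Version string
    }

    // Moniker is a simplified view of an LSIF moniker vertex together with any
    // package information attached to it.
    type Moniker struct {
        Kind        string // "import" or "export"
        PackageInfo *Package
    }

    // extractPackages splits the correlated monikers into the packages the index
    // defines (export monikers) and the packages it depends on (import monikers).
    func extractPackages(monikers []Moniker) (defined, dependencies []Package) {
        type key struct {
            kind string
            pkg  Package
        }
        seen := map[key]bool{}
        for _, m := range monikers {
            if m.PackageInfo == nil {
                continue
            }
            k := key{m.Kind, *m.PackageInfo}
            if seen[k] {
                continue
            }
            seen[k] = true
            switch m.Kind {
            case "export":
                defined = append(defined, *m.PackageInfo)
            case "import":
                dependencies = append(dependencies, *m.PackageInfo)
            }
        }
        return defined, dependencies
    }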

Duplicate uploads (with the same repository, commit, and root) are removed to prevent the frontend from querying multiple indexes for the same data. This can happen if a user re-uploads the same index, or if an index is re-uploaded as part of a CI step that was re-run. In these cases we prefer to keep the newest upload.
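
For illustration only, the deduplication can be expressed as a single DELETE scoped to the same repository, commit, and root; the table and column names below are assumptions.

    package lsifworker

    import (
        "context"
        "database/sql"
    )

    // deleteOverlappingDumps removes older completed uploads that cover the same
    // repository, commit, and root as the upload currently being processed, so
    // queries only ever consult the newest copy of that data.
    func deleteOverlappingDumps(ctx context.Context, tx *sql.Tx, repositoryID int, commit, root string, keepID int) error {
        _, err := tx.ExecContext(ctx, `
            DELETE FROM lsif_uploads
            WHERE repository_id = $1
              AND commit = $2
              AND root = $3
              AND state = 'completed'
              AND id != $4
        `, repositoryID, commit, root, keepID)
        return err
    }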

The repository is marked as dirty, which signals a periodically running process to recalculate the set of uploads visible to each commit. This process refreshes the commit graph for the repository stored in Postgres.
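
One way to implement the dirty flag is an upsert that bumps a token the periodic updater later compares against the value it last processed; the table and columns below are illustrative.

    package lsifworker

    import (
        "context"
        "database/sql"
    )

    // markRepositoryDirty records that the repository's commit graph data is
    // stale by bumping a token the periodic updater compares against the value
    // it last processed.
    func markRepositoryDirty(ctx context.Context, tx *sql.Tx, repositoryID int) error {
        _, err := tx.ExecContext(ctx, `
            INSERT INTO lsif_dirty_repositories (repository_id, dirty_token, update_token)
            VALUES ($1, 1, 0)
            ON CONFLICT (repository_id) DO UPDATE
            SET dirty_token = lsif_dirty_repositories.dirty_token + 1
        `, repositoryID)
        return err
    }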

Finally, if the previous steps have all completed without error, the transaction is committed, moving the upload record from the processing state to the completed state, where it is made visible to the frontend to answer code intelligence queries. On success, the input file that was processed is deleted from the blob storage server. If an error does occur, the upload record is instead moved to the errored state and marked with a failure reason.
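
A sketch of this finalization step, reusing the illustrative names from the earlier sketches:

    package lsifworker

    import (
        "context"
        "database/sql"
    )

    // blobDeleter stands in for the blob storage client's delete operation.
    type blobDeleter interface {
        DeleteUpload(ctx context.Context, uploadID int) error
    }

    // finalize ends a processing attempt. On success the transaction is committed,
    // which makes the record visible in the completed state, and the raw input is
    // deleted from the blob storage server; on failure the transaction is rolled
    // back and the record is marked errored (that update is elided here).
    func finalize(ctx context.Context, tx *sql.Tx, blobs blobDeleter, uploadID int, processErr error) error {
        if processErr != nil {
            tx.Rollback()
            // Separately: UPDATE lsif_uploads SET state = 'errored', failure_message = ... WHERE id = ...
            return processErr
        }

        if _, err := tx.ExecContext(ctx, `UPDATE lsif_uploads SET state = 'completed' WHERE id = $1`, uploadID); err != nil {
            tx.Rollback()
            return err
        }
        if err := tx.Commit(); err != nil {
            return err
        }
        // The raw upload is no longer needed once the converted data is committed.
        return blobs.DeleteUpload(ctx, uploadID)
    }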

Code appendix