Structural search
Structural search lets you match richer syntax patterns specifically in code and structured data formats like JSON. It can be awkward or difficult to match code blocks or nested expressions with regular expressions. To meet this challenge we've introduced a new and easier way to search code that operates more closely on a program's parse tree. We use Comby syntax for structural matching. Below you'll find examples and notes for this language-aware search functionality.
Example
The fmt.Sprintf function is a popular print function in Go. Here is a pattern that matches all the arguments in fmt.Sprintf calls in our code:
fmt.Sprintf(...)
See it live on Sourcegraph's code ↗
The ... part is special syntax that matches all characters inside the balanced parentheses (...). Let's look at two interesting variants of matches in our codebase. Here's one:
fmt.Sprintf("must be authenticated as an admin (%s)", isSiteAdminErr.Error())
Note that to match this code we didn't have to do any special thinking about handling the parentheses (%s) that happen inside the first string argument, or the nested parentheses that form part of Error(). Unlike regular expressions, no "overmatching" can happen and the match will always respect balanced parentheses. With regular expressions, taking care to match the closing parentheses for this call could, in general, really complicate matters.
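To make the contrast concrete, the sketch below runs a lazy regular expression and a small balanced-parenthesis scanner over the call above. The helper matchCallArgs is hypothetical and only illustrates the idea. It is not how Comby or Sourcegraph implement matching, and it assumes double-quoted strings and round parentheses only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchCallArgs is a hypothetical helper: it returns the text between the
// parentheses that follow the first occurrence of prefix (which must end in
// "("), honoring nested parentheses and double-quoted string literals.
// ok is false when no balanced closing parenthesis is found.
func matchCallArgs(src, prefix string) (args string, ok bool) {
	i := strings.Index(src, prefix)
	if i < 0 {
		return "", false
	}
	begin := i + len(prefix) // first byte after the opening parenthesis
	depth := 1
	inString := false
	for j := begin; j < len(src); j++ {
		c := src[j]
		switch {
		case inString:
			if c == '\\' {
				j++ // skip the escaped byte
			} else if c == '"' {
				inString = false
			}
		case c == '"':
			inString = true
		case c == '(':
			depth++
		case c == ')':
			depth--
			if depth == 0 {
				return src[begin:j], true
			}
		}
	}
	return "", false
}

func main() {
	src := `fmt.Sprintf("must be authenticated as an admin (%s)", isSiteAdminErr.Error())`

	// A lazy regular expression stops at the first ")", which sits inside
	// the string literal.
	naive := regexp.MustCompile(`fmt\.Sprintf\((.*?)\)`)
	fmt.Printf("regexp:   %s\n", naive.FindStringSubmatch(src)[1])

	// The scanner skips the string and the nested Error() call.
	args, _ := matchCallArgs(src, "fmt.Sprintf(")
	fmt.Printf("balanced: %s\n", args)
}
```

The regular expression captures only `"must be authenticated as an admin (%s`, while the scanner returns both arguments in full.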
Here is a second match:
fmt.Sprintf(
	"rest/api/1.0/projects/%s/repos/%s/pull-requests/%d",
	pr.ToRef.Repository.Project.Key,
	pr.ToRef.Repository.Slug,
	pr.ID,
)
Here we didn't have to do any special thinking about matching contents that spread over multiple lines. The ... syntax by default matches across newlines.
Structural search supports various balanced syntax like (), [], and {} in a language-aware way. This lets you match large, logical blocks or expressions without the limitations of typical line-based regular expression patterns.
Syntax reference
The syntax ... above is an alias for a canonical syntax :[hole], where hole is a descriptive identifier for the matched content. Identifiers are useful when expressing that matched content should be equal (see the return :[v.], :[v.] example below). See additional syntax below.
Syntax | Alias | Description |
---|---|---|
... | :[hole] :[_] | match zero or more characters in a lazy fashion. When :[hole] is inside delimiters, as in {:[h1], :[h2]} or (:[h]), holes match within that group or code block, including newlines. |
:[~regexp] | :[hole~regexp] | match an arbitrary regular expression regexp. A descriptive identifier like hole is optional. Avoid regular expressions that match special syntax like ) or .*, otherwise your pattern may fail to match balanced blocks. |
:[[_]] :[[hole]] | :[~\w+] :[hole~\w+] | match one or more alphanumeric characters and underscore. |
:[hole\n] | :[~.*\n] :[hole~.*\n] | match zero or more characters up to a newline, including the newline. |
:[ ] :[ hole] | :[~[ \t]+] :[hole~[ \t]+] | match only whitespace characters, excluding newlines. |
:[hole.] | [_.] | match one or more alphanumeric characters and punctuation like ., ;, and - that do not affect balanced syntax. Language dependent. |
Note: to match the string ... literally, use regular expression patterns like :[~[.]{3}] or :[~\.\.\.].
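To get a feel for what the regular-expression aliases in the table above accept, the following sketch runs three of them through Go's regexp package on a sample line. Only the expressions themselves come from the table. The helper firstMatch and the sample input are made up for illustration, and Go's regexp dialect may differ in details from the one structural search uses.

```go
package main

import (
	"fmt"
	"regexp"
)

// alias pairs a hole from the syntax table with the regular expression
// listed as its alias.
type alias struct{ hole, re string }

var aliases = []alias{
	{`:[[hole]]`, `\w+`},   // alphanumeric characters and underscore
	{`:[hole\n]`, `.*\n`},  // up to and including a newline
	{`:[ hole]`, `[ \t]+`}, // spaces and tabs, but not newlines
}

// firstMatch is a hypothetical helper that returns the first substring of
// input matched by the regular expression pattern.
func firstMatch(pattern, input string) string {
	return regexp.MustCompile(pattern).FindString(input)
}

func main() {
	input := "foo_bar1 := baz\t \nnext"
	for _, a := range aliases {
		fmt.Printf("%-10s %-8s %q\n", a.hole, a.re, firstMatch(a.re, input))
	}
}
```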
Rules. Comby supports rules to express equality constraints or pattern-based matching. Comby rules are not officially supported in Sourcegraph yet. We are in the process of making that happen and are taking care to address stable performance and usability. That said, you can explore rule functionality with an experimental rule: parameter. For example:
buildSearchURLQuery(:[first], ...) rule:'where match :[first] { | " query: string" -> true }'
More examples
Below you'll find more examples. Also see our blog post for additional examples.
Match stringy data
Taking the original fmt.Sprintf(...) example, let's modify the original pattern slightly to match only if the first argument is a string. We do this by adding string quotes around ..., as in "...". Adding quotes communicates structural context and changes how the hole behaves: it will match the contents of a single string delimited by ". It won't match multiple strings like "foo", "bar".
fmt.Sprintf("...", ...)
See it live on Sourcegraph's code ↗
Some matched examples are:
fmt.Sprintf("external service not found: %v", e.id)
fmt.Sprintf("%s/campaigns/%s", externalURL, string(campaignID))
Holes stop matching at the first fragment of syntax that comes after them, similar to lazy regular expression matching. So, we could write:
fmt.Sprintf(:[first], :[second], ...)
to match all calls with three or more arguments, matching the first and second arguments based on the contextual position around the commas.
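One way to picture this positional binding is to split an argument list at its top-level commas, that is, commas that are not nested inside brackets or a string. The sketch below does that for one of the matched calls shown earlier. splitTopLevel is a hypothetical helper written for this illustration, not part of Comby or Sourcegraph, and it assumes double-quoted strings only.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel is a hypothetical helper: it splits an argument list at
// commas that are not nested inside (), [], {} or a double-quoted string.
func splitTopLevel(args string) []string {
	var parts []string
	depth, start := 0, 0
	inString := false
	for i := 0; i < len(args); i++ {
		c := args[i]
		switch {
		case inString:
			if c == '\\' {
				i++ // skip the escaped byte
			} else if c == '"' {
				inString = false
			}
		case c == '"':
			inString = true
		case c == '(' || c == '[' || c == '{':
			depth++
		case c == ')' || c == ']' || c == '}':
			depth--
		case c == ',' && depth == 0:
			parts = append(parts, strings.TrimSpace(args[start:i]))
			start = i + 1
		}
	}
	return append(parts, strings.TrimSpace(args[start:]))
}

func main() {
	// Arguments of: fmt.Sprintf("%s/campaigns/%s", externalURL, string(campaignID))
	parts := splitTopLevel(`"%s/campaigns/%s", externalURL, string(campaignID)`)
	fmt.Println("first: ", parts[0])                       // what :[first] would hold
	fmt.Println("second:", parts[1])                       // what :[second] would hold
	fmt.Println("rest:  ", strings.Join(parts[2:], ", ")) // what ... would hold
}
```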
Match equivalent expressions
Using the same identifier in multiple holes adds a constraint that both of the matched values must be syntactically equal. So, the pattern:
return :[v.], :[v.]
will match code where a pair of identifier-like syntax in the return statement are the same. For example: return true, true; return nil, nil; or return 0, 0.
See it live on Sourcegraph's code ↗
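For comparison, here is a rough sketch of the same check done by hand in Go. Go's regexp package does not support backreferences, so the two operands are captured separately and compared in code. The regular expression and the helper equalReturns are assumptions made for this example. They only approximate the :[v.] hole and will misjudge operands that continue past an identifier, such as x+1.

```go
package main

import (
	"fmt"
	"regexp"
)

// returnPair captures two comma-separated, identifier-like operands of a
// return statement. Equality cannot be expressed in the pattern itself, so
// it is checked afterwards.
var returnPair = regexp.MustCompile(`return\s+([\w.]+),\s*([\w.]+)`)

// equalReturns is a hypothetical helper that reports the return statements
// in src whose two captured operands are textually identical.
func equalReturns(src string) []string {
	var out []string
	for _, m := range returnPair.FindAllStringSubmatch(src, -1) {
		if m[1] == m[2] {
			out = append(out, m[0])
		}
	}
	return out
}

func main() {
	src := "return nil, err\nreturn nil, nil\nreturn 0, 0\nreturn true, false\n"
	for _, s := range equalReturns(src) {
		fmt.Println(s)
	}
}
```

With the sample input, only return nil, nil and return 0, 0 are reported.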
Match JSON
Structural search also works on structured data, like JSON. Use patterns to declaratively describe pieces of data to match. For example, the pattern:
"exclude": [...]
matches all parts of a JSON document that have a member "exclude" where the value is an array of items.
See it live on Sourcegraph's code ↗
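To show the shape of data this pattern describes, the sketch below decodes a JSON document with Go's encoding/json and collects every "exclude" member whose value is an array. Structural search itself matches the source text rather than a decoded document, so this is only an analogy. The helper findArrayMembers and the sample document are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// findArrayMembers is a hypothetical helper: it walks a decoded JSON value
// and collects every array that is the value of an object member named key,
// at any nesting depth.
func findArrayMembers(v interface{}, key string) [][]interface{} {
	var found [][]interface{}
	switch node := v.(type) {
	case map[string]interface{}:
		for k, child := range node {
			if arr, ok := child.([]interface{}); ok && k == key {
				found = append(found, arr)
			}
			found = append(found, findArrayMembers(child, key)...)
		}
	case []interface{}:
		for _, child := range node {
			found = append(found, findArrayMembers(child, key)...)
		}
	}
	return found
}

func main() {
	doc := []byte(`{
		"compilerOptions": {"strict": true},
		"exclude": ["node_modules", "dist"],
		"nested": {"exclude": "not-an-array"}
	}`)
	var v interface{}
	if err := json.Unmarshal(doc, &v); err != nil {
		panic(err)
	}
	// Only the top-level member qualifies; the nested one is a string.
	for _, arr := range findArrayMembers(v, "exclude") {
		fmt.Println(arr)
	}
}
```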
Current functionality and configuration
Structural search behaves differently to plain text search in key ways. We are continually improving functionality of this new feature, so please note the following:
- Only indexed repos. Structural search can currently only be performed on indexed repositories. See configuration for more details if you host your own Sourcegraph installation. Our service hosted at sourcegraph.com indexes approximately 200,000 of the most popular repositories on GitHub. Other repositories are currently unsupported. To see whether a repository on your instance is indexed, visit https://<sourcegraph-host>.com/repo-org/repo-name/-/settings/index.
- The lang keyword is semantically significant. Adding the lang keyword informs the parser about language-specific syntax for comments, strings, and code. This makes structural search more accurate for that language. For example, fmt.Sprintf(...) lang:go. If lang is omitted, we make a best-effort attempt to infer the language based on matching file extensions, or fall back to a generic structural matcher.
- Saved searches are not supported. It is not currently possible to save structural searches.
- Matching blocks in indentation-sensitive languages. It's not currently possible to match blocks of code that are indentation-sensitive. This is a feature planned for future work.