Depsets

Depsets are a specialized data structure for efficiently collecting data across a target’s transitive dependencies. Since this use case concerns the analysis phase, depsets are useful for authors of rules and aspects, but probably not macros.

The defining feature of depset is its time- and space-efficient union operation. The depset constructor accepts a list of elements (“direct”) and a list of other depsets (“transitive”), and returns a depset representing a set containing all the direct elements and the union of all the transitive sets. Conceptually, the constructor creates a new graph node that has the direct and transitive nodes as its successors. Depsets have a well-defined ordering semantics, based on traversal of this graph.

Example uses of depsets include:

  • storing the paths of all object files for a program’s libraries, which can then be passed to a linker action

  • for an interpreted language, storing the transitive source files that will be included in an executable’s runfiles

If you don’t need the union operation, consider using another type, such as list or dict.

Full example

This example is available at https://github.com/bazelbuild/examples/tree/master/rules/depsets.

Suppose we have a hypothetical interpreted language Foo. In order to build each foo_binary we need to know all the *.foo files that it directly or indirectly depends on.

# //depsets:BUILD

load(":foo.bzl", "foo_library", "foo_binary")

# Our hypothetical Foo compiler.
py_binary(
    name = "foocc",
    srcs = ["foocc.py"],
)

foo_library(
    name = "a",
    srcs = ["a.foo", "a_impl.foo"],
)

foo_library(
    name = "b",
    srcs = ["b.foo", "b_impl.foo"],
    deps = [":a"],
)

foo_library(
    name = "c",
    srcs = ["c.foo", "c_impl.foo"],
    deps = [":a"],
)

foo_binary(
    name = "d",
    srcs = ["d.foo"],
    deps = [":b", ":c"],
)
# //depsets:foocc.py

# "Foo compiler" that just concatenates its inputs to form its output.
import sys

if __name__ == "__main__":
  assert len(sys.argv) >= 1
  output = open(sys.argv[1], "wt")
  for path in sys.argv[2:]:
    input = open(path, "rt")
    output.write(input.read())

Here, the transitive sources of the binary d are all of the *.foo files in the srcs fields of a, b, c, and d. In order for the foo_binary target to know about any file besides d.foo, the foo_library targets need to pass them along in a provider. Each library receives the providers from its own dependencies, adds its own immediate sources, and passes on a new provider with the augmented contents. The foo_binary rule does the same, except that instead of returning a provider, it uses the complete list of sources to construct a command line for an action.

Here’s a complete implementation of the foo_library and foo_binary rules.

# //depsets/foo.bzl

# A provider with one field, transitive_sources.
FooFiles = provider(fields = ["transitive_sources"])

def get_transitive_srcs(srcs, deps):
  """Obtain the source files for a target and its transitive dependencies.

  Args:
    srcs: a list of source files
    deps: a list of targets that are direct dependencies
  Returns:
    a collection of the transitive sources
  """
  return depset(
        srcs,
        transitive = [dep[FooFiles].transitive_sources for dep in deps])

def _foo_library_impl(ctx):
  trans_srcs = get_transitive_srcs(ctx.files.srcs, ctx.attr.deps)
  return [FooFiles(transitive_sources=trans_srcs)]

foo_library = rule(
    implementation = _foo_library_impl,
    attrs = {
        "srcs": attr.label_list(allow_files=True),
        "deps": attr.label_list(),
    },
)

def _foo_binary_impl(ctx):
  foocc = ctx.executable._foocc
  out = ctx.outputs.out
  trans_srcs = get_transitive_srcs(ctx.files.srcs, ctx.attr.deps)
  srcs_list = trans_srcs.to_list()
  ctx.actions.run(executable = foocc,
                  arguments = [out.path] + [src.path for src in srcs_list],
                  inputs = srcs_list + [foocc],
                  outputs = [out])

foo_binary = rule(
    implementation = _foo_binary_impl,
    attrs = {
        "srcs": attr.label_list(allow_files=True),
        "deps": attr.label_list(),
        "_foocc": attr.label(default=Label("//depsets:foocc"),
                             allow_files=True, executable=True, cfg="host")
    },
    outputs = {"out": "%{name}.out"},
)

You can test this by copying these files into a fresh package, renaming the labels appropriately, creating the source *.foo files with dummy content, and building the d target.

Description and operations

Conceptually, a depset is a directed acyclic graph (DAG) that typically looks similar to the target graph. It is constructed from the leaves up to the root. Each target in a dependency chain can add its own contents on top of the previous without having to read or copy them.

Each node in the DAG holds a list of direct elements and a list of child nodes. The contents of the depset are the transitive elements, i.e. the direct elements of all the nodes. A new depset can be created using the depset constructor: it accepts a list of direct elements and another list of child nodes.

s = depset(["a", "b", "c"])
t = depset(["d", "e"], transitive = [s])

print(s)    # depset(["a", "b", "c"])
print(t)    # depset(["d", "e", "a", "b", "c"])

To retrieve the contents of a depset, use the to_list() method. It returns a list of all transitive elements, not including duplicates. There is no way to directly inspect the precise structure of the DAG, although this structure does affect the order in which the elements are returned.

s = depset(["a", "b", "c"])

print("c" in s.to_list())              # True
print(s.to_list() == ["a", "b", "c"])  # True

The allowed items in a depset are restricted, just as the allowed keys in dictionaries are restricted. In particular, depset contents may not be mutable.

Depsets use reference equality: a depset is equal to itself, but unequal to any other depset, even if they have the same contents and same internal structure.

s = depset(["a", "b", "c"])
t = s
print(s == t)  # True

t = depset(["a", "b", "c"])
print(s == t)  # False

d = {}
d[s] = None
d[t] = None
print(len(d))  # 2

To compare depsets by their contents, convert them to sorted lists.

s = depset(["a", "b", "c"])
t = depset(["c", "b", "a"])
print(sorted(s.to_list()) == sorted(t.to_list()))  # True

There is no ability to remove elements from a depset. If this is needed, you must read out the entire contents of the depset, filter the elements you want to remove, and reconstruct a new depset. This is not particularly efficient.

s = depset(["a", "b", "c"])
t = depset(["b", "c"])

# Compute set difference s - t. Precompute t.to_list() so it's not done
# in a loop, and convert it to a dictionary for fast membership tests.
t_items = {e: None for e in t.to_list()}
diff_items = [x for x in s.to_list() if x not in t_items]
# Convert back to depset if it's still going to be used for union operations.
s = depset(diff_items)
print(s)  # depset(["a"])

Order

The to_list operation performs a traversal over the DAG. The kind of traversal depends on the order that was specified at the time the depset was constructed. It is useful for Bazel to support multiple orders because sometimes tools care about the order of their inputs. For example, a linker action may need to ensure that if B depends on A, then A.o comes before B.o on the linker’s command line. Other tools might have the opposite requirement.

Three traversal orders are supported: postorder, preorder, and topological. The first two work exactly like tree traversals except that they operate on DAGs and skip already visited nodes. The third order works as a topological sort from root to leaves, essentially the same as preorder except that shared children are listed only after all of their parents. Preorder and postorder operate as left-to-right traversals, but note that within each node direct elements have no order relative to children. For topological order, there is no left-to-right guarantee, and even the all-parents-before-child guarantee does not apply in the case that there are duplicate elements in different nodes of the DAG.

# This demonstrates different traversal orders.

def create(order):
  cd = depset(["c", "d"], order = order)
  gh = depset(["g", "h"], order = order)
  return depset(["a", "b", "e", "f"], transitive = [cd, gh], order = order)

print(create("postorder").to_list())  # ["c", "d", "g", "h", "a", "b", "e", "f"]
print(create("preorder").to_list())   # ["a", "b", "e", "f", "c", "d", "g", "h"]
# This demonstrates different orders on a diamond graph.

def create(order):
  a = depset(["a"], order=order)
  b = depset(["b"], transitive = [a], order = order)
  c = depset(["c"], transitive = [a], order = order)
  d = depset(["d"], transitive = [b, c], order = order)
  return d

print(create("postorder").to_list())    # ["a", "b", "c", "d"]
print(create("preorder").to_list())     # ["d", "b", "a", "c"]
print(create("topological").to_list())  # ["d", "b", "c", "a"]

Due to how traversals are implemented, the order must be specified at the time the depset is created with the constructor’s order keyword argument. If this argument is omitted, the depset has the special default order, in which case there are no guarantees about the order of any of its elements (except that it is deterministic).

Performance

To see the motivation for using depsets, consider what would have happened if we had implemented get_transitive_srcs() without them. A naive way of writing this function would be to collect the sources in a list.

def get_transitive_srcs(srcs, deps):
  trans_srcs = []
  for dep in deps:
    trans_srcs += dep[FooFiles].transitive_sources
  trans_srcs += srcs
  return trans_srcs

However, this does not take into account duplicates, so the source files for a will appear twice on the command line and twice in the contents of the output file.

The next alternative is using a general set, which can be simulated by a dictionary where the keys are the elements and all the keys map to True.

def get_transitive_srcs(srcs, deps):
  trans_srcs = {}
  for dep in deps:
    for file in dep[FooFiles].transitive_sources:
      trans_srcs[file] = True
  for file in srcs:
    trans_srcs[file] = True
  return trans_srcs

This gets rid of the duplicates, but it makes the order of the command line arguments (and therefore the contents of the files) unspecified, although still deterministic.

Moreover, both this approach and the list-based one are asymptotically worse than the depset-based approach. Consider the case where there is a long chain of dependencies on Foo libraries. Processing every rule requires copying all of the transitive sources that came before it into a new data structure. This means that the time and space cost for analyzing an individual library or binary target is proportional to its own height in the chain. For a chain of length n, foolib_1 ← foolib_2 ← … ← foolib_n, the overall cost is effectively the triangle sum 1 + 2 + … + n, which is O(n^2). This cost is wasteful because the library rule’s behavior is not actually affected by the transitive sources.

Generally speaking, depsets should be used whenever you are accumulating more and more information through your transitive dependencies. This helps ensure that your build scales well as your target graph grows deeper. The exact advantage will depend on how deep the target graph is and how many elements per target are added.

To actually get the performance advantage, it’s important to not retrieve the contents of the depset unnecessarily in library rules. One call to to_list() at the end in a binary rule is fine, since the overall cost is just O(n). It’s when many non-terminal targets try to call to_list() that we start to get into quadratic behavior.

The performance page also contains information about using depsets efficiently.