Neat Little Tricks by Anup Bishnoi

Why is Bazel so hard to understand and use?

Essentially, Bazel is hard because it generates outputs in a parallel file system hierarchy, instead of putting outputs in the same folder as source code.

How does generating outputs in a parallel file hierarchy make things hard?

Well, build tools process files, and they expect to execute in the folder where source files are present.

While they don’t care whether the input files are hand-written source files or generated outputs themselves, they do often assume all input files to be present together in the source tree.

Bazel breaks this assumption hard.

How do you mean?

Well, say, you use Webpack to bundle JavaScript files. You will probably simply provide Webpack a source folder full of source JavaScript files, and Webpack will do its thing.

For library code that cannot be found in the source tree, the language runtime provides a way to use packages from some other place in the file system, like the node_modules folder in Node.js.

Same for other build tools which process files to generate outputs: Give them a source folder, tell them what to do at a high-level, and they do their thing by munching on all those files to create the outputs you want.

The way this works is that your source files can depend on each other by some language-level construct, like import statements, and the build tool can walk that graph to find the files that need to be compiled, bundled or simply processed.
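To make that graph walk concrete, here is a minimal sketch. It assumes a hypothetical in-memory map from file name to source text, a single flat folder, and only simple relative imports; real bundlers use a full parser and module resolver instead of a regular expression.

```typescript
// Hypothetical in-memory "source folder": file name -> file contents.
type SourceTree = Map<string, string>;

// Collect the relative import specifiers in one file (deliberately simplified).
function importsOf(source: string): string[] {
  const pattern = /import\s+[^'"]*['"](\.\/[^'"]+)['"]/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Walk the import graph from an entry file and return every file reached.
function collectInputs(tree: SourceTree, entry: string): string[] {
  const seen = new Set<string>();
  const stack: string[] = [entry];
  while (stack.length > 0) {
    const file = stack.pop() as string;
    if (seen.has(file)) continue;
    const source = tree.get(file);
    if (source === undefined) {
      // This is the failure a tool hits when an input is not co-located.
      throw new Error(`Cannot resolve ${file}: not in the source folder`);
    }
    seen.add(file);
    for (const spec of importsOf(source)) {
      // Assumption: every module lives in the same folder and ends in ".ts".
      stack.push(spec.replace(/^\.\//, "") + ".ts");
    }
  }
  return Array.from(seen).sort();
}
```

Note the error branch: the walk only succeeds when every imported file can be found in the one folder the tool was given.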

For tools that don’t know how to walk the dependency graph for a file, you can specify those dependencies to the build tool in some other way, outside of the source files, as part of a Makefile, for example.

And this works out fine, regardless of whether the input files are hand-written source or generated by a previous build step, because the build tool only cares that all the input files are present in the source folder, not how they arrived there.

How does Bazel break this assumption?

Things get interesting for build systems when a future build step expects the output files generated by a prior build step as input. Let’s call such files “generated inputs”.

Generated inputs: Files generated by an earlier build step, but expected as input by a future one.

Generated input files are nothing new. In a Makefile, for example, the shell commands that act as build tools don’t care whether an input file was hand-written or generated. Each command simply runs on its specified inputs when Make tells it to, and Make takes care not to re-generate an output if the relevant inputs didn’t change.

This is called incremental compilation, and it’s basically “memoization”, but for build steps. Make has been doing it since 1976.
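The memoization analogy can be sketched in a few lines. This is an illustration only: the names are hypothetical, and the cache is keyed on input contents for simplicity, whereas Make itself compares file modification times.

```typescript
// A build step maps the contents of its input files to one output.
type BuildStep = (inputs: string[]) => string;

// Wrap a step so that it only re-runs when its inputs change.
function memoizeStep(step: BuildStep): { run: BuildStep; runs: () => number } {
  const cache = new Map<string, string>();
  let runCount = 0;
  return {
    run(inputs: string[]): string {
      // The JSON-encoded inputs stand in for a content hash.
      const key = JSON.stringify(inputs);
      const cached = cache.get(key);
      if (cached !== undefined) return cached;
      runCount += 1;
      const output = step(inputs);
      cache.set(key, output);
      return output;
    },
    // Expose how many times the underlying step actually executed.
    runs: () => runCount,
  };
}
```

Calling `run` twice with the same inputs executes the underlying step once; changing any input triggers a re-run.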

Bazel tries to one-up the game in one particular aspect. Bazel is not happy with cached incremental compilation (i.e., build memoization) alone. See below.

But generated inputs are a huge pain in Bazel because those files don’t live in the same folder hierarchy as the input source files, but in a completely different parallel folder hierarchy in a completely different part of the file system (inside Bazel’s output cache).

And that makes things hard for build tools which expected all inputs to be co-located. The difficulty is mainly to do with the fact that, in Bazel, input files can either be source files and come from the source tree, or be generated inputs and come from Bazel’s output cache (also called bazel-bin).

Bazel provides helpers that convert the simpler relative paths of generated inputs into absolute file system paths. That lets the build tool consuming a generated input refer to it accurately, instead of assuming it is co-located with the source inputs. But that doesn’t always work so well either.
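The two-location lookup can be sketched as follows. The function name, the set-based file listings, and the `binDir` argument are illustrative assumptions for this post, not a Bazel API.

```typescript
// Resolve a workspace-relative path to where the file actually lives:
// the source tree for hand-written files, the output tree for generated ones.
function resolveInput(
  workspacePath: string,
  sourceFiles: Set<string>,
  generatedFiles: Set<string>,
  binDir: string,
): string {
  if (sourceFiles.has(workspacePath)) return workspacePath;
  if (generatedFiles.has(workspacePath)) return `${binDir}/${workspacePath}`;
  throw new Error(`${workspacePath} is neither a source file nor a known output`);
}
```

Every tool that assumes co-located inputs effectively needs someone to do this translation on its behalf.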

Can you share a concrete example of the problem here?

Say you have TypeScript source files at src/* that depend on generated files, src/protos.js and src/protos.d.ts.

You will need to do this if you’re generating a TypeScript protobuf client, or WebAssembly code from C++ source files, among other use cases.

Now, let’s assume you want to build the project with Webpack, using the ts-loader Webpack loader.

You know why? Because that’s about the only simple way to get Webpack dev server working with live reload. Life is not pretty in Bazel/Webpack/TypeScript land.

Now, say, your project is located at my/awesome/app in your monorepo, and you want to import protos for use. What do you do? Does the following work from src/index.ts?

import protos from './protos';

Nope, it doesn’t. You know why? Because the generated protos.js file is not put in the source tree, it’s made available at the following relative path:

import protos from '../../../../bazel-out/k8-fastbuild/bin/my/awesome/app/src/protos';

Not pretty.

Well, you can make it prettier, at least at the point of usage, by adding a baseUrl setting to your TypeScript config file at my/awesome/app/tsconfig.json, and importing it as:

import protos from 'protos';

But in order for this to work, your tsconfig.json file needs to list protos as a path mapping with the following path:

{
    ...
    "compilerOptions": {
        ...
        "paths": {
            "protos": ["../../../bazel-out/k8-fastbuild/bin/my/awesome/app/src/protos"]
        }
    }
}

Only then can ts-loader compile your code with the generated protos.js file as an input, with the path itself coming from Bazel helpers that provide the absolute output path for protos.js.

And, of course, you shouldn’t actually depend on the hardcoded k8-fastbuild string in there, because it changes with the specific Bazel flags you used for the build. So instead you have to generate that tsconfig.json itself using Bazel.
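Whatever mechanism produces the file, the generation step boils down to something like the sketch below. The function name and the `baseUrl` value of "." are assumptions made for illustration; the point is only that the configuration-specific output directory becomes a parameter instead of a hardcoded string.

```typescript
// Build tsconfig contents for a package, given the output directory of the
// current build configuration (for example "bazel-out/k8-fastbuild/bin").
function makeTsconfig(packagePath: string, binDir: string): string {
  // One ".." per path segment climbs from the package to the workspace root.
  const toRoot = packagePath.split("/").map(() => "..").join("/");
  const config = {
    compilerOptions: {
      baseUrl: ".", // assumption: resolve mappings relative to the package folder
      paths: {
        protos: [`${toRoot}/${binDir}/${packagePath}/src/protos`],
      },
    },
  };
  return JSON.stringify(config, null, 2);
}
```

For example, `makeTsconfig("my/awesome/app", "bazel-out/k8-fastbuild/bin")` yields the same mapping as the hand-written version, and a different output directory can be passed in without touching anything else.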

Oh but wait, a generated tsconfig.json will be generated inside bazel-out/k8-fastbuild/bin/..., instead of in your source tree at my/awesome/app/tsconfig.json. Great.

Well, you can still provide that generated tsconfig.json’s path to ts-loader as an option, using Bazel helpers to convert it to its absolute path inside the Bazel output cache, but hey, what’s that?
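In Webpack terms, that wiring might look roughly like this. ts-loader’s `configFile` option is what accepts the tsconfig path; the `makeWebpackConfig` function, the entry point, and the way `tsconfigPath` gets computed are assumptions made for this sketch.

```typescript
// Produce a Webpack configuration object that points ts-loader at a
// tsconfig.json living outside the source tree.
function makeWebpackConfig(tsconfigPath: string) {
  return {
    entry: "./src/index.ts",
    resolve: { extensions: [".ts", ".js"] },
    module: {
      rules: [
        {
          test: /\.ts$/,
          loader: "ts-loader",
          // Path to the generated tsconfig.json, supplied by the build.
          options: { configFile: tsconfigPath },
        },
      ],
    },
  };
}
```

The config itself stays small; the awkward part is that `tsconfigPath` has to be threaded in from Bazel rather than discovered next to the sources.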

Now your IDE isn’t able to resolve the types to those protos? What? Yeah.

IDEs (and IDE plugins) don’t understand that Bazel’s output cache is this separate thing where they may find relevant stuff that they would have otherwise found at the same corresponding path in the source tree.

I’ll repeat: life is not pretty in Bazel/Webpack/TypeScript land.

Phew, wow. But, well, why does Bazel generate outputs somewhere else then?

In order to guarantee hermeticity, which is the idea that the same input always generates the exact same output, no matter when or on what machine you run the build.

Very “functional”, in the programming language sense of the term, but for build systems! It’s remarkable, really, but it makes Bazel hard to understand and use.

In fact, because source files are not mixed with generated ones, you can simply delete a folder, $HOME/.cache/bazel or something inside it, to get rid of all intermediate build cache, and recompile from scratch.

Separating outputs from the source tree entirely, and with stricter write permissions, pretty much ensures that modifications happen only in source files, and that any output file can safely be considered a deterministic function of its own inputs.

In fact, for the sake of hermeticity, Bazel wants to ensure that build tools only see a sub-section of the entire, potentially huge, source tree.

In order to do that, in addition to the separate parallel folder hierarchy for the output cache, it also needs another separate parallel folder hierarchy to act as the current working directory for build tool processes when they execute. That’s the only way to ensure that the build tool sees nothing in the working directory where it executes other than the explicitly specified inputs meant for it.
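The visibility rule can be modeled in a few lines. This is only an in-memory sketch with hypothetical names; Bazel’s real sandbox operates on actual directories, not maps.

```typescript
// Hypothetical in-memory file tree: path -> file contents.
type FileTree = Map<string, string>;

// Build the view of the file system that a single build action gets to see:
// only its explicitly declared inputs, nothing else from the workspace.
function makeSandbox(workspace: FileTree, declaredInputs: string[]): FileTree {
  const sandbox: FileTree = new Map();
  for (const path of declaredInputs) {
    const contents = workspace.get(path);
    if (contents === undefined) {
      throw new Error(`Declared input does not exist: ${path}`);
    }
    sandbox.set(path, contents);
  }
  return sandbox;
}
```

A tool running against the returned tree cannot accidentally read a file it never declared, because that file simply isn’t there.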

Why? Hermeticity.

Fine. But how can this even be improved? How would you make this better?

Hold that thought. Let me write that up in a new post.

Trigger warning: it involves building a new kind of file system.

Reach out on Twitter, LinkedIn, or email if you don’t like waiting.