Functional-navigational programming in Clojure(Script) with Specter

In February I open-sourced a library called Specter, and in my own work it has become by far my most-used library. It has changed the way I approach some fundamental aspects of programming, namely how I interact with and manipulate my program's data. I call the approach I take now "functional-navigational programming". I'm not the first one to come up with these ideas, nor is it a full-fledged paradigm in the sense of object-oriented or functional programming. But I give it a name because these techniques have changed the way I go about structuring huge amounts of my code. The best part is the abstractions used in this approach are not only concise and elegant – but also have performance rivaling hand-optimized code.

One of Clojure's greatest strengths is its powerful facilities for doing immutable programming: persistent data structures and a standard library that incorporates immutable programming at its core. Where Clojure's standard library gives you difficulty is dealing with composite immutable data structures, like a map of lists of maps. This is incredibly common, and I've run into it over and over in my years of programming Clojure. You're forced to write code that not only finds and manipulates the subvalue you care about, but also reconstructs the rest of the input data structure in the process.

Much more powerful than having getters and setters for individual data structures is having navigators into those data structures. Navigators can be composed arbitrarily, allowing you to concisely manipulate composite data structures of arbitrary sophistication. Let's look at an example to illustrate this difference. Suppose you're writing a program whose state looks something like this:

(def world
  {:people [{:money 129827 :name "Alice Brown"}
            {:money 100 :name "John Smith"}
            {:money 6821212339 :name "Donald Trump"}
            {:money 2870 :name "Charlie Johnson"}
            {:money 8273821 :name "Charlie Rose"}
            ]
   :bank {:funds 4782328748273}})

This data structure contains information about a bank and its list of customers. Notice that customers are indexed by the order in which they joined the bank, not by their names.

Now suppose you want to do a simple transformation that transfers money from a user to the bank. This code is ugly but also typical of Clojure code that deals with composite data structures:

(defn user->bank [world name amt]
  (let [;; First, find out how much money that user has
        ;; to determine whether or not this is a valid transfer
        curr-funds (->> world
                        :people
                        (filter (fn [user] (= (:name user) name)))
                        first
                        :money)]
    (if (< curr-funds amt)
      (throw (IllegalArgumentException. "Not enough funds!"))
      ;; If valid, then need to subtract the transfer amount from the
      ;; user and add the amount to the bank
      (-> world
          (update
            :people
            (fn [user-list]
              ;; Important to use mapv to maintain the type of the
              ;; sequence containing the list of users. This code
              ;; modifies the user matching the name and keeps
              ;; every other user in the sequence the same.
              (mapv (fn [user]
                      ;; Notice how nested this code is that manipulates the users
                      (if (= (:name user) name)
                        (update user :money #(- % amt))
                        ;; If a user doesn't match the name during the scan,
                        ;; don't modify them
                        user))
                    user-list)))
          (update-in
            [:bank :funds]
            #(+ % amt))))))

There are a lot of problems with this code:

  • Not only does it need to do the appropriate credit and deduction, it also needs to reconstruct the world data structure it traversed on its way to the manipulated values. This logic is spread throughout the function.
  • The code is nested and difficult to read.
  • This function is specific to only one particular kind of transfer. There are many other kinds of transfers you may want to do: bank to a user, bank to many users, users to users, and so on. Each one of these functions would be burdened with the same necessity of navigating and reconstructing the data structure.

A better approach

Of course, there's a far better approach. Let's take a look at a generic transfer function that uses Specter to do a many-to-many transfer of a fixed amount between any two sets of entities. To be clear on the semantics of this function:

  • If the bank, Bob, and Alice transfer $50 to Jim and Sally, then Jim and Sally each receive $150 while the bank, Bob, and Alice each lose $100.
  • If any of the transferring entities lack sufficient funds, an error is thrown.

Here is the implementation:

(defn transfer
  "Note that this function works on *any* world structure. This handles
   arbitrary many to many transfers of a fixed amount without overdrawing anyone"
  [world from-path to-path amt]
  (let [;; Get the sequence of funds for all entities making a transfer
        givers (select from-path world)

        ;; Get the sequence of funds for all entities receiving a transfer
        receivers (select to-path world)

        ;; Compute total amount each receiver will be credited
        total-receive (* amt (count givers))

        ;; Compute total amount each transferrer will be deducted
        total-give (* amt (count receivers))]

    ;; Make sure every transferrer has sufficient funds
    (if (every? #(>= % total-give) givers)
      (->> world
           ;; Deduct from transferrers
           (transform from-path #(- % total-give))
           ;; Credit the receivers
           (transform to-path #(+ % total-receive)))
      (throw (IllegalArgumentException. "Not enough funds!")))))

The keys to this code are the "select" and "transform" functions. They utilize the concept of a "path" which identifies elements within a data structure that should be queried or manipulated. Let's hold off for a second on the details of what those paths look like and make some observations about this transfer function:

  • It's extremely generic. It handles fixed many-to-many transfers between any sets of entities.
  • It's easy to read and elegant.
  • Unlike the first example, this code is agnostic to the details of the "world" data structure. This works with any representation of the world.
  • It's very fast. Even though it's so much more generic than the initial user->bank function, it only executes slightly slower for that one particular use case.

This is some of the power of functional-navigational programming. How to get to your data is separated from what you want to do with it. This allows for generic and powerful abstractions like the transfer function.
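
To make this concrete, here's a small illustration (using the world structure above) of the same path being used both to query and to transform; how paths work is explained in the next section:

(select [:people ALL :money] world)
;; => the :money value of every person

(transform [:people ALL :money] #(+ % 10) world)
;; => a new world in which every person's :money is increased by 10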

Of course, the transfer function is only as powerful as the paths that can be passed to it. So let's take a quick detour to explore the concept of a "path" within a data structure. You'll see that they're extremely flexible and allow you to navigate in a very fine-grained way.

Core concepts of Specter

A path is just a list of steps for how to navigate into a data structure. That path can then be used to either query for subvalues or to do a transformation of a data structure. For example, if your data structure is a list of maps, here's code that increments all even values for :a keys:

(transform [ALL :a even?]
           inc
           [{:a 2 :b 3} {:a 1} {:a 4}])
;; => [{:a 3 :b 3} {:a 1} {:a 5}]

First, the "ALL" selector navigates to every map in the sequence. For each map, the ":a" keyword navigates to the value for that key within every map. Then, the "even?" function only stays at values which are even. After the selector is the "transform function" which takes in each value navigated to and returns its replacement value.

To understand how this code works it's helpful to walk through how the data flows from step to step. First you start off with the input data structure:

[{:a 2 :b 3} {:a 1} {:a 4}]

ALL navigates to each element of the sequence, continuing the navigation from each element independently:

{:a 2 :b 3}
{:a 1}
{:a 4}

The :a keyword navigates to the value of that key within each element, leading to:

2
1
4

Then, the even? function only stays navigated at values which match the filter. This removes 1 from the navigated values, leaving:

2
4

Now Specter has reached the end of navigation, so it applies the update function (inc) to every value:

3
5

Now it's time to reconstruct the original data structure with these changes applied. To do this the navigators are traversed in reverse. The even? function brings back any values which it filtered out before:

3
1
5

The :a keyword replaces the values for :a in the original maps with the new values:

{:a 3 :b 3}
{:a 1}
{:a 5}

Finally, ALL puts everything back together in a sequence of the same type as the original sequence:

[{:a 3 :b 3} {:a 1} {:a 5}]

And that completes this transformation. Let's take a look at another example. This one increments the last odd number in a sequence of numbers:

(transform [(filterer odd?) LAST]
           inc
           [2 1 3 6 7 4 8])
;; => [2 1 3 6 8 4 8]

"(filterer odd?)" navigates to a view of the sequence that only contains the odd numbers. "LAST" navigates to the last element of that sequence. When the data structure is reconstructed, only the last odd number is incremented.

Let's look at the data flow for this transformation as well. The transformation starts with the input data structure:

[2 1 3 6 7 4 8]

The (filterer odd?) navigator filters the sequence for odd numbers. It also remembers which index in the original sequence each filtered number came from. This will be used later during reconstruction.

[1 3 7]

The LAST navigator simply takes the last value of the sequence:

7

This is the end of navigation, so the update function is applied:

8

Now Specter works backwards through the navigators to reconstruct the data structure. LAST replaces the last value of its input sequence:

[1 3 8]

(filterer odd?) uses the index map it made earlier to write the new values back into the original sequence at the appropriate indices:

[2 1 3 6 8 4 8]

That's how you end up with the final result.

The next example reverses the positions of all the even numbers between indices 4 and 11:

(transform [(srange 4 11) (filterer even?)]
           reverse
           [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15])
;; => [0 1 2 3 10 5 8 7 6 9 4 11 12 13 14 15]

"srange" navigates to the subsequence bound by the two specified indices. The "reverse" function receives all the odd numbers between those two indices, and then Specter reconstructs the original data structure with the appropriate changes. This example is nifty because writing it by hand is actually quite difficult.

Let's take a look at doing queries using Specter. Here's how to get every number divisible by three out of a sequence of sequences:

(select [ALL ALL #(= 0 (mod % 3))]
        [[1 2 3 4] [] [5 3 2 18] [2 4 6] [12]])
;;=> [3 3 18 6 12]

"select" always returns a sequence of results because paths can select many elements. In this case two ALL's are needed because there are two levels of sequences in this data structure.

As you can see, each component of a path specifies one step of a navigation. What's powerful is that these individual steps can be composed together any which way, arbitrarily. This allows you to specify queries and transformations of immense sophistication.

At the core of Specter is a protocol for specifying one step of navigation. It looks like this:

(defprotocol StructurePath
  (select* [this structure next-fn])
  (transform* [this structure next-fn]))

Every single selector you've seen so far is defined in terms of this protocol. For example, here's how keywords implement it:

(extend-type clojure.lang.Keyword
  StructurePath
  (select* [kw structure next-fn]
    (next-fn (get structure kw)))
  (transform* [kw structure next-fn]
    (assoc structure kw (next-fn (get structure kw)))))

The protocol has one method for doing selects and another for doing transforms. In the select case, the "next function" finishes the selection from whatever values this step navigates to. In the transform case, the "next function" will transform any value this step navigates to, and the step is responsible for incorporating any transformed subvalues into the original data structure. As you can see from this example, the StructurePath implementation for keywords perfectly captures what it means to navigate within a data structure by a keyword.
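
As another illustration of the same protocol, here's roughly how a plain predicate function could implement StructurePath so that it behaves like the even? step seen earlier. This is a hedged sketch rather than Specter's exact source: selection continues only when the predicate passes, and transforms leave non-matching values untouched.

(extend-type clojure.lang.AFn
  StructurePath
  (select* [afn structure next-fn]
    ;; only continue navigation for values that satisfy the predicate
    (when (afn structure)
      (next-fn structure)))
  (transform* [afn structure next-fn]
    ;; transform matching values; return non-matching values unchanged
    (if (afn structure)
      (next-fn structure)
      structure)))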

Back to the bank example

Now that you've seen how paths work within Specter, I'll demonstrate how flexible this abstraction is with a variety of different kinds of transfers on the original bank example.

Here's how to get every person to pay a $1 fee to the bank:

(defn pay-fee [world]
  (transfer world
            [:people ALL :money]
            [:bank :funds]
            1))

Here's how to have every person receive $1 from the bank. The arguments are simply reversed as you would expect:

(defn bank-give-dollar [world]
  (transfer world
            [:bank :funds]
            [:people ALL :money]
            1))

Here's a function that returns a path to a particular user. It scans through all users and only selects those matching the given name. This function can be used to do transfers involving particular users.

(defn user [name]
  [:people
   ALL
   #(= (:name %) name)])

Later on, you'll see that there's a better way to implement "user" that allows for much better performance. For now, here's a function that transfers between two users:

(defn transfer-users [world from to amt]
  (transfer world
            [(user from) :money]
            [(user to) :money]
            amt))

And here's a function to implement the initial example, transferring money from a user to the bank:

(defn user->bank [world from amt]
  (transfer world
            [(user from) :money]
            [:bank :funds]
            amt))

Finally, here's a function to give a $5000 "loyalty bonus" to the oldest three users of the bank:

(defn bank-loyal-bonus [world]
  (transfer world
            [:bank :funds]
            [:people (srange 0 3) ALL :money]
            5000))

As you can see, Specter can navigate through data structures in a very diverse set of ways. And what you've seen so far is just the tip of the iceberg: see the README for more of the selectors that come with Specter.

Without Specter, implementing each of these transformations would have been tedious and repetitive – each would have been burdened with precisely reconstructing anything in the input data structure it didn't touch. But by having a few simple navigators and composing them together, each specific transformation can be handled very easily. This is the crux of functional-navigational programming: better a handful of generic navigators than a lot of specific operations.

Achieving high performance with precompilation

Using Specter as shown actually won't get you very good performance – interpreting those paths is quite costly. But the good news is that with a slight amount more effort, you can get performance that's 5-10x better and rivals hand-optimized code.

Most of the cost of running a select or transform is interpreting those paths, especially when the data structure being manipulated is small and the individual navigation operations are cheap. So Specter allows you to precompile your paths to achieve much higher performance by stripping away all the overhead. Here's a precompiled version of one of the previous examples:

(def compiled-path (comp-paths ALL :a even?))
(transform compiled-path
           inc
           [{:a 2 :b 3} {:a 1} {:a 4}])

Precompiled paths act just like any other navigator and can be composed with other navigators. If you know for sure that your path is going to be precompiled, you can use the compiled-select and compiled-transform functions to squeeze out even more performance.
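
For instance, here's a minimal sketch (reusing the world structure from earlier) of a precompiled path being dropped into a larger path like any other navigator, and of compiled-select being used once a path is fully precompiled:

(def MONEY (comp-paths :money))

;; a compiled path used as one step of a larger path
(select [:people ALL MONEY] world)

;; a fully precompiled query with the interpretation overhead stripped away
(compiled-select (comp-paths :people ALL MONEY) world)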

Let's take a look at some basic microbenchmarks to see how good Specter's performance is. Here are five different ways to get a value out of a deeply nested map. The benchmark function times how long it takes to run its input function that many times.

(def DATA {:a {:b {:c 1}}})
(def compiled-path (comp-paths :a :b :c))

(benchmark 1000000 #(get-in DATA [:a :b :c]))
;; => "Elapsed time: 77.018 msecs"

(benchmark 1000000 #(select [:a :b :c] DATA))
;; => "Elapsed time: 4143.343 msecs"

(benchmark 1000000 #(select compiled-path DATA))
;; => "Elapsed time: 63.183 msecs"

(benchmark 1000000 #(compiled-select compiled-path DATA))
;; => "Elapsed time: 51.964 msecs"

(benchmark 1000000 #(-> DATA :a :b :c vector))
;; => "Elapsed time: 34.235 msecs"

You can see what a huge difference precompilation makes, giving almost a 100x improvement for this particular use case. The fully compiled Specter execution is also more than 30% faster than get-in, one of Clojure's few built-in functions for dealing with nested data structures! Finally, the last example shows how long it takes to run the equivalent selection with direct, inlined code. Specter's not too far off, especially when you consider how high-level of an abstraction it is.

Let's now look at a benchmark for transforms. Here are five different ways to increment the value in that nested map:

(benchmark 1000000 #(update-in DATA [:a :b :c] inc))
;; => "Elapsed time: 1037.94 msecs"

(benchmark 1000000 #(transform [:a :b :c] inc DATA))
;; => "Elapsed time: 4305.429 msecs"

(benchmark 1000000 #(transform compiled-path inc DATA))
;; => "Elapsed time: 184.593 msecs"

(benchmark 1000000 #(compiled-transform compiled-path inc DATA))
;; => "Elapsed time: 169.841 msecs"

(defn manual-transform [data]
  (update data
          :a
          (fn [d1]
            (update d1
                    :b
                    (fn [d2]
                      (update d2 :c inc))))))
(benchmark 1000000 #(manual-transform DATA))
;; => "Elapsed time: 161.945 msecs"

Once again, precompilation brings massive performance improvements. In this case, the comparison against Clojure's built-in equivalent update-in is even more dramatic: Specter is over 5x faster. Even more striking, the last benchmark measures a hand-written implementation, and Specter's performance is extremely close to it.

Precompile anywhere, anytime

Up until a few weeks ago, this was the extent of Specter's story. Specter could precompile paths and achieve great performance if the path was known statically. In the 0.7.0 release though, Specter gained a new capability that allows it to precompile any path at any time, even if the path requires parameters which aren't available yet. This lets you use Specter's very high level of abstraction with great performance in all situations. Since the problem Specter solves is so common, with this new capability I'm now comfortable referring to Specter as Clojure's missing piece.

Let's take a look at compiling paths that don't yet have their parameters. Earlier you saw this example that reverses the position of all even numbers in a subsequence:

(transform [(srange 4 11) (filterer even?)]
           reverse
           [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15])
;; => [0 1 2 3 10 5 8 7 6 9 4 11 12 13 14 15]

Let's say you want a function that encapsulates this behavior but takes in the indices and the filtering predicate as parameters. An attempt without precompilation would look like this:

(defn reverse-matching-in-range [aseq start end predicate]
  (transform [(srange start end) (filterer predicate)]
             reverse
             aseq))
Because there's no precompilation, there's a lot of overhead in running this function. To precompile this without its parameters, you can do this:

(let [compiled-path (comp-paths srange (filterer pred))]
  (defn reverse-matching-in-range [aseq start end predicate]
    (compiled-transform (compiled-path start end predicate)
                        reverse
                        aseq)))
The compiled path takes in a number of parameters equal to the sum of the parameters its path elements require. And since all the precompilation optimizations are applied, this code executes very fast.
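
For example, calling this version on the sequence from the earlier srange example should give the same result as before:

(reverse-matching-in-range [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] 4 11 even?)
;; => [0 1 2 3 10 5 8 7 6 9 4 11 12 13 14 15]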

We can now come back to the bank example and make an efficient implementation of the user->bank function in terms of Specter. All you have to do is take advantage of the ability to precompile paths without their parameters, like so:

(def user
  (comp-paths :people
              ALL
              (paramsfn [name]
                        [elem]
                        (= name (:name elem)))))

(def user-money (comp-paths user :money))

(def BANK-MONEY (comp-paths :bank :funds))

(defn user->bank [world name amt]
  (transfer world (user-money name) BANK-MONEY amt))

That's all there is to it! Converting uncompiled paths to compiled paths is always a straightforward refactoring.
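
As a quick sanity check (using the world structure from the top of the post and an arbitrary amount), the compiled version is called exactly like before:

(user->bank world "John Smith" 25)
;; => a new world with John Smith's :money reduced by 25 and the bank's :funds increased by 25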


Specter has very close similarities to prior work, especially lenses in Haskell. I'm not intimately familiar with Haskell lenses, so I'm not sure if they're entirely equivalent. Specter has other features that weren't discussed in this post (discussed in the README) that I'm not sure are in Haskell. Any clarification from Haskell experts out there would be welcome.

The functional-navigational approach leverages the power of composition to produce more concise and declarative code. In my own work I have selectors for navigating graphs in a variety of ways: in topological order, to a subgraph (with the ability to replace the subgraph with a new subgraph, with metadata indicating how to reattach the edges to the surrounding graph), to other nodes via outgoing or incoming edges, and so on. By focusing on making generic navigators, rather than functions for the specific transformations I need, I'm able to define the transformations I need for particular cases via the composition of my generic navigators. Since the graph navigators compose with all the other navigators you've seen, the possibilities are endless (The graph navigators are a little tied to my own datatypes, so I haven't open-sourced them yet. But they are surprisingly easy to implement – only about 150 lines of code. I would love to see someone contribute a specter-graph library).

And that pretty much summarizes the functional-navigational approach. Instead of thinking in terms of specific transformations, you make generic navigators that compose to your specific transformations – plus a heck of a lot more. My major accomplishment with Specter was figuring out how to make this all blazing fast within a dynamic language.

I've loved using Clojure for the majority of my work the past five years, and Specter makes that experience even better. To me Specter really does feel like Clojure's missing piece, and I strongly believe every single Clojure/ClojureScript programmer will benefit from using it.


History of Apache Storm and lessons learned

Apache Storm recently became a top-level project, marking a huge milestone for the project and for me personally. It's crazy to think that four years ago Storm was nothing more than an idea in my head, and now it's a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.

The topics I will cover through Storm's history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm's history.

Before Storm

Storm originated out of my work at BackType. At BackType we built analytics products to help businesses understand their impact on social media both historically and in realtime. Before Storm, the realtime portions of our implementation were done using a standard queues and workers approach. For example, we would write the Twitter firehose to a set of queues, and then Python workers would read those tweets and process them. Oftentimes these workers would send messages through another set of queues to another set of workers for further processing.

We were very unsatisfied with this approach. It was brittle – we had to make sure the queues and workers all stayed up – and it was very cumbersome to build apps. Most of the logic we were writing had to do with where to send/receive messages, how to serialize/deserialize messages, and so on. The actual business logic was a small portion of the codebase. Plus, it didn't feel right – the logic for one application would be spread across many workers, all of which were deployed separately. It felt like all that logic should be self-contained in one application.

The first insight

In December of 2010, I had my first big realization. That's when I came up with the idea of a "stream" as a distributed abstraction. Streams would be produced and processed in parallel, but they could be represented in a single program as a single abstraction. That led me to the idea of "spouts" and "bolts" – a spout produces brand new streams, and a bolt takes in streams as input and produces streams as output. The key insight was that spouts and bolts were inherently parallel, similar to how mappers and reducers are inherently parallel in Hadoop. Bolts would simply subscribe to whatever streams they need to process and indicate how the incoming stream should be partitioned to the bolt. Finally, the top-level abstraction I came up with was the "topology", a network of spouts and bolts.

I tested these abstractions against our use cases at BackType and everything fit together very nicely. I especially liked the fact that all the grunt work we were dealing with before – sending/receiving messages, serialization, deployment, etc. would be automated by these new abstractions.

Before embarking on building Storm, I wanted to validate my ideas against a wider set of use cases. So I sent out this tweet:

A bunch of people responded and we emailed back and forth with each other. It became clear that my abstractions were very, very sound.

I then embarked on designing Storm. I quickly hit a roadblock when trying to figure out how to pass messages between spouts and bolts. My initial thoughts were that I would mimic the queues and workers approach we were doing before and use a message broker like RabbitMQ to pass the intermediate messages. I actually spent a bunch of time diving into RabbitMQ to see how it could be used for this purpose and what that would imply operationally. However, the whole idea of using message brokers for intermediate messages didn't feel right and I decided to sit on Storm until I could better think things through.

The second insight

The reason I thought I needed those intermediate brokers was to provide guarantees on the processing of data. If a bolt failed to process a message, it could replay it from whatever broker it got the message from. However, a lot of things bothered me about intermediate message brokers:

  1. They were a huge, complex moving part that would have to be scaled alongside Storm.
  2. They create uncomfortable situations, such as what to do when a topology is redeployed. There might still be intermediate messages on the brokers that are no longer compatible with the new version of the topology. So those messages would have to be cleaned up/ignored somehow.
  3. They make fault-tolerance harder. I would have to figure out what to do not just when Storm workers went down, but also when individual brokers went down.
  4. They're slow. Instead of sending messages directly between spouts and bolts, the messages go through a 3rd party, and not only that, the messages need to be persisted to disk.

I had an instinct that there should be a way to get that message processing guarantee without using intermediate message brokers. So I spent a lot of time pondering how to get that guarantee with spouts and bolts passing messages directly to one another. Without intermediate message persistence, it was implied that retries would have to come from the source (the spout). The tricky thing was that the failure of processing could happen anywhere downstream from the spout, on a completely different server, and this would have to be detected with perfect accuracy.

After a few weeks of thinking about the problem I finally had my flash of insight. I developed an algorithm based on random numbers and xors that would only require about 20 bytes to track each spout tuple, regardless of how much processing was triggered downstream. It's easily one of the best algorithms I ever developed and one of the few times in my career I can say I would not have come up with it without a good computer science education.
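
To make the idea concrete, here is a tiny sketch of the xor trick (an illustration only, not Storm's actual acker code): every tuple in the tree gets a random 64-bit id, each id is xor'd into a single tracked value when the tuple is created and xor'd in again when it is acked, and since x XOR x = 0 the tracked value returns to zero exactly when every tuple in the tree has been acked, no matter how many tuples were created downstream.

(def rng (java.util.Random.))

(defn new-id []
  ;; random 64-bit id for a tuple
  (.nextLong rng))

(defn track [ack-val & ids]
  ;; xor each id into the running ack value
  (reduce bit-xor ack-val ids))

;; Example: a spout tuple fans out into two downstream tuples, which are then acked.
(let [t1 (new-id)
      t2 (new-id)]
  (-> 0
      (track t1 t2) ;; downstream tuples created
      (track t1)    ;; t1 fully processed
      (track t2)    ;; t2 fully processed
      zero?))
;; => true, meaning the whole tuple tree completed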

Once I figured out this algorithm, I knew I was onto something big. It massively simplified the design of the system by avoiding all the aforementioned problems, along with making things way more performant. (Amusingly, the day I figured out the algorithm I had a date with a girl I'd met recently. But I was so excited by what I'd just discovered that I was completely distracted the whole time. Needless to say, I did not do well with the girl.)

Building first version

Over the next 5 months, I built the first version of Storm. From the beginning I knew I wanted to open source it, so I made some key decisions in the early days with that in mind. First off, I made all of Storm's APIs in Java, but implemented Storm in Clojure. By keeping Storm's APIs 100% Java, Storm was ensured to have a very large amount of potential users. By doing the implementation in Clojure, I was able to be a lot more productive and get the project working sooner.

I also planned from the beginning to make Storm usable from non-JVM languages. Topologies are defined as Thrift data structures, and topologies are submitted using a Thrift API. Additionally, I designed a protocol so that spouts and bolts could be implemented in any language. Making Storm accessible from other languages makes the project accessible by more people. It makes it much easier for people to migrate to Storm, as they don't necessarily have to rewrite their existing realtime processing in Java. Instead they can port their existing code to run on Storm's multi-language API.

I was a long time Hadoop user and used my knowledge of Hadoop's design to make Storm's design better. For example, one of the most aggravating issues I dealt with from Hadoop was that in certain cases Hadoop workers would not shut down and the processes would just sit there doing nothing. Eventually these "zombie processes" would accumulate, soaking up resources and making the cluster inoperable. The core issue was that Hadoop put the burden of worker shutdown on the worker itself, and for a variety of reasons workers would sometimes fail to shut themselves down. So in Storm's design, I put the burden of worker shutdown on the same Storm daemon that started the worker in the first place. This turned out to be a lot more robust and Storm has never had issues with zombie processes.

Another problem I faced with Hadoop was if the JobTracker died for any reason, any running jobs would terminate. This was a real hair-puller when you had jobs that had been running for many days. With Storm, it was even more unacceptable to have a single point of failure like that since topologies are meant to run forever. So I designed Storm to be "process fault-tolerant": if a Storm daemon is killed and restarted it has absolutely no effect on running topologies. It makes the engineering more challenging, since you have to consider the effect of the process being kill -9'd and restarted at any point in the program, but it makes things far more robust.

A key decision I made early on in development was to assign one of our interns, Jason Jackson, to develop an automated deploy for Storm on AWS. This massively accelerated Storm's development, as it made it easy for me to test clusters of all different sizes and configurations. I really cannot emphasize enough how important this tool was, as it enabled me to iterate much, much faster.

Acquisition by Twitter

In May of 2011, BackType got into acquisition talks with Twitter. The acquisition made a lot of sense for us for a variety of reasons. Additionally, it was attractive to me because I realized I could do so much more with Storm by releasing it from within Twitter than from within BackType. Being able to make use of the Twitter brand was very compelling.

During acquisition talks I announced Storm to the world by writing a post on BackType's tech blog. The purpose of the post was actually just to raise our valuation in the negotiations with Twitter. And it worked: Twitter became extremely interested in the technology, and when they did their tech due-diligence on us, the entire due-diligence turned into a big demo of Storm.

The post had some surprising other effects. In the post I casually referred to Storm as "the Hadoop of realtime", and this phrase really caught on. To this day people still use it, and it even gets butchered into "realtime Hadoop" by many people. This accidental branding was really powerful and helped with adoption later on.

Open-sourcing Storm

We officially joined Twitter in July of 2011, and I immediately began planning Storm's release.

There are two ways you can go about releasing open source software. The first is to "go big", build a lot of hype for the project, and then get as much exposure as possible on release. This approach can be risky though, since if the quality isn't there or you mess up the messaging, you will alienate a huge number of people to the project on day one. That could kill any chance the project had to be successful.

The second approach is to quietly release the code and let the software slowly gain adoption. This avoids the risks of the first approach, but it has its own risk of people viewing the project as insignificant and ignoring it.

I decided to go with the first approach. I knew I had a very high quality, very useful piece of software, and through my experience with my first open source project Cascalog, I was confident I could get the messaging right.

Initially I planned to release Storm with a blog post, but then I came up with the idea of releasing Storm at a conference. By releasing at a conference:

  1. The conference would help with marketing and promotion.
  2. I would be presenting to a concentrated group of potential early adopters, who would then blog/tweet/email about it all at once, massively increasing exposure.
  3. I could hype my conference session, building anticipation for the project and ensuring that on the day of release, there would be a lot of eyes on the project.

So releasing at a conference seemed superior in all respects. Coincidentally, I was scheduled to present at Strange Loop that September on a completely different topic. Since that was when I wanted to release Storm, I emailed Alex, the Strange Loop organizer, and changed my session to be the release of Storm. As you can see from the session description, I made sure to use the Twitter brand in describing Storm.

Next, I began the process of hyping Storm. In August of 2011, a little over a month before the conference, I wrote a post on Twitter's tech blog announcing that Storm would be released at Strange Loop. In that post I built excitement for Storm by showing a lot of the details of how Storm works and by giving code examples that demonstrated Storm's elegance. The post had the effect I wanted and got people really excited.

The next day I did something which I thought was really clever. I started the mailing list for Storm:

Here's why I think that was clever. A key issue you have to deal with to get adoption for a project is building social proof. Social proof exists in many forms: documented real-world usage of the project, Github watchers, mailing list activity, mailing list subscribers, Twitter followers, blog posts about the project, etc. If I had started the mailing list the day I released the project, then when people looked at it, it would have shown zero activity and very few subscribers. Potentially the project would be popular immediately and the mailing list would build social proof, but I had no guarantee of that.

By starting the mailing list before release, I was in a situation of arbitrage. If people asked questions and subscribed, then I was building social proof. If nothing happened, it didn't matter because the project wasn't released yet.

A mistake I made in those early days, which is bizarre since I was working at Twitter, was not starting a Twitter account for the project. Twitter's a great way to keep people up to date about a project as well as constantly expose people to the project (through retweets). I didn't realize I should make a Twitter account until well after release, but fortunately it didn't turn out to be that big of a deal. If I could do it again I would have started the Twitter account the same day I made the mailing list.

Between the time I wrote the post on Twitter's tech blog and the start of Strange Loop, I spent the majority of my time writing documentation for Storm. This is the single most important thing I did for the project. I wrote about 12,000 words of carefully thought out documentation – tutorials, references, API docs, and so on. A lot of open source developers don't realize how crucial docs are: people cannot use your software if they don't understand it. Writing good documentation is painful and time-consuming, but absolutely essential.

The moment of truth came on September 19th, 2011. I had fun with the release. I started my talk by saying I had been debating whether to open source Storm at the beginning of my talk, starting things off with a bang, or the end of my talk, finishing on an exclamation point. I said I decided to get the best of both worlds by open sourcing Storm right in the middle of my talk. I told the audience the time of the exact middle of my talk, and told them to shout out at me if I hadn't open sourced it by that time. As soon as that moment came, the audience shouted at me and I released the project.

Everything went according to plan. The project got a huge amount of attention and got over 1000 Github watchers on the first day. The project went to #1 on Hacker News immediately. After my talk, I went online and answered questions on Hacker News, the mailing list, and Twitter.

The aftermath of release

Within four days Storm became the most watched Java, Scala, or Clojure project on Github. Within two weeks, a company announced they already had Storm in production. I thought that was incredible and a testament to the high quality of the project and docs at release.

As soon as Storm was released I started getting feedback from people using the project. In the first week I made three minor releases to address quality of life issues people were having. They were minor but I was focused on making sure everyone had the best experience possible. I also added a lot of additional logging into Storm in that first week so that when people ran into issues on the mailing list, they could provide me more information on what was going on.

I didn't anticipate how much time it would take to answer questions on the mailing list. The mailing list had a ton of activity and I was spending one to two hours a day answering questions. Part of what made the mailing list so time-consuming is how bad most people are at asking questions. It's very common to get a question like this: "I'm having a lot of tuple failures. Why??" Most of the time there's an easy fix as typically the user had something strange going on with how they configured or were using Storm. But I would have to spend a ton of time asking the user follow-up questions to get them to provide me the information they already had so I could help them. You'd be amazed at how often a user fails to tell you about something really bizarre they did, like running multiple versions of Storm at once, manually editing the files Storm daemons keep on local disk, running their own modified version of Storm, or using a shared network drive for the state of Storm daemons. These endless hours I spent on the mailing list became very draining (especially since at the same time I was building a brand new team within Twitter) and I wouldn't get relief for well over a year.

Over the next year I did a ton of talks on Storm at conferences, meetups, and companies. I believe I did over 25 Storm talks. It got to a point where I could present Storm with my eyes closed. All this speaking got Storm more and more exposure.

The marketing paid off and Storm acquired production users very quickly. I did a survey in January of 2012 and found out Storm had 10 production users, another 15 planning to have it in production soon, and another 30 companies experimenting with the technology. To have that many production users for a major piece of infrastructure in only 3 months since release was very significant.

I set up a "Powered By" page for Storm to get that last piece of critical social proof going. Rather than just have a list of companies, I requested that everyone who listed themselves on that page include a short paragraph about how they're using it. This allows people reading that page to get a sense of the variety of use cases and scales that Storm can be used for. I included a link to my email on that page for people who wanted to be listed on it. As I did the tech talk circuit, that page continued to grow and grow.

Filling the "Powered By" page for a project can be frustrating, as there can be a lot of people using your project that you're not aware of. I remember one time I got an email from one of the biggest Chinese companies in the world asking to be listed on Storm's Powered By page. They had been using Storm for over a year at that point, but that whole time I had no idea. To this day I don't know the best way to get people to tell you they're using your software. Besides the link to my email on Storm's Powered By page, the technique I've used is to occasionally solicit Powered By submissions via Twitter and the mailing list.

Technical evolution of Storm

Storm is a far more advanced project now than when it was released. On release it was still very much oriented towards the needs we had at BackType, as we had not yet learned the needs of larger companies for major infrastructure. Getting Storm in shape to be deployed widely within Twitter drove its development for the next 1.5 years after release.

The technical needs of a large company are different than a startup. Whereas at a startup a small team manages the entire stack, including operations and deployment, in a big company these functions are typically spread across multiple teams. One thing we learned immediately within Twitter is that people didn't want to run their own Storm clusters. They just wanted a Storm cluster they could use with someone else taking care of operations.

This implied that we needed to be able to have one large, shared cluster running many independent applications. We needed to ensure that applications could be given guarantees on how many resources they would get and make sure there was no possible way one application going haywire would affect other applications on the cluster. This is called "multi-tenancy".

We also ran into process issues. As we built out the shared cluster, we noticed that pretty much everyone was configuring their topologies to use a huge number of resources – way more than they actually needed. This was making usage of the cluster very inefficient. The problem was that no one had an incentive to optimize their topologies. People just wanted to run their stuff and have it work, so from their perspective there was no reason not to request a ton of resources.

I solved both these issues by developing something called the "isolation scheduler". It was an incredibly simple solution that provided for multi-tenancy, created incentives for people to use resources efficiently, and allowed a single cluster to share both production and development workloads.

As more and more people used Storm within Twitter, we also discovered that people needed to monitor their topologies with metrics beyond what Storm captures by default. That led us to developing Storm's excellent metrics API to allow users to collect completely custom, arbitrary metrics, and send those metrics to any monitoring system.

Another big technical jump for Storm was developing Trident, a micro-batching API on top of Storm that provides exactly-once processing semantics. This enabled Storm to be applied to a lot of new use cases.

Besides all these major improvements, there were of course tons of quality of life improvements and performance enhancements along the way. All the work we were doing allowed us to do many releases of Storm – we averaged more than one release a month that first year. Doing frequent releases is incredibly important to growing a project in the early days, as each release gives you a boost of visibility from people tweeting/talking about it. It also shows people that the project is continuously improving and that if they run into issues, the project will be responsive to them.

Building developer community

The hardest part about building an open source project is building the community of developers contributing to the project. This is definitely something I struggled with.

For the first 1.5 years after release, I drove all development of Storm. All changes to the project went through me. There were pros and cons to having all development centered on me.

By controlling every detail of the project, I could ensure the project remained at a very high quality. Since I knew the project from top to bottom, I could anticipate all the ways any given change would affect the project as a whole. And since I had a vision of where the project should go, I could prevent any changes from going in that conflicted with that vision (or modify them to be consistent). I could ensure the project always had a consistent design and experience.

Unfortunately, "visionary-driven development" has a major drawback in that it makes it very hard to build an active and enthusiastic development community. First off, there's very little room for anyone to come in and make major contributions, since I am controlling everything. Second, I am a major bottleneck in all development. It became very, very hard to keep up with pull requests coming in (remember, I was also building a brand new infrastructure team within Twitter at the same time). So people would get discouraged from contributing to the project due to the incredibly slow feedback/merge cycle.

Another drawback to centering development on myself was that people viewed me as a single point of failure for the project. People brought up concerns to me of what would happen if I got hit by a bus. This concern actually limited the project less than you would think, as Storm was adopted by tons of major companies while I was at the center of development, including Yahoo!, Groupon, The Weather Channel, WebMD, Cerner, Alibaba, Baidu, Taobao, and many other companies.

Finally, the worst aspect to centering development on myself was the burden I personally felt. It's a ton of pressure and makes it hard to take a break. However, I was hesitant to expand control over project development to others because I was worried about project quality suffering. There was no way anyone else would have the deep understanding I did of the entire code base, and inevitably that would lead to changes going in with unintended consequences. However, I began to realize that this is something you have to accept when expanding a developer community. And later on I would realize this isn't as big of a deal as I thought.

Leaving Twitter

When I left Twitter in March of 2013 to pursue my current startup, I was still at the center of Storm development. After a few months it became a priority to remove myself as a bottleneck to the project. I felt that Storm would be better served with a consensus-driven development model.

I think "visionary-driven development" is best when the solution space for the project hasn't been fully explored yet. So for Storm, having me controlling all decisions as we built out multi-tenancy, custom metrics, Trident, and the major performance refactorings was a good thing. Major design issues can only be resolved well by someone with a deep understanding of the entire project.

By the time I left Twitter, we had largely figured out what the solution space for Storm looked like. That's not to say there wasn't lots of innovation still possible – Storm has had a lot of improvements since then – but those innovations weren't necessarily surprising. A lot of the work since I left Twitter has been transitioning Storm from ZeroMQ to Netty, implementing security/authentication, improving performance/scalability, improving topology visualizations, and so on. These are all awesome improvements but all of which were already anticipated as directions for improvements back in March of 2013. To put it another way, I think "visionary-driven development" is necessary when the solution space still has a lot of uncertainty in it. When the solution space is relatively well understood, the value of "visionary-driven development" diminishes dramatically. Then having that one person as a bottleneck seriously inhibits growth of the project.

About four months before leaving Twitter, Andy Feng over at Yahoo! started pushing me hard to submit Storm to Apache. At that point I had just started thinking about how to ensure the future of Storm, and Apache seemed like an interesting idea. I met with Doug Cutting, the creator of Hadoop, to get his thoughts on Apache and potentially moving Storm to Apache. Doug gave me an overview of how Apache works and was very candid about the pros and cons. He told me that the incubator could be chaotic and would most likely be painful to get through (though in reality, it turned out to be an incredibly smooth process). Doug's advice was invaluable and he really helped me understand how a consensus-driven development model works.

In consensus-driven development, at least how it's done by many Apache projects, changes are voted into a project by a group of "committers". Typically all changes require at least two +1 votes and no -1 votes. That means every committer has veto power. In a consensus-driven project, not every committer will have a full understanding of the codebase. Many committers will specialize in different portions of the codebase. Over time, some of those committers will learn a greater portion of the codebase and achieve a greater understanding of how everything fits together.

When Storm first transitioned to a consensus-driven model, most of the committers had relatively limited understandings of the codebase as a whole and instead had various specialized understandings of certain areas. This was entirely due to the fact I had been so dominant in development – no one had ever been given the responsibility where they would need to learn more to make good decisions. By giving other people more authority and stepping back a bit, my hope was that others would fill that void. And that's exactly what happened.

One of my fears when moving to consensus-driven development was that the quality of changes would drop. And indeed, some of the changes that went in as we transitioned had some bugs in them. But this isn't a big deal. Because you'll get a bug report, and you can fix the problem for the next release. And if the problem is really bad, you can cut an emergency release for people to use. When I was personally making all development decisions, I would thoroughly test things myself and make use of my knowledge of the entire codebase such that anything that went out in a release was extremely high quality. But even then, my code sometimes had bugs in it and we would have to fix them in the next release. So consensus-driven development is really no different, except that changes may require a bit more iteration to iron out the issues. No software is perfect – what's important is that you have an active and responsive development community that will iterate and fix problems that arise.

Submitting to Apache

Getting back to the history of Storm, a few months after leaving Twitter I decided I wanted Storm to move to a consensus-driven development model. As I was very focused on my new startup, I also wanted Storm to have a long-term home that would give users the confidence that Storm would be a thriving project for years to come. When I considered all the options, submitting Storm to Apache seemed like far and away the best choice. Apache would give Storm a powerful brand, a strong legal foundation, and exactly the consensus-driven model that I wanted for the project.

Using what I learned from Doug Cutting, I eased the transition into Apache by identifying any legal issues beforehand that would cause problems during incubation. Storm made use of the ZeroMQ library for inter-process communication, but unfortunately the licensing of ZeroMQ was incompatible with Apache Foundation policy. A few developers at Yahoo! stepped up and created a replacement based on Netty (they all later became Storm committers).

In forming the initial committer list for Storm, I chose developers from a variety of companies who had made relatively significant contributions to the project. One person who I'm super glad to have invited as a committer was Taylor Goetz, who worked at Health Market Science at the time. I was on the fence about inviting him since he hadn't contributed much code at that point. However, he was very active in the community and mailing list so I decided to take a chance on him. Once becoming a committer, Taylor took a huge amount of initiative, relieving me of many of the management burdens of the project. During incubation he handled most of the nitty-gritty stuff (like taking care of certain legal things, figuring out how to move the website over to Apache, how to get permissions for new committers, managing releases, calling for votes, etc.). Taylor later went to Hortonworks to work on Storm full-time, and he did such a good job helping shepherd Storm through the incubator that he is now the PMC chair for the project.

In September of 2013, with the help of Andy Feng at Yahoo!, I officially proposed Storm for incubation in Apache. Since we were well-prepared the proposal went through with only some minor modifications needed.

Apache incubation

During incubation we had to demonstrate that we could make releases, grow the user community, and expand the set of committers to the project. We never ran into any problems accomplishing any of these things. Once Storm was in incubation and I was no longer a bottleneck, development accelerated rapidly. People submitting patches got feedback faster and were encouraged to contribute more. We identified people who were making significant contributions and invited them to be committers.

Since incubation I've been just one committer like any other committer, with a vote no stronger than anyone else. I've focused my energies on any issues that affect anything core in Storm or have some sort of difficult design decision to work out. This has been a much more efficient use of my time and a huge relief compared to having to review every little change.

Storm officially graduated to a top-level project on September 17th, 2014, just short of three years after being open-sourced.


Building Storm and getting it to where it is now was quite a ride. I learned that building a successful project requires a lot more than just producing good code that solves an important problem. Documentation, marketing, and community development are just as important. Especially in the early days, you have to be creative and think of clever ways to get the project established. Examples of how I did that were making use of the Twitter brand, starting the mailing list a few months before release, and doing a big hyped up release to maximize exposure. Additionally, there's a lot of tedious, time-consuming work involved in building a successful project, such as writing docs, answering the never-ending questions on the mailing list, and giving talks.

One of the most amazing things for me has been seeing the huge range of industries Storm has affected. On the Powered By page there are applications listed in the areas of healthcare, weather, analytics, news, auctions, advertising, travel, alerting, finance, and many more. Reading that page makes the insane amount of work I've put into Storm feel worth it.

In telling this story, I have not been able to include every detail (three years is a long time, after all). So I want to finish by listing many of the people who have been important to Storm getting to the point it is at today. I am very grateful to all of these people: Chris Aniszczyk, Ashley Brown, Doug Cutting, Derek Dagit, Ted Dunning, Robert Evans, Andy Feng, Taylor Goetz, Christopher Golda, Edmund Jackson, Jason Jackson, Jeff Kaditz, Jennifer Lee, Michael Montano, Michael Noll, Adrian Petrescu, Adam Schuck, James Xu, and anyone who's ever contributed a patch to the project, deployed it in production, written about it, or given a presentation on it.

You should follow me on Twitter here.


The Entrepreneur Who Captivated Me

On January 12th, 2012 I went to Airbnb to give a talk about Storm at their tech talk series. It was a public event and quite a few people came. As usual at these events, there was mingling/socializing after the talk. A bunch of people approached me and one of these people was Jake Klamka.

There's nothing striking about Jake Klamka when you first meet him. He's a pretty unassuming fellow. So when he approached me after my talk, little did I know I was about to meet an astonishingly good entrepreneur – someone who's a master at getting help from others. Over the next few years, little by little, he would get me to help him more and more. And I would be amazed at what an awesome company he would build.

People approach me all the time with various requests. And the vast majority of them go about it the wrong way. For example, I often get emails from complete strangers asking to meet with me for an hour or have a conference call with me. The reason they want to meet is typically vague – "I want to find out more about Storm" or "I'd like to get your advice on dealing with Big Data". In the past I have taken some of these meetings, and they are almost always a waste of time for both of us. They typically come to me with generic questions that could easily have been answered by doing a Google search, reading my blog, or watching one of my talks. I want to help people, but coming to me with questions you could easily have answered yourself with a little research is not productive.

Jake puts a lot more thought into how he asks for help. A lot of what I learned about getting help from others I learned from how Jake slowly reeled me in to helping him. When he approached me at the Airbnb meetup, he didn't start with what he was working on and what he wanted from me. On the contrary, he asked about me and what I was up to. He told me he'd been following my work for a while and asked me how I liked working at Twitter post-acquisition (I was part of BackType, which was acquired by Twitter six months before). He asked me about how I came up with Storm. He did not attempt to shift the conversation to himself at all. A clever fellow, this Jake.

Naturally, I felt compelled to know more about this person who was flattering me by showing so much interest in my life. So I asked him what he was up to and what he was working on. He smoothly described to me how he had gone through Y Combinator but was now working on something new. He told me about his idea to form a fellowship program (which he later called Insight Data Science) to teach PhD's with physics, math, and other backgrounds the skills necessary to become data scientists. The idea was that PhD's represented a surplus of very smart people without the greatest job prospects, and at the same time there was huge unsatisfied demand for data scientists in industry. The PhD's already had the statistical skills necessary for data science – they just needed to learn the tooling of the industry. It was a smooth pitch and I told him I thought it was a pretty interesting idea.

In retrospect, I believe that he came prepared to pitch me on this idea, to gauge my interest, and to see if he could get me involved in helping out with the program. But here's the key: he didn't force himself on me. Rather than approach me and pitch me, he took advantage of natural social dynamics and waited until I asked about him. Not until then did he give his pitch. And not until I told him I liked his idea did he make his ask.

Jake asked if he could come by Twitter for lunch one day and go over his plans for Insight. It was a minimal commitment and one which he made as convenient as possible for me (by coming to where I work). Unlike the requests for 1 hour meetings I get from completely random people, this was appealing because it was a request from someone who had demonstrated he was doing something interesting and who was asking for specific feedback. So I was glad to take the meeting.

Jake came by Twitter for lunch and was super prepared. He brought a slide deck on his iPad and very clearly took me through the specifics of how Insight would work, how he was going to recruit students, and how he would connect them to industry. By making everything so tangible he made it incredibly easy for me to give him feedback. I told him it was important that he get companies involved in the program from the start because of how crucial it was to keep the students connected with the needs of industry. It was an extremely productive meeting and I felt like I provided a lot of value. I was impressed with Jake and outright told him I'd be happy to help him more in the future.

Over the next few months I helped Jake out with a few things – mostly related to connecting him to people within Twitter who might want to hire out of his program. That summer he launched the first batch of the program and asked if I would come down to Palo Alto to give a talk to the students. I was happy to do so.

That went well and the first batch of the program went on to be a huge success. Insight achieved – and continues to achieve – a 100% placement rate for all its fellows. Fellows get hired at all the top companies like Netflix, Twitter, Facebook, LinkedIn, Square, Microsoft, etc. It's a really impressive program. The most mind-boggling part about it is that it's completely free for the fellows.

Jake and I occasionally kept in touch over the next year and a half. Then this February, he emailed me to update me on the program (it had grown a lot) and ask if he could meet with me again to talk about his idea for a new data engineering program. Unlike data science, this program would be geared towards people who want to build the pipelines underlying data products and the infrastructure that data scientists use. Like usual, Jake made the meeting as convenient as possible for me by coming to meet me near where I live in San Francisco.

I was skeptical about the program at first. While the data science program took advantage of a distinct imbalance – a surplus of smart PhD's and a deficit of data scientists – I wasn't entirely sure who were the target candidates for data engineering. However, as we talked I realized what a good idea this was. Data engineers are in huge demand, and getting skilled at data engineering is a ticket to a job that's not only high paying, but also supremely interesting – imagine playing with the datasets of Twitter, Netflix, Spotify, or Khan Academy. Everyone seems to have a "Big Data" problem now, and data engineers are crucial to solving those problems.

Besides that, it also became apparent that Insight is valuable in and of itself, not just as a ticket to a job in industry. Jake has created a cradle of creativity – the entire program is structured around fellows doing self-directed projects and helping each other learn and grow. The data engineering program seemed like a fantastic opportunity for any programmer to sharpen their skills and get an awesome job, whether a new grad or someone experienced looking to transition their career.

In our meeting Jake's biggest question mark was what kinds of people would make good fellows for the data engineering program. I felt strongly that you can't turn a non-programmer or an inexperienced programmer into a good programmer in 6 weeks, so the candidates he went after should already be good programmers. The program should be focused on teaching fellows the tools and techniques of data engineering, not on teaching them basic engineering skills. This was a big difference from the data science program, and Jake and I spent a long time talking about how the interview process for Insight Data Engineering should work.

A few weeks later, Jake asked if I would become an official advisor to Insight. I accepted instantly. I found Insight to be hugely impressive, and Jake had long ago proven to me how resourceful and effective he is. Most importantly, it's always a joy to work with Jake because he makes sure that every meeting is productive. He asks for specific feedback and never asks overly vague questions. As an advisor to the program, I'm helping fellows with their projects and holding sessions to teach fellows various aspects of data engineering (like Lambda Architecture and Storm). I'm excited to see the program improve and evolve as it takes on more batches of students in the future.

It's fun to look back at how Jake reeled me in to helping out with Insight. He didn't just come up to me with a list of things he wanted – instead he first built my interest in him and what he was doing. Then he gradually escalated the commitments he asked for – first a lunch meeting where I work, then some introductions, then a visit to the Insight office, then help with evaluating candidates, then a regular commitment to spend time with the fellows. At each stage he made sure I was sufficiently interested such that taking on the commitment would be something I'm excited about – and not just a favor. That's what makes Jake such a great entrepreneur – he makes people who were total strangers want to go out of their way to help him. I think everyone can learn from Jake's example, as I certainly have.


Why we in tech must support Lawrence Lessig

I'm an entrepreneur and a programmer. I've been fortunate to work in an industry that has seen incredible growth the past 10 years. It's amazing that an entrepreneur can launch services that reach millions of people with very, very small amounts of capital. Startups can compete with established services on a level playing field because the internet does not discriminate between different services. The internet is neutral. This has enabled an explosion of services that has provided massive amounts of value to the entire world.

I'm not here to convince you of the importance of net neutrality. This has been done thoroughly here, here, here, and here. Instead I want to talk about a much deeper issue.

Losing net neutrality would be extremely harmful to our society and our economy, and it's not hard to see this. And yet, the government seems to have a lot of trouble understanding this. The government could fix this problem instantly by reclassifying the internet from an "information service" to a "telecommunications service". Why don't they?

Whenever I'm fixing a bug in a software system, it's important I get to the root cause of the bug. If I only fix one particular manifestation of the bug, the bug will just pop up again in a different form. Something similar is going on with our government. I believe that Lawrence Lessig has thoroughly and articulately elucidated the root cause of our government's seeming stupidity, and that anything less than fixing that root cause is a losing battle. But I'll get back to that.

Government "stupidity"

It's not just net neutrality. Our government has made what appears to be stupid decisions on many, many issues. Consider copyright extension. The government consistently and frequently extends copyrights on existing works, preventing those works from entering the public domain. It is wrong in every single case to extend a copyright, as the only reason for having copyrights is to incentivize the creation of works in the first place.

Now consider your taxes. Have you ever wondered why you have to fill out a tax return? For most people, the government already has all the information needed to automatically fill out your taxes. A bill was proposed in 2011 that would have made this a reality – to give you a pre-filled tax return that you could then adjust on your own if there were any problems. It would have been cheap to implement, voluntary to use, and would have saved billions of dollars in tax preparation costs and millions of hours in time. Pre-filled tax returns have been shown to work in other countries like Denmark, Sweden, and Spain. Yet, the bill died.

These are just simple, easy-to-understand examples. There are endless examples of terrible government decisions in climate change, healthcare, the lead-up to and response to the 2008 financial crisis, agricultural subsidies, and pretty much any other area you can think of.

It's easy and convenient to conclude that these decisions are the result of members of our government being stupid. If that's the conclusion you've made, I don't blame you. It's a conclusion I've made in the past, and it made me completely apathetic to government. It made me believe that's just the way government is, and we have to work around that to live our lives.

The root cause

But let's consider an alternative explanation. It's not easy to become a representative or a senator. Getting elected is a brutally competitive game – you don't win by being stupid. You get there by being smart and cunning.

I believe that the decisions made by our elected officials are calculated and smart – from their perspective. They want you to think they're dumb, that that's just the way government works. The more apathy in the citizenry and the more people believe that's just the way things are, the less likely people will challenge the status quo.

That brings us back to Lawrence Lessig. I don't think anyone has done a better job at explaining what is going on with our government than he has.

His explanation is very simple. In order to get elected in America, you need a lot of money to run an election campaign. The vast majority of that campaign money comes from a tiny fraction of Americans - only 0.05% of people. If you don't get the support of those people, you won't have enough money to realistically run for office. So that tiny fraction of funders selects the people we vote for in elections. And naturally, the funders only select people who favor their interests.

So you end up with a government that represents the interests of the funders and not the people as a whole. You end up with a government full of people who are good at catering to the interests of people who can give them the money to get elected/re-elected. The system selects for those kinds of people.

This is how companies like Comcast and AT&T influence policies enacted by Congress. They spend tons of money lobbying to get laws passed to limit competition and preserve their monopoly status. They lobbied extremely heavily to get Congress to put pressure on the FCC to prevent it from classifying the Internet as a telecommunications service, which would have ensured net neutrality. This article and this video explain the connection between lobbying and net neutrality very well.

There is only one solution to this problem – funding for elections must be spread out so that "the funders" equals "the people". Lessig has many proposals for how to do this. One idea is for everyone to get a small voucher (say $50) to donate to candidates of their choice. That $50 comes out of the first $50 in taxes you paid that year. This would create many billions of campaign dollars every year, more than enough to spread the influence of money across the whole population.

Lessig has explained the influence of money in politics much better than I ever could, so check out his excellent TED talk and his extremely well-written book.

Our apathy is misguided

When I talk to people about this problem, almost no one disagrees. There's almost universal agreement that government is beholden to special interests, that the way campaigns are financed is hugely corrupting. And yet almost everyone is apathetic and believes it's impossible to change the system. I'm reminded of George Orwell's "Animal Farm". After enough time, a way of life becomes the way of life. You forget that things can work differently, that things have worked differently in the past.

I don't want to live in a world where the government is constantly trying to trade off the public good for the benefit of a special few. Instead I want to live in a world where the government represents and works for the people. I know, that sounds naive. But if you don't strive for it, you most assuredly won't achieve it. Much, much crazier things have happened.

This issue is so important and touches so many aspects of our society that I believe it's our duty as citizens to fight for change any way we can. We have to support people who are working day and night on this, who have excellent ideas on how to achieve reform. Lawrence Lessig is certainly one of those people.

Supporting Lawrence Lessig

Lawrence Lessig's latest initiative is called the MayDay PAC, the "SuperPAC to end all SuperPACs". The MayDay PAC is raising money to get people elected who are committed to eliminating the unbalanced influence of special interests. Once you get past the irony of it, it's actually a brilliant idea. Check it out and read the FAQ.

So yes, at the end of all this I'm asking you to make a political donation – $10, $100, or whatever you can afford. I'm not involved in politics in any way and am not affiliated with Lessig or the MayDay PAC. I'm just a concerned citizen. At worst, you'll be losing a couple bucks. At best, you'll be helping enable one of the most important reforms, if not the most important reform, in a generation.

If you're still not convinced, here are links to the MayDay PAC and many of the excellent talks and writings produced by Lessig:


The inexplicable rise of open floor plans in tech companies

Update: I originally quoted the average price of office space as $36 / square foot / month, where in reality it's per year. So I was accidentally weakening my own argument! The post has been updated to reflect the right number.

The "open floor plan" has really taken over tech companies in San Francisco. Offices are organized as huge open spaces with row after row of tables. Employees sit next to each other and each have their own piece of desk space. Now, I don't want to comment on the effectiveness of open floor plans for fields other than my own. But for software development, this is the single best way to sabotage the productivity of your entire engineering team.

The problem

Programming is a very brain-intensive task. You have to hold all sorts of disparate information in your head at once and synthesize it into extremely precise code. It requires intense amounts of focus. Distractions and interruptions are death to the productivity of a programmer. And an open-floor plan encourages distractions like nothing else.

First of all, you're in a room with dozens and dozens of other people. That's naturally going to be very noisy. People are talking all over the room. The person next to you is chewing on potato chips. You constantly hear people getting up and walking around. Hopefully you at least have carpeted floors, or else it's going to be REALLY loud. All day long doors to conference rooms are opening and closing. Noise breaks concentration. Broken concentration breaks productivity. If you're lucky you're one of those people who can work by drowning out the noise with music through your headphones. Otherwise, you're out of luck.

Even worse than the noise is the fact that you are very easy to interrupt in an open floor plan. People tap you on the shoulder to ask you questions. Now maybe you're different than me, but I find it pretty hard to focus when someone starts talking to me as I'm working. It's frequent enough in an open floor plan that even just the potential of that happening hurts my concentration.

There's evidence that open plan offices make it more likely for people to get sick. This is not really that surprising as with a big open space you'd expect germs to spread more easily. Besides that, the lack of privacy also bothers a lot of people.

I can't tell you how many times I've heard this comment from programmers: "I get most of my work done once most people have left the office and I can concentrate". This translates to "I can't do work during normal business hours!" Think about that. This is truly absurd. You should be able to work during working hours.

The "collaboration" justification

The most common justification I hear for the open floor plan is that it "encourages collaboration". Now it's true, the open floor plan does create the occasional opportunity for collaboration. You might overhear someone a row over talking about how they need to do some particular load testing of a database, and then you jump in with how you built such a tool for such a purpose. Everyone says "Hurrah!" and something truly valuable occurred. But in reality, these moments of true serendipity are few and far between.

The tradeoff for these moments is that all your working hours are now sabotaged by non-stop distractions that ruin your productivity. The primary task of a programmer is writing code, which involves sitting at a desk and thinking and typing. Code is not written in these supposed spontaneous acts of collaboration. A working environment should make your primary tasks as easy as possible, which for programming means encouraging focus and concentration.

Cost-effectiveness of open floor plans

Let's be honest though. Open floor plans are done because they're the most cost-effective way to squeeze as many people into one space as possible. Space is expensive, so you should make the best use of it that you can. But does minimizing space REALLY minimize cost? Programmers aren't call center workers. They're very expensive, so sabotaging their productivity is also very expensive. Let's play with some numbers to see how this plays out.

I'm not sure on the exact numbers, but in San Francisco a programmer probably costs you on average $100K a year in salary. With benefits the total cost is in the neighborhood of $120K. So a programmer is a $10K / month investment.

The average price of office space in San Francisco is $36 per square foot per year, or $3 per square foot per month. But let's say the average rate is $10 per square foot per month, since more expensive rates favor open floor plans, and I want to drive the point home. If you actually look at a sample of rates for New York and San Francisco, you'll see that in reality almost no offices come anywhere near $10 per square foot per month.

With an open floor plan, let's say a programmer takes up an average of 6ft x 6ft of space. So the cost of space per programmer per month is 10 * 6 * 6 = $360 / month. This means the cost per programmer including space is $10360 / month.

Let's say that in a non-open-floor-plan office each programmer requires four times the space of an open-floor-plan environment – an average of 12ft x 12ft. That's actually quite a lot of space. In this case, the cost of space per programmer comes to 10 * 12 * 12 = $1440 / month, making the cost of a programmer including space $11440 / month. This makes a non-open-floor-plan programmer 10.4% more expensive than an open-floor-plan programmer, or, put another way, an open-floor-plan programmer is 9.4% less expensive than a non-open-floor-plan programmer.

So on a per-programmer basis, if the open floor plan lowers productivity by less than 9.4%, it's worth it. But this seems overly optimistic. In my experience working in an open floor plan, my productivity is cut by half or worse. Plus there are things I literally am unable to do in such an office because they require too much focus. So my own estimate of the productivity decrease in such an office is closer to 75%!

This analysis doesn't even take into account that if your programmers are more productive, you need fewer programmers. You only need half the programmers if your programmers are twice as effective. That cuts your space needs in half, vastly skewing the numbers further in favor of non-open-floor plans. Unless my estimates of the productivity decrease or space needs are way, way off, the open floor plan is not even close to worth it.
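
To make the numbers above easy to check (or to replay with your own assumptions), here's a minimal sketch in Clojure. The salary, rent, and space figures are just the assumptions from the preceding paragraphs, not real survey data.

;; Rough cost model for the comparison above. All inputs are the
;; assumptions from the text, not measured data.
(def programmer-cost 10000) ; $ / programmer / month (salary + benefits)
(def rent 10)               ; $ / sqft / month (deliberately high-balled)

(defn cost-per-programmer [sqft]
  (+ programmer-cost (* rent sqft)))

(def open-plan   (cost-per-programmer (* 6 6)))   ; => 10360
(def private-ish (cost-per-programmer (* 12 12))) ; => 11440

;; Fraction by which the open plan is cheaper:
(double (/ (- private-ish open-plan) private-ish)) ; => ~0.094, i.e. ~9.4%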


I don't know what the "best" arrangement is for an office for programmers. Perhaps it's 1 person offices, 2 person offices, or 3-5 person offices. Perhaps offices combined with open "collaboration areas" is the right approach. But certainly the open floor plan as practiced today is not it.

It might be possible to establish a culture that enforces a library-like environment on an open floor plan. If talking too loudly got you "shushed", then it would certainly be a lot quieter and easier to concentrate. I don't know of any company that has created a culture like this, but it's certainly an interesting idea. Personally though I think that creating a good work environment through physical means will be much more robust than doing so through cultural means.

Of course, having an alternative environment that allows for focus and concentration does not mean that spontaneous collaboration goes away. Because, surprise! – your employees still interact with each other at lunch, in the kitchen, in meetings, and in all the other natural places that people socialize.


Another thing that people like about the open floor plan is that it "looks good" and has the "startup feel". Well, I don't know about you, but to me the startup feel is about shipping quickly and getting things done. I would greatly, greatly prefer an office environment that helped me do that rather than get in my way.

The open floor plan really only works when you're really small, when it's essentially equivalent to one of those "5 man offices". But once you start to get bigger, say the 15 person range, it starts becoming unwieldy. Once you're big enough and have the resources to customize the office environment, I think it's incredibly important to find an office environment that works better.

Here's an idea. Establish a culture that encourages employees to work from home as much as they want. They should really understand that face time is completely irrelevant. Then, measure how many people come in each day. Do polls to see how productive people find the office. If the numbers are low, then there's something wrong with your office environment. This puts the burden on you, as the employer, to make an office environment that people actually want to work in.

You should follow me on Twitter here.


Interview with "Programmer Magazine"

I was recently interviewed for "Programmer Magazine", a Chinese magazine. The interview was published in Chinese, but a lot of people told me they'd like to see the English version of the interview. Due to the Google translation being, ahem, a little iffy, I decided to just publish the original English version on my blog. Hope you enjoy!

What drew you to programming and what was the first interesting program you wrote?

I started programming when I was 10 years old on my TI-82 graphing calculator. Initially I started programming because I wanted to make games on my calculator – and also because I was bored in math class :D. The first interesting game I made on my calculator was an archery game where you'd shoot arrows at moving targets. You'd get points for hitting more targets or completing all the targets faster. A couple years later I graduated to programming the TI-89 which was a huge upgrade in power. I remember how the TI-82 only let you have 26 variables (for the characters 'a' through 'z') and thinking how incredible it was that the TI-89 let you have as many variables as you want.

What do you do to improve your skills as a programmer?

I get better by doing a lot of programming and trying new things. One of the best ways to become a better programmer is to learn new programming languages. By learn I mean more than just picking up the syntax: I mean understanding the language's idioms and writing something substantial in it. For me, learning Clojure made me a much better programmer in all languages.

Could you talk about your experience before joining BackType?

I got my bachelor's and master's in Computer Science at Stanford University with a focus on software theory. So I did a lot of algorithms and proofs and so on. Probably the best thing I did at Stanford was choose classes not so much by the subject material but by the professor. When I found a professor who was a great teacher I would take as many classes with that professor as possible. For example, one of the greatest teachers I've ever had is Professor Tim Roughgarden. I took a bunch of "algorithmic game theory" classes with him – algorithmic game theory is basically the intersection of economics and computer science. I took the classes not so much for the material but to improve my problem solving skills. Professor Roughgarden had an incredibly coherent and disciplined way of breaking down extremely difficult problems and making them easy to understand. Learning those skills has made me a much better problem solver in all scenarios, as well as being a much better communicator of difficult concepts.

You once said leaving Twitter was a tough decision. Could you tell us why you decided to start your own company? What goals do you want to achieve?

I had a pretty great situation at Twitter, having my own team and working full-time on a project I started. But when I thought of the idea for my company, it was so compelling I just couldn't stop thinking about it. So I felt that if I didn't start this company, I would regret it for the rest of my life.

What are the main lessons you learned in the last few years of your professional career?

Feedback is everything. Most of the time you're wrong, and feedback is the only way to realize your mistakes and help you become less wrong. This applies to everything. In product development, get your product out there as soon as possible so you can get feedback and see what works and what doesn't. In many cases you don't even need to build anything – a link to a "feature" that actually goes to a survey page can give you the feedback you need to test your idea.

In managing a team, it's really important to have feedback on all the processes you use. At BackType we'd have a once-a-month meeting to discuss our processes and whether they were effective or too restrictive. This caused us to introduce standups, and then remove standups when we didn't feel they were that useful to us. We used that process to go from monthly meetings to biweekly meetings to weekly meetings, then back to biweekly meetings.

In your blog you said, "I'm always happy to give advice or connect with people doing interesting work." What interesting projects have you seen, and what suggestions did you provide?

The founder of Insight Data Science approached me when he was starting it, and I think it's an absolutely terrific program. They provide a 6 week bootcamp to help math/science/physics PhD's learn programming skills so they can start a career in data science. Basically the program recognizes that there is a surplus of very smart people who don't necessarily have the most interesting job prospects, while there is a booming tech industry with a huge talent shortage of data scientists. So they bridge that gap. I was able to help them out with a couple things and I think their execution has been very impressive.

What prompted you to write the book Big Data, and what problems do you want to solve? Writing a book is a long process. What have you learned along the way?

I had developed a lot of theory and best practices about architecting big data systems that no one else was talking about. People were focused on very specific use cases, whereas I had developed rigorous, holistic approaches. A lot of the things I talk about, like being resilient to human error (something I consider to be absolutely non-negotiable) are ignored by the vast majority of industry. I think the industry will be much better off by building these systems more rigorously and less haphazardly, and I felt that this book was the right way to effect that change.

I knew that writing a book would be a lot of work, but it turned out to be significantly more work than I expected. I think my book is especially challenging because it's such a huge subject. At one point I had half the book written, but I realized I was taking the wrong approach to communicating the material, so I scrapped everything and started over. It was definitely worth it though, because based on the feedback I get from readers, they love the material and really get what I'm trying to communicate.

My editors have been absolutely invaluable in the writing process and have helped me become a much better writer. I've learned that the way I was taught in school to write is actually the complete opposite of effective communication. I was taught to make your general "thesis" statement up front, and then drill down into that general statement with supporting points and eventually specific details. It turns out that this forces the reader to do a lot of work to synthesize everything you're saying. They won't grasp the thesis up front – because they haven't read the supporting points yet. So after drilling down the reader now has to drill back "up" to connect everything. It's a convoluted way to achieve an understanding of something. A much better way to communicate is to tell a story – start with a situation the reader already understands, and then connect step by step to the ultimate general statement you want your reader to understand. Specific to general is always better than general to specific.

You have contributed to a lot of open source projects. What makes you believe in open source?

Open source benefits so many people in so many ways. When you're a startup, you're highly resource-constrained, so being able to take advantage of the work other people have done is a godsend. Lowering the cost of doing startups, of course, is highly beneficial to society. When you benefit that much from open source, you do feel obligated to give back as well. On top of that, when you open source software as a company, you benefit from other people trying it out, finding issues, and improving your software "for free".

On a personal level, open source has given me an opportunity to interact with an entire world of developers, rather than just those in whatever company I happened to be at. This has been hugely beneficial to my career, allowing me to get to know tons of awesome people and travel the world to speak at conferences.

Which person has influenced you the most?

Philosophically I'd have to say the most influential person to me is Carl Sagan. I've read most of his books and find them hugely inspirational. I think he was one of the greatest communicators of all time, and what impresses me most about him is his extreme empathy towards his audience. For example, he has quite a bit of writing about science vs. religion – but as a scientist he is not hostile towards religion or anything like that. He understands why people are religious and the value they get out of it. So when he communicates the value of science and skepticism to religious people he starts with religion = valuable as a starting point. That degree of empathy is really rare, and it's something that I'm continuously trying to improve at. He taught me that empathy is the basis of good communication.

What key points did John McCarthy share with you about his life and perspective? How did those words affect your own life and perspective?

I talked with John McCarthy for two hours when I was a sophomore in college. The most striking thing he told me was when I asked about the history of Lisp. He told me he needed a better programming language for doing AI research, so he invented Lisp for that purpose. He really didn't seem to care that much about programming languages – his real passion was AI. It struck me as exactly like how Isaac Newton invented calculus because he needed it for his physics work. The pebbles of giants really are big boulders.

When designing a software system, what process do you use? (The first step, the second step, and so on.)

I think designing a software system is entirely about learning what to build as you go. I use a technique which I call "suffering-oriented programming" in order to maximize learning and minimize wasted work. I detailed this approach on my blog. The general idea is to avoid making "general" or "extensible" solutions until you have a very deep understanding of the problem domain. Instead you should hack things out very directly to get a working prototype as fast as possible. Then you iterate and evolve and learn more about the problem domain. Once you have a good understanding of the intricacies of the problem domain, then you can redesign your solution to make it more general, extensible, etc. Finally, at the end, you wrap things up by tightening up the code and making performance optimizations. The sequence is "First make it possible. Then make it beautiful. Then make it fast."

Do you have any principles you follow when programming?

I believe strongly in immutability and referentially transparent functions as ways to vastly simplify software. Mutability creates a web of dependencies in your code – things that can change other things which then change other things – which gets hard to wrap your head around. Code is all about being able to understand what's going on, so anything you do to make that easier is a good thing. Immutability is one such technique to reduce what you need to understand about a particular piece of code to grasp it. Additionally, referentially transparent functions only depend on their arguments (and not on any other state), so they are also easier to understand.
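
To make that concrete, here's a toy illustration (not code from any of my projects): the first function depends on hidden mutable state, the second is referentially transparent and depends only on its arguments.

;; Illustrative only. This function reads a hidden, mutable atom, so the
;; same call can return different results depending on program history.
(def fee-rate (atom 0.02))

(defn total-with-hidden-state [amount]
  (* amount (+ 1 @fee-rate)))

;; This function depends only on its arguments, so understanding a call
;; site never requires knowing anything else about the program's state.
(defn total [amount fee-rate]
  (* amount (+ 1 fee-rate)))

(total 100 0.02) ; => 102.0, always, for these arguments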

Another important principle I live by is "my code is wrong". I think it's pretty clear that we don't know how to make perfect software – all the code I've ever used or written has had bugs in it. So I assume my code is wrong and design it to work anyway (with higher probability, at least). I've detailed techniques to accomplish this on my blog and in my conference talks this year.

Compared with your early years, what is the biggest change in how you program today?

Since I started off programming graphing calculators, I'd say the biggest change is using full-fledged keyboards to program instead of those tiny keypads :)

Storm was developed in a very short time, by a few developers, under a limited budget and urgent requirements. What is your secret to being so efficient?

Storm is the result of following that "suffering-oriented programming" methodology. We didn't jump into Storm out of the blue – we had been doing realtime computation at BackType for a long time by stringing workers together manually with queues. So we had a really solid understanding of our needs for realtime processing. Storm at its core is just a simple set of abstractions with a clever algorithm behind the scenes to guarantee data processing. When I thought long and hard about the realtime problems we were dealing with at the time, the design of Storm was obvious. Additionally, I had a ton of experience with Hadoop and knew some of the mistakes made in the design of that system, so I applied that learning into Storm to make it more robust.

You have interviewed a lot of programmers. What do you think the best programmers have in common?

The best programmers are obsessed with improving as programmers. They love exploring new programming languages and new ideas. Another key trait of great programmers is a "getting stuff done" mentality. It's far more important to get something working than to make the perfect design. Plus a great programmer recognizes that you can't make a perfect design without first having something working that you can learn from.

Are there any myths (things laymen believe that experts do not) or traps in data systems and Big Data?

Probably the biggest misconception I see is people placing the relational database, and associated concepts like CRUD, on a pedestal. People treat the RDBMS as if it's the ultimate in database technology, and everyone seems to be trying to recreate the RDBMS to work in Big Data land. But this ignores massive problems that have always existed with the RDBMS: they're based on mutability, so they're extremely susceptible to corruption whenever there's a bug or a human error, and they force you into a horrible situation of needing to either normalize your schema and take performance hits, or denormalize your schema and create a maintenance nightmare (among other problems). When you actually look at data systems from first principles, as I do in my book, you see that there are different ways of architecting data systems that have none of these complexities.

What problems were you trying to solve at BackType that led you to design Storm?

There were two problems. The first was how to keep our databases containing social media analytics stats up to date in realtime in a reliable way. The second was the "reach problem" – how to compute the "reach" of a URL on Twitter very quickly. The "reach" is the unique count of all the followers of all the people who tweeted a URL. It's very computationally intensive and hard to precompute. Storm turned out to be a simple abstraction which unified these seemingly unrelated use cases.

What reasons or experience made you confident that you would build Storm successfully?

The key was that we had tons of experience with realtime computation so knew the problem domain very well. So there was really no question in my mind that Storm would be successful because I had already learned a majority of the little gotchas.

Why did you choose Clojure as the development language for Storm? Could you talk about your practical experience with the language (its advantages and disadvantages)? What features would Storm lack if you had not used Clojure?

Clojure is the best language I've ever used, by far. I use it because it makes me vastly more productive by allowing me to easily use techniques like immutability and functional programming. Its dynamic, Lisp-based nature ensures that I can always mold Clojure as necessary to formulate the best possible abstractions. Storm would not be any different if I hadn't used Clojure, it just would have been far more painful to build.

From your blog, I see that you advocate writing a lot. Could you share what you do to improve your writing skills?

The only way to improve at writing is to write a lot. When other people read my writing and give me feedback, like by commenting on my blog, I carefully think about where that comment came from. If they misunderstood something then that means I'm not communicating correctly – either I'm not clear or I'm not properly anticipating reader objections (whether or not those objections are fallacious is irrelevant). By understanding why my message doesn't get through, I'm able to do a better job the next time.

I also read a lot and try to learn from great writers. As I've mentioned, Carl Sagan is one of my favorite writers and I've learned tons from reading him – and I continue to learn tons from him every time I read his work.

You started using Emacs in recent years. Could you talk about the programming tools you choose and how they affect you?

I started using Emacs because I found it to be the best environment for programming Clojure (due to its Lisp background). I've been really impressed with how powerful of a tool it is and how much it can be customized to my needs. On top of that, since it was originally written so long ago, it has an incredibly small resource footprint. That's something I really enjoy because modern IDE's tend to be such resource hogs.

Other than that, I think my setup is pretty simplistic. I use a live REPL in my Emacs for exploratory development and interactive testing. I also have tons of text files on my computer with design notes and ideas. For my todo list I literally just use a text file.


Break into Silicon Valley with a blog

I know a lot of non-technical people who would love to work in the venture-funded startup world, from consultants to finance people to other business types for which I'm not really sure exactly what it is they do. They hit obstacles trying to get into the startup world, finding that their skills are either irrelevant or hard to explain. My advice to all these people is the same:

Write a blog.

A blog can improve your life in enormous ways. Or to put it in business-speak: a blog has one of the highest ROI's of anything you can do.

Put yourself in the shoes of startups looking for talent. First off – startups are desperate for talent. The problem is that it's very difficult to identify great people – startups search through loads and loads of candidates.

Resumes and interviews only tell you so much about a person. It's really hard to stand out in a resume – you're not the only one putting over-inflated impressive-looking numbers and bullet points on your resume. And interviews are notorious for labeling bad people as good and good people as bad. So to maximize your odds of making it through the funnel, you need to show that you're awesome independent from the randomness of the normal process.

One thing you can do is write an insightful blog. This makes you look a lot more compelling. Now the reaction from startups will be "Hey, this person's really smart. We don't want to miss out on a potentially great hire, so let's put in a lot of effort to determine if there's a good fit."

A new dimension of opportunity

There's another huge advantage to having a blog besides being a mechanism to show that you're smart and insightful. A blog opens up a whole new dimension of opportunity for you. Instead of relying purely on outbound opportunities that you explicitly seek out yourself, you also will get inbound opportunities where people reach out to you with opportunities you never expected or dreamed of.

With an outbound opportunity you know exactly what you're seeking, whether it's landing a job or speaking at a conference or something else. Inbound opportunities, on the other hand, are highly uncertain. They come to you out of the blue. In my personal experience, many of the most awesome things I've done started as inbound opportunities: a book deal, flying all around the world for free to speak at conferences, a keynote at a major conference, and connecting with hundreds of awesome people who have reached out to me because of something I did or wrote publicly.

When you write a blog, you greatly increase the likelihood of getting awesome inbound opportunities. When it comes to breaking into Silicon Valley – instead of everything being on your shoulders to seek interesting companies, those companies will be reaching out to you.

A great phrase I've heard for this is "increasing your luck surface area". By providing value to people publicly, like writing insightful posts on a blog, you open yourself up to serendipitous, "lucky" opportunities.

Getting readers

Besides writing smart posts, you also need people to read your writing. Here are a few tips for accomplishing that.

First off, the title of a blog post is incredibly important. In very few words, you need to sell your potential reader on your blog post being worth their time. I've found the best titles are relevant to the potential reader, somewhat mysterious, and non-generic. Titles are definitely an art form, so you should think hard about how you'll name your posts. Sometimes I wait days to publish a post because I haven't thought of a compelling enough title.

Second, I highly recommend using Twitter as a distribution platform for your blog posts. The combination of Twitter and blogging leads to a beautifully virtuous cycle: your blog increases your Twitter following, and as your Twitter following grows you increase the reach of your blog. I consider Twitter to be the greatest professional networking tool ever devised – I follow people who tweet/blog interesting things and they follow me for the same reason. Then when I go to conferences I seek out the people who I know and respect from their online presence. When we meet, we already know a lot about each other and have a lot to talk about.

Lastly, you should embrace the online communities who will care about your blog. In Silicon Valley, the most important community is Hacker News. Hacker News is widely read in Silicon Valley by programmers, entrepreneurs, and investors. It can drive a lot of readers to your blog in a short amount of time.

Initially, it may be hard for you to get readers. Getting your posts on Hacker News is very much a crapshoot, and initially you'll have too small of a Twitter following to get that much distribution. But occasionally you'll write something smart that gets on Hacker News and gets shared around. Over time as your writing and distribution improves getting readers gets easier and easier.

What to write about

If you don't think you have anything to write about, then let me ask you a question. Do you really have that low of an opinion of yourself? Do you really think you have nothing interesting that you can share with the world? There's tons of stuff that you can write about that you don't even know to share. You have a ton of knowledge that you don't realize other people don't know because you spend all your time in your own head. Tell stories of times that you hustled. Write about the dynamics of big companies. Write case studies of anything related to running a business. Analyze the market for interesting new technologies (e.g. 3D printing, Bitcoin, etc). There's so much that you can write about.

Once you start blogging, you'll become attuned to random ideas you have throughout the day that would make good blog posts. Most of my blog ideas start off as email reminders to myself.

Final thoughts

If you haven't blogged before, you're going to suck at first. Being accurate, precise, and insightful is not enough. You have to learn how to hook people into your posts and keep the post engaging. You'll learn about the glorious world of internet commenting, where people constantly misinterpret what you say and apply very fallacious reasoning to your posts. You'll see people trash your ideas on Hacker News even though it's clear they didn't read your entire post. Sometimes they comment having only read the title! You'll learn over time different ways to structure the same information in order to minimize misinterpretation. You'll learn to anticipate fallacious reasoning and preemptively address those fallacies.

With writing, practice most definitely makes perfect. I sucked at writing at first, but I quickly improved.

A lot of people say they "don't have time to write." To be blunt, I think comments like this are the result of laziness and self-delusion. Writing a blog is really not that much work. You really can't find a couple hours to pump out a blog post? Just occasionally, instead of going out to the bar or seeing a movie or going surfing or doing whatever it is you do for fun, try writing. The potential benefits relative to the investment are MASSIVE. I haven't even discussed all the other benefits which on their own make blogging worthwhile.

Of course, writing isn't the only thing you can do to help yourself break into Silicon Valley. But it's an enormously easy way to make yourself stand out and open yourself to opportunities you never expected.

You should follow me on Twitter here.


Principles of Software Engineering, Part 1

This is the first in a series of posts on the principles of software engineering. There's far more to software engineering than just "making computers do stuff" – while that phrase is accurate, it does not come close to describing what's involved in making robust, reliable software. I will use my experience building large scale systems to inform a first principles approach to defining what it is we do – or should be doing – as software engineers. I'm not interested in tired debates like dynamic vs. static languages – instead, I intend to explore the really core aspects of software engineering.

The first order of business is to define what software engineering even is in the first place. Software engineering is the construction of software that produces some desired output for some range of inputs. The inputs to software are more than just method parameters: they include the hardware on which it's running, the rate at which it receives data, and anything else that influences the operation of the software. Likewise, the output of software is more than just the data it emits and includes performance metrics like latency.

I think there's a distinction between programming a computer and software engineering. Programming is a deterministic undertaking: I give a computer a set of instructions and it executes those instructions. Software engineering is different. One of the most important realizations I've had is that while software is deterministic, you can't treat it as deterministic in any sort of practical sense if you want to build robust software.

Here's an anecdote that, while simple, hits on a lot of what software engineering is really about. At Twitter my team operated a Storm cluster used by many teams throughout the company for production workloads. Storm depends on Zookeeper to store various pieces of state relating to Storm's operation. One of the pieces of state stored is information about recent errors in application workers. This information feeds a Storm UI which users look at to see if their applications have any errors in them (the UI also showed other things such as statistics of running applications). Whenever an error bubbles out of application code in a worker, Storm automatically reports that error into Zookeeper. If a user is suppressing the error in their application code, they can call a "reportError" method to manually add that error information into Zookeeper.

There was a serious omission in this design: we did not properly consider how that reportError method might be abused. One day we suddenly received a flood of alerts for the Storm cluster. The cluster was having serious problems and no one's application was running properly. Workers were constantly crashing and restarting.

All the errors were Zookeeper related. I looked at the metrics for Zookeeper and saw it was completely overloaded with traffic. It was very strange and I didn't know what could possibly be overloading it like that. I took a look at which Zookeeper nodes were receiving the most API calls, and it turned out almost all the traffic was coming to the set of nodes used to store errors for one particular application running on the cluster. I shut that application down and the cluster immediately went back to normal.

The question now was why that application was reporting so many errors. I took a closer look at the application and discovered that all the errors being reported were null pointer exceptions – a user had submitted an application with a bug in it causing it to throw that exception for every input tuple. In addition, the application was catching every exception, suppressing it, and manually calling reportError. This was causing reportError to be called at the same rate at which tuples were being received – which was a lot.

An unfortunate interaction between two mistakes led to a major failure of a production system. First, a user deployed buggy, sloppy code to the cluster. Second, the reportError method embodied an assumption that errors were rare and therefore the amount of traffic to that method would be inconsequential. The user's buggy code broke that assumption, overloading Zookeeper and causing a cascading failure that took down every other application on the cluster. We fixed the problem by throttling the rate at which errors could be reported to Zookeeper: errors reported beyond that rate would be logged locally but not written to Zookeeper. This made reportError robust to high traffic and eliminated the possibility of cascading failure due to abuse of that functionality.
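
The actual fix lives inside Storm's codebase, but the idea is simple enough to sketch. Here's a minimal, hypothetical version of a rate-limited reporter in Clojure – the report-fn, log-fn, and limits are made up for illustration and are not Storm's real API:

;; Hypothetical sketch of the throttling idea, not Storm's actual code.
;; Allow at most max-reports error reports per window-ms; anything beyond
;; that gets logged locally instead of being written to Zookeeper.
(defn make-throttled-reporter [report-fn log-fn max-reports window-ms]
  (let [state (atom {:window-start (System/currentTimeMillis) :count 0})]
    (fn [error]
      (let [now (System/currentTimeMillis)
            {n :count} (swap! state
                         (fn [{:keys [window-start] :as s}]
                           (if (> (- now window-start) window-ms)
                             {:window-start now :count 1}
                             (update s :count inc))))]
        (if (<= n max-reports)
          (report-fn error)   ; within budget: report to Zookeeper
          (log-fn error)))))) ; over budget: local log only

;; Usage sketch (both functions passed in here are hypothetical):
;; (def report-error (make-throttled-reporter write-to-zk! log-locally! 20 10000))
;; (report-error some-exception)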

As this story illustrates, there's a lot of uncertainty in software engineering. You think your code is correct – yet it still has bugs in it. How your software is actually used differs from the model in your head when you wrote the code. You made all sorts of assumptions while writing the software, some of which are broken in practice. Your dependencies, which you use as a black box, randomly fail due to a misunderstanding of their functional input range. The most salient feature of software engineering is the degree to which uncertainty permeates every aspect of the construction of software, from designing it to implementing it to operating it in a production environment.

Learning from other fields of engineering

It's useful to look at other forms of engineering to learn more about software engineering. Take bridge engineering, for example. The output of a bridge is a stable platform for crossing a chasm. Even though a bridge is a static structure, there are many inputs: the weight of the vehicles crossing, wind, rain, snow, the occasional earthquake, and so on. A bridge is engineered to operate correctly under certain ranges of those inputs. There's always some magnitude of an earthquake for which a bridge will not survive, and that's deemed okay because that's such a low probability event. Likewise, most bridges won't survive being hit by a missile.

Software is similar. Software operates correctly only within a certain range of inputs. Outside those inputs, it won't operate correctly, whether it's failure, security holes, or just poor performance. In my Zookeeper example, the Zookeeper cluster was hit with more traffic than it could handle, leading to application failure. Similarly, a distributed database can only handle so many hardware failures in a short amount of time before failing in some respect, like losing data. That's fine though, because you tune the replication factor until the probability of such an event is low enough.

Another useful field to look at is rocket engineering. It takes a lot of failure and iteration to build a rocket that works. SpaceX, for example, had three failed launch attempts before finally reaching orbit. The cause of failure was always something unexpected, some series of inputs the engineers hadn't accounted for. Rockets are filled to the brim with telemetry so that failures can be understood, learned from, and fixed. Each failure lets the engineers understand the rocket's input ranges a little better and engineer it to handle a larger part of the input space. A rocket is never finished – you never know when some low-probability series of inputs you haven't experienced yet will lead to failure. STS-107 was the 113th launch of the Space Shuttle, yet it ended in disaster.

Software is very similar. Making software robust is an iterative process: you build and test it as best you can, but inevitably in production you'll discover new areas of the input space that lead to failure. Like rockets, it's crucial to have excellent monitoring in place so that these issues can be diagnosed. Over time, the uncertainty in the input space goes down, and software gets "hardened". SQL injection attacks and viruses are great examples of things that take advantage of software that operates incorrectly for part of its input space.

There's always going to be some part of the input space for which software fails – as an engineer you have to balance probabilities and cost tradeoffs to determine where to draw that line. For all of your dependencies, you'd better understand the input ranges within which they operate in spec, and design your software accordingly.

Sources of uncertainty in software

There are many sources of uncertainty in software. The biggest is that we just don't know how to make perfect software: bugs can and will be deployed to production. No matter how much testing you do, bugs will slip through. Because of this fact of software development, all software must be viewed as probabilistic. The code you write only has some probability of being correct for all inputs. Sometimes seemingly minor failures will interact in ways that lead to much greater failures like in my Zookeeper example.

Another source of uncertainty is the fact that humans are involved in running software in production. Humans make mistakes – almost every software engineer has accidentally deleted data from a database at some point. I've also experienced many episodes where an engineer accidentally launched a program that overloaded a critical internal service, leading to cascading failures.

Another source of uncertainty is what functionality your software should even have – very rarely are the specs fully understood and fleshed out from the get-go. Instead you have to learn as you go, iterate, and refine. This has huge implications for how software should be constructed and creates tension between the desire to create reusable components and the desire to avoid wasted work.

There's uncertainty in all the dependencies you're using. Your dependencies will have bugs in them or will behave unexpectedly for certain inputs. The first time I hit the file handle limit on Linux was an example of not understanding the limits of a dependency.

Finally, another big source of uncertainty is not understanding the range of inputs your software will see in production. This leads to anything from incorrect functionality to poor performance to security holes like injection or denial of service attacks.

This is by no means an exhaustive overview of sources of uncertainty in software, but it's clear that uncertainty permeates all of the software engineering process.

Engineering for uncertainty

You can do a much better job of building robust software by being cognizant of its uncertain nature. Over the years I've learned many techniques for designing software better given the inherent uncertainties. I think these techniques should be part of the bread-and-butter skills of any software engineer, but too many engineers fall into the "software is deterministic" reasoning trap and fail to account for the implications of unexpected events happening in production.

Minimize dependencies

One technique for making software more robust is to minimize what your software depends on – the fewer moving parts, the better. Minimizing dependencies is more nuanced than just not depending on System X or System Y; it also means minimizing dependencies on particular features of the systems you do use.

Storm's usage of Zookeeper is a good example of this. The location of all workers throughout the cluster is stored in Zookeeper. When a worker gets reassigned, other workers must discover the new location as quickly as possible so that they can send messages to the correct place. There are two ways for workers to do this discovery: the pull method and the push method. In the pull method, workers periodically poll Zookeeper to get the updated worker locations. In the push method, a Zookeeper feature called "watches" is used to send the information to all workers whenever the locations change. The push method propagates the information immediately, making it faster than the pull method, but it introduces a dependency on another feature of Zookeeper.

Storm uses both methods to propagate worker location information. Every few seconds, Storm polls for updated worker information. In addition, Storm uses Zookeeper watches as an optimization to get the location information as fast as possible. This design ensures that even if the Zookeeper watch feature fails to work, a worker will still get the correct location information (albeit a bit slower in that particular instance). So Storm is able to take advantage of the watch feature without being fundamentally dependent on it. Most of the time watches will work correctly and information will propagate quickly, but when they fail Storm will still work. This design turned out to be farsighted, as there was a serious bug in watches that would have affected Storm.
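As a rough illustration, here's a sketch in Clojure of what combining the two methods can look like. The read-worker-locations and set-locations-watch! functions are hypothetical stand-ins for the actual Zookeeper read and watch-registration calls, not Storm's real API.

(def worker-locations (atom {}))

(defn refresh-locations! [read-worker-locations]
  (reset! worker-locations (read-worker-locations)))

(defn start-location-discovery!
  [read-worker-locations set-locations-watch! poll-interval-ms]
  ;; Pull: poll on a timer, so correctness never depends on watches working
  (future
    (while true
      (refresh-locations! read-worker-locations)
      (Thread/sleep poll-interval-ms)))
  ;; Push: the watch is only an optimization; if it silently fails,
  ;; the next poll picks up the change anyway
  (set-locations-watch!
    #(refresh-locations! read-worker-locations)))

Even if the watch never fires, every worker converges on the correct locations within one poll interval; the watch only shortens that delay.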

There's always a tradeoff between minimizing dependencies and minimizing the amount of code you need to write to implement your application. In this case, the dual approach to location propagation was worth it because it took only a very small amount of code to achieve independence from that feature. On the other hand, removing Zookeeper as a dependency entirely would not have been a good idea, as replicating that functionality would have been a huge amount of work and less reliable than just using a widely used open-source project.

Lessen probability of cascading failures

A cascading failure is one of the worst things that can happen in production – when it happens it feels like the world is falling apart. In my experience, one of the most common causes of cascading failures is an accidental denial-of-service attack like the one in my reportError example. The root cause in these cases is a failure to respect the functional input range of components in your system. You can greatly reduce cascading failures by making interactions between components explicitly respect those input ranges, using self-throttling to avoid accidentally DoS'ing a component. This is the approach I used in my reportError example.

Another great technique for avoiding cascading failures is to isolate your components as much as possible and take away the ability for different components to affect each other. This is often easier said than done, but when possible it is a very useful technique.

Measure and monitor

When something unexpected happens in production, it's critical to have thorough monitoring in place so that you can figure out what happened. As software hardens more and more, unexpected events will get more and more infrequent and reproducing those events will become harder and harder. So when one of those unexpected events happens, you want as much data about the event as possible.

Software should be designed from the start to be monitored. I consider the monitoring aspects of software just as important as the functionality of the software itself. And everything should be measured – latencies, throughput stats, buffer sizes, and anything else relevant to the application. Monitoring is the most important defense against software's inherent uncertainty.

In the same vein, it's important to do measurements of all your components to gain an understanding of their functional input ranges. What throughputs can each component handle? How is latency affected by more traffic? How can you break those components? Doing this measurement work isn't glamorous but is essential to solid engineering.
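As a small example of building measurement in from the start, here's a sketch of recording latencies in-process. It assumes nothing more than an atom of samples; a real system would export these numbers to a metrics or monitoring service rather than keep them in memory.

(def latencies (atom []))   ;; samples in milliseconds

(defmacro measured
  "Run body, record how long it took, and return its result."
  [& body]
  `(let [start# (System/nanoTime)
         ret#   (do ~@body)]
     (swap! latencies conj (/ (- (System/nanoTime) start#) 1e6))
     ret#))

(defn latency-summary []
  (let [samples (sort @latencies)
        n       (count samples)]
    (when (pos? n)
      {:count n
       :mean  (/ (reduce + samples) n)
       :p99   (nth samples (min (dec n) (long (* 0.99 n))))})))

Wrapping a call site in (measured ...) – for example around a hypothetical request handler – gives you latency data for free, and the summary function is the kind of thing you'd check when probing a component's functional input range.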


Software engineering is a constant battle against uncertainty – uncertainty about your specs, uncertainty about your implementation, uncertainty about your dependencies, and uncertainty about your inputs. Recognizing and planning for these uncertainties will make your software more reliable – and make you a better engineer.



My new startup

There's been a lot of speculation about what my new startup is doing, so I've decided to set the record straight and reveal all. We are working on one of the biggest problems on Earth, a problem that affects nearly every person on this planet. Our products will significantly improve the quality of life for billions of people.

We are going to revolutionize the bedsheet industry.

Think about it. There's been almost no innovation in bedsheets in thousands of years. There's nothing worse than waking up to discover one of the corners of your Egyptian cotton fitted sheet has slipped off the mattress. How is this not a solved problem yet? Why are we still using sheets with that annoying elastic in them to secure them to our mattresses? They slip all the time – and if you have a deep mattress, good luck finding sheets that even fit. You're just screwed.

Consider the impact of solving this problem, of a bedsheet product that never slips, that always stays secure on the mattress. This translates to better sleep, to less grogginess, to feeling more upbeat in the morning. It translates into fewer morning arguments between husbands and wives that spiral into divorces, child custody battles, and decades of trauma for the children.

Not only is this a big problem – it's a big opportunity. We've done extensive market research and discovered that our target market is the entire human population. At 7 billion people and an estimated average sale of $20 per sheet, this is at least a $140,000,000,000 opportunity.

We are going to solve this problem using modern, 21st century techniques and take bedsheets out of the Stone Age and into the future. We are going to make it possible to attach your sheets to your bed, completely solving the problem.

Solving this problem in a practical and cost-efficient way is not easy and will require significant engineering breakthroughs. If you're a world-class, rock star by day and ninja by night engineer who's as passionate about bedsheets as I am, please get in touch. I'd love to talk to you.

Bedsheets have been my true passion since I was a child. I'm excited to finally be focused on what I really care about, and I can't wait until the day when untucked sheets are a curious relic of the past.


Leaving Twitter

Yesterday was my last day at Twitter. I left to start my own company. What I'll be working on is very exciting (though I'm keeping it secret for now).

Leaving Twitter was a tough decision. I worked with a whole bunch of great people on fascinating problems with some of the most interesting data in the world. Ultimately though, I felt that if I didn't make this move, I would regret it for the rest of my life. So I put in my papers about a month ago and then spent a month transitioning my team for my departure.

This ends an eventful three years that started with me joining BackType in January of 2010. So much has happened in these past three years. I open-sourced Cascalog, ElephantDB, and Storm, started writing a book, gave a lot of talks, and in July of 2011 experienced the thrill of being acquired. My projects spread beyond BackType and Twitter to be relied on by dozens and dozens of companies. Through all this, I learned an enormous amount about entrepreneurship, product development, marketing, recruiting, and project management.

Stay tuned.