Storm's 1st birthday

Storm was open-sourced exactly one year ago today. It's been an action-packed year for Storm, to say the least. Here's some of the exciting stuff that's happened over the past year:

  • 27 companies have publicized that they're using Storm in production. I know of at least a few more companies using it that haven't published anything yet.
  • O'Reilly published a book on Storm.
  • The Storm mailing list has over 1300 members, with over 500 messages per month.
  • The @stormprocessor account has over 1200 followers.
  • More than 4000 people have starred the project on Github.
  • There's a regular Storm meetup in the Bay Area with over 230 members. I've also seen lots of Storm-focused meetups happen all over the world over the past year.
  • 29 people all over the world have contributed to the codebase
  • We released Trident, a high level abstraction for realtime computation, that is a major leap forward in what's possible in realtime.
  • Libraries have been released integrating Storm with Kestrel, Kafka, JMS, Cassandra, Memcached, and many more systems. For many, Storm is becoming the system of choice for connecting these systems together.
  • Storm's performance has been increased by over 10x. I've benchmarked it at 1M messages per second per node on an internal Twitter cluster.

What I overwhelmingly hear from people is that they like Storm because it's simple to understand, flexible, and extremely robust in production. These have always been some of the core design goals of Storm, so I'm glad that we were able to succeed on these points.

We've got lots of exciting stuff planned over the next year. We have a new metrics system in development which will let you get deep insight into what's happening throughout your topology in realtime. And we have big plans for improving Trident and integrating it with more datastores and input sources.

Happy birthday Storm!


Suffering-oriented programming

Someone asked me an interesting question the other day: "How did you justify taking such a huge risk on building Storm while working on a startup?" (Storm is a realtime computation system). I can see how from an outsider's perspective investing in such a massive project seems extremely risky for a startup. From my perspective, though, building Storm wasn't risky at all. It was challenging, but not risky.

I follow a style of development that greatly reduces the risk of big projects like Storm. I call this style "suffering-oriented programming." Suffering-oriented programming can be summarized like so: don't build technology unless you feel the pain of not having it. It applies to the big, architectural decisions as well as the smaller everyday programming decisions. Suffering-oriented programming greatly reduces risk by ensuring that you're always working on something important, and it ensures that you are well-versed in a problem space before attempting a large investment.

I have a mantra for suffering-oriented programming: "First make it possible. Then make it beautiful. Then make it fast."

First make it possible

When encountering a problem domain with which you're unfamiliar, it's a mistake to try to build a "general" or "extensible" solution right off the bat. You just don't understand the problem domain well enough to anticipate what your needs will be in the future. You'll make things generic that needn't be, adding complexity and wasting time.

It's better to just "hack things out" and be very direct about solving the problems you have at hand. This allows you to get done what you need to get done and avoid wasted work. As you're hacking things out, you'll learn more and more about the intricacies of the problem space.

The "make it possible" phase for Storm was one year of hacking out a stream processing system using queues and workers. We learned about guaranteeing data processing using an "ack" protocol. We learned to scale our realtime computations with clusters of queues and workers. We learned that sometimes you need to partition a message stream in different ways, sometimes randomly and sometimes using a hash/mod technique that makes sure the same entity always goes to the same worker.

We didn't even know we were in the "make it possible" phase. We were just focused on building our products. The pain of the queues and workers system became acute very quickly though. Scaling the queues and workers system was tedious, and the fault-tolerance was nowhere near what we wanted. It was evident that the queues and workers paradigm was not at the right level of abstraction, as most of our code had to do with routing messages and serialization and not the actual business logic we cared about.

At the same time, developing our product drove us to discover new use cases in the "realtime computation" problem space. We built a feature for our product that would compute the reach of a URL on Twitter. Reach is the number of unique people exposed to a URL on Twitter. It's a difficult computation that can require hundreds of database calls and tens of millions of impressions to distinct just for one computation. Our original implementation that ran on a single machine would take over a minute for hard URLs, and it was clear that we needed a distributed system of some sort to parallelize the computation to make it fast.

One of the key realizations that sparked Storm was that the "reach problem" and the "stream processing" problem could be unified by a simple abstraction.

Then make it beautiful

You develop a "map" of the problem space as you explore it by hacking things out. Over time, you acquire more and more use cases within the problem domain and develop a deep understanding of the intricacies of building these systems. This deep understanding can guide the creation of "beautiful" technology to replace your existing systems, alleviate your suffering, and enable new systems/features that were too hard to build before.

The key to developing the "beautiful" solution is figuring out the simplest set of abstractions that solve the concrete use cases you already have. It's a mistake to try to anticipate use cases you don't actually have or else you'll end up overengineering your solution. As a rule of thumb, the bigger the investment you're trying to make, the deeper you need to understand the problem domain and the more diverse your use cases need to be. Otherwise you risk the second-system effect.

"Making it beautiful" is where you use your design and abstraction skills to distill the problem space into simple abstractions that can be composed together. I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you're looking for the simplest curve that fits those points (a set of abstractions).

The more use cases you have, the better you'll be able to find the right curve to fit those points. If you don't have enough points, you're likely to either overfit or underfit the graph, leading to wasted work and overengineering.

A big part of making it beautiful is understanding the performance and resource characteristics of the problem space. This is one of the intricacies you learn in the "making it possible" phase, and you should take advantage of that learning when designing your beautiful solution.

With Storm, I distilled the realtime computation problem domain into a small set of abstractions: streams, spouts, bolts, and topologies. I devised a new algorithm for guaranteeing data processing that eliminated the need for intermediate message brokers, the part of our system that caused the most complexity and suffering. That both stream processing and reach, two very different problems on the surface, mapped so elegantly to Storm was a strong indicator that I was onto something big.

I took additional steps to acquire more use cases for Storm and validate my designs. I canvassed other engineers to learn about the particulars of the realtime problems they were dealing with. I didn't just ask people I knew. I also tweeted out that I was working on a new realtime system and wanted to learn about other people's use cases. This led to a lot of interesting discussions that educated me more on the problem domain and validated my design ideas.

Then make it fast

Once you've built out your beautiful design, you can safely invest time in profiling and optimization. Doing optimization too early will just waste time, because you still might rethink the design. This is called premature optimization.

"Making it fast" isn't about the high level performance characteristics of a system. The understanding of those issues should have been acquired in the "make it possible" phase and designed for in the "make it beautiful" phase. "Making it fast" is about micro-optimizations and tightening up the code to be more resource efficient. So you might worry about things like asymptotic complexity in the "make it beautiful" phase and focus on the constant-time factors in the "make it fast" phase.

Rinse and repeat

Suffering-oriented programming is a continuous process. The beautiful systems you build give you new capabilities, which allow you to "make it possible" in new and deeper areas of the problem space. This feeds learning back to the technology. You often have to tweak or add to the abstractions you've already come up with to handle more and more use cases.

Storm has gone through many iterations like this. When we first started using Storm, we discovered that we needed the capability to emit multiple, independent streams from a single component. We discovered that the addition of a special kind of stream called the "direct stream" would allow Storm to process batches of tuples as a concrete unit. Recently I developed "transactional topologies" which go beyond Storm's at-least-once processing guarantee and allow exactly-once messaging semantics to be achieved for nearly arbitrary realtime computation.

By its nature, hacking things out in a problem domain you don't understand so well and constantly iterating can lead to some sloppy code. The most important characteristic of a suffering-oriented programmer is a relentless focus on refactoring. This is critical to prevent accidental complexity from sabotaging the codebase.


Use cases are everything in suffering-oriented programming. They're worth their weight in gold. The only way to acquire use cases is through gaining experience through hacking.

There's a certain evolution most programmers go through. You start off struggling to get things to work and have absolutely no structure to your code. Code is sloppy and copy/pasting is prevalent. Eventually you learn about the benefits of structured programming and sharing logic as much as possible. Then you learn about making generic abstractions and using encapsulation to make it easier to reason about systems. Then you become obsessed with making all your code generic, with making things extensible to future-proof your programs.

Suffering-oriented programming rejects that you can effectively anticipate needs you don't currently have. It recognizes that attempts to make things generic without a deep understanding of the problem domain will lead to complexity and waste. Designs must always be driven by real, tangible use cases.

You should follow me on Twitter here.


Early access edition of my book is available

The early access edition of my book Big Data: principles and best practices of scalable realtime data systems is now available from Manning! I've been working on this book for quite some time, and I'm excited to have it out there and start getting some feedback.

The interest in the book has already been overwhelming, and I've been answering questions about it on Hacker News.


How to beat the CAP theorem

The CAP theorem states a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can't sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn't available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn't in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you've ever used a product that told you to "try again later", you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is "eventual consistency". If you use an eventually consistent database, then sometimes you'll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called "read repair").

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I'm saying that you're damned if you do and damned if you don't. The CAP theorem is a fact of nature, so what alternative can there possibly be?

There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems. The complexity caused by the CAP theorem is a symptom of fundamental problems in how we approach building data systems. Two problems stand out in particular: the use of mutable state in databases and the use of incremental algorithms to update that state. It is the interaction between these problems and the CAP theorem that causes complexity.

In this post I'll show the design of a system that beats the CAP theorem by preventing the complexity it normally causes. But I won't stop there. The CAP theorem is a result about the degree to which data systems can be fault-tolerant to machine failure. Yet there's a form of fault-tolerance that's much more important than machine fault-tolerance: human fault-tolerance. If there's any certainty in software development, it's that developers aren't perfect and bugs will inevitably reach production. Our data systems must be resilient to buggy programs that write bad data, and the system I'm going to show is as human fault-tolerant as you can get.

This post is going to challenge your basic assumptions on how data systems should be built. But by breaking down our current ways of thinking and re-imagining how data systems should be built, what emerges is an architecture more elegant, scalable, and robust than you ever thought possible.

Click to read more ...


My talks at POSSCON

Last week I went to POSSCON in Columbia, South Carolina. It was an interesting experience and a good reminder that not everyone in the world thinks like we do in Silicon Valley.

I gave two talks at the conference. One was a technical talk about how to build realtime Big Data systems, and the other was a non-technical talk about the things we do at BackType to be a super-productive team. Both slide decks are embedded below.


Inglourious Software Patents

Most articles arguing for the abolishment of software patents focus on how so many software patents don't meet the "non-obvious and non-trivial" guidelines for patents. The problem with this approach is that the same argument could be used to advocate for reform in how software patents are evaluated rather than the abolishment of software patents altogether.

Software patents should be abolished though, and I'm going to show this with an economic analysis. We'll see that even non-obvious and non-trivial software patents should never be granted as they can only cause economic loss.

Why do patents exist in the first place?

The patent system exists to provide an incentive for innovation where that incentive would not have existed otherwise.

Imagine you're an individual living in the 19th century. Let's say the patent system does not exist and you have an idea to make a radically better kind of sewing machine. If you invested the time to develop your idea into a working invention, the existing sewing machine companies would just steal your design and crush you in the marketplace. They have massive distribution and production advantages that you wouldn't be able to compete with. You wouldn't be able to monetize the initial investment you made into developing that invention. Therefore, you wouldn't have invented the radically better sewing machine in the first place.

From this perspective, patents are actually a rather clever hack on society to encourage innovation. By excluding others from using your invention for a fixed amount of time, you get a temporary monopoly on your invention. This lets you monetize your invention which makes your initial investment worthwhile. This in turn benefits society as a whole, as now society has inventions that it wouldn't have had otherwise.

The patent system does not exist to protect intellectual property as a goal unto itself. If the incentive to create the innovation was there without the patent system, then the patent system is serving no purpose.

After all, there is a cost to the patent system. There's no hard and fast way to determine whether an invention required the promise of a patent for its creation, so inevitably some patents will be awarded to inventions that would have been created anyway. The patent system creates monopolies out of these inventions that would have existed in a competitive marketplace otherwise. These are "accidental monopolies" in the sense that they are unintended consequences of a patent system trying to encourage innovation that wouldn't have occurred without the patent system.

Accidental monopolies are the cost of the patent system. For the patent system to be worthwhile, the amount of benefit from inventions that wouldn't have existed otherwise should exceed the cost of accidental monopolies. The purpose of the "non-obvious and non-trivial" guidelines is to try to minimize the number of patents awarded that create accidental monopolies.

Innovation in software

Patents are not necessary for innovation to occur in software. You'll have a hard time finding many examples of software innovations that wouldn't have been made without the promise of a patent. This means that every software patent creates an accidental monopoly.

A good place to look at the importance of patents to software innovation is startups. Startups must innovate if they want to become sustainable businesses. The question is -- do patents encourage innovation in startups by protecting them from having their ideas stolen?

Quite the opposite. Software startups are thriving nowadays in spite of software patents rather than because of them. Instead of helping startups get off the ground, patents are a cost. Startups must build "defensive patent portfolios" and worry about getting sued by patent trolls or businesses trying to entrench their position. Instead of patents being a protective shield for a startup, they're instead a weapon that causes economic waste.

It's hard for a big company to just steal a software idea. Being big just isn't the advantage in the software industry as it was in our sewing machine example. They don't have the same production and distribution advantages since the internet makes it cost practically nothing to distribute software. Furthermore, it's not that easy to just copy a software product. Look at what happened with Google Buzz.

At my company BackType, we're doing a lot of innovative things with Big Data systems. Rather than try to patent our ideas and achieve an exclusive monopoly on what we invent, we're doing the opposite. We're sharing these inventions with the world by open-sourcing them. We do this because it helps so much with recruiting: it establishes us as a serious technology company, and programmers want to work at companies where they can contribute to open source.

Vivek Wadwha has a good post showing the stats on how counter-productive patents are in the software industry.

Since massive amounts of software innovation occurs in spite of the patent system, software patents are irrational.

The openness argument

Another purported reason for the existence of patents is that it encourages inventions to be shared rather than be kept secret. This ensures the invention enters the public domain and prevents the invention from ever being lost.

This argument doesn't hold with software. Just look at the facts:

  • The academic community publishes their innovations to the public.
  • There is a massive and rapidly growing amount of innovative open source software.
  • Companies have strong incentives to participate in open source.

When I'm looking for innovative software approaches, I search the Internet or I look at research papers. I never look at software patents, and I don't know anyone in the software industry who would.


The fashion industry is an excellent example of an industry that has no patents and thrives.

Even non-obvious and non-trivial software ideas should not be patentable, because the promise of a patent is not necessary for innovation in software. The economics are clear: software patents should be abolished.

You should follow me on Twitter here.


Cascalog workshop

I'll be teaching a Cascalog workshop on February 19th at BackType HQ in Union Square. You can sign up at Early bird tickets are available until January 31st.

I'm very excited to be teaching this workshop. Cascalog's tight integration with Clojure opens up a world of techniques that no other data processing tool is able to do. Even though I created Cascalog, I've been discovering many of these techniques as I've made use of Cascalog for more and more varied tasks. Along the way, I've tweaked Cascalog so that making use of these techniques would be cleaner and more idiomatic. At this point, after nine months of iteration, Cascalog is a joy to use for even the most complex tasks. I'm excited to impart this knowledge upon others in this workshop.


Analysis of the #LessAmbitiousMovies Twitter Meme

We did a fun post on the BackType blog today analyzing a meme that took off on Twitter this week. A person with about 500 followers started the meme that eventually reached more than 27 million people. Check out our analysis here, and you can check out TechCrunch coverage of our analysis here.

Doing the analysis was relatively simple. We extracted an 80 MB dataset of the tweets involved in the meme from our 25 TB social dataset. We downloaded that data to a local computer and ran queries on the data from a Clojure REPL using Cascalog. Doing the data analysis only took us a couple hours.


How to reject a job candidate without being an asshole

I used to send a job candidate an email like the following after the person failed a phone interview:

Hi [Candidate],

Thanks a lot for taking the time to interview with us. However, we've decided not to move forward in the process with you. I wish you the best of luck on your future projects.


Rejection emails like that are cold, impersonal, and hollow. It doesn't feel good to send an email like that, and it sure doesn't feel good to receive an email like that.

You want a candidate to feel good about your company even after failing the interview process. You don't want the candidate to discourage their friends from interviewing with your company, and ideally you want the candidate to refer their friends to your company.

Now I tell candidates on the spot whether they pass or fail at the end of the phone interview. I give them feedback on what they did well and what they did poorly. I'm very candid with them.

The first time I tried this, I had butterflies in my stomach. I thought to myself, "What if he gets angry? What if he freaks out on me and starts screaming obscenities?" It turns out my fears were completely unfounded.

I've found that candidates really appreciate my candor about their rejection. They appreciate being told why they failed, especially because everyone else is so impersonal and non-specific about rejections.

When I reject a candidate, I lead with what I liked about the person. For example, I might tell the candidate how I thought the project he described was interesting. I find that leading with an honest compliment softens the blow of the impending criticism. I haven't yet been at a loss for something nice to say to a candidate, and it would be a rare candidate for that to happen.

Next, I tell the candidate why we can't move forward in the process with them. I tell the candidate what we're looking for and why I think they're not qualified. Let's face it: candidates know when they didn't do a good job. They know if they bombed the coding question you gave them. I tell the candidate that I know it's possible that we're making a mistake, that coding questions aren't always the most accurate indicators, but that it's the decision we have to make based on the results of the interview.

Finally, I give candidates a chance to give me feedback. I ask what they thought about our interview process, and I ask what I could do better as an interviewer. I've gotten some valuable feedback this way, and I'm becoming a better interviewer because of the feedback. Plus, giving candidates the opportunity to criticize me makes the process feel a lot less one-sided.

I'm really happy about my decision to reject candidates on the spot at the end of phone interviews rather than with impersonal emails. I don't feel like an asshole anymore, and candidates seem to appreciate the feedback.

You should follow me on Twitter here.


You Are a Product

I had a revelation the other day. I realized that the terms "programmer" and "employee" are inadequate to describe what I am. What I am is a product, and you are one too. If you want to develop your career, you need to approach your career as a product development problem.

You sell yourself for various things: money, status, the opportunity to work on interesting problems, good coworkers, etc. In this post I'll be referring to this as "getting paid", but please keep in mind that "getting paid" means more than just money.

Supply and Demand

Like any product, you have supply and demand. Your supply is what you can do for a company that hires you. It's your ability to make beautiful websites. It's your ability to scale a database. It's your ability to get the best work out of others. Your supply is the actual value you will provide to a company that hires you.

Your demand is what companies think you can do for them. Your demand is your perceived value by others. At the end of the day, you will be paid according to how you're perceived, not by the actual value you can produce. This is why so many 10x engineers don't actually get paid 10x -- they're not publicly perceived as 10x engineers so normal market forces are unable to bid up their value.

I see way too many people say to themselves "As long as I just put out quality work, I'll be taken care of." This is bullshit. This way of thinking prevents you from reaching your potential. It prevents you from being paid what you should be getting paid, and it prevents you from bettering your status. You cannot just focus on your supply. Supply is only 50% of the equation. You could be the greatest programmer to ever live, but if no one knows that, it won't help you. You are a product, and if you want to get paid appropriately, you have to work on your demand.

Personal branding

Influencing your demand is called "personal branding." It's marketing. Your actual value -- your supply -- is important to the extent that you can use it to raise your perceived value -- your demand.

Personal branding is inherently a public activity. Market forces rely on information being public. You want lots of people to believe that you can provide them with lots of value. This will lead to opportunities for you. Many of these opportunities will be out of the blue and unexpected.

There are lots of things you can do to increase your demand. Start a blog and promote it through Twitter and social news sites. Speak at conferences. Build social proof by building up your Twitter follower list. Participate in open source projects and write blog posts about the work you're doing on the projects.

I think open source is the best activity a programmer can engage in. It makes public your actual ability to solve problems and write code. You should strongly prefer to work at companies with a culture of making and contributing to open source projects, as that gives you the opportunity to market yourself.

I think the best personal branding activities are rooted in the actual value you can provide to others. There are other activities that can increase your demand, like taking credit for the work of others, that are flat out unethical. Don't be a product that's just smoke and mirrors.

Marketing yourself takes work, but it's something that gets easier with practice. It would be stupid to release a product to market without promoting and marketing it. Likewise, you should treat yourself as a product and market yourself as such. When you do, you can watch the forces of supply and demand work their magic.

You should follow me on Twitter here.