My new startup

There's been a lot of speculation about what my new startup is doing, so I've decided to set the record straight and reveal all. We are working on one of the biggest problems on Earth, a problem that affects nearly every person on this planet. Our products will significantly improve the quality of life for billions of people.

We are going to revolutionize the bedsheet industry.

Think about it. There's been almost no innovation in bedsheets in thousands of years. There's nothing worse than waking up to discover one of the corners of your Egyptian cotton fitted sheets has slipped off the mattress. How is this not a solved problem yet? Why are we still using sheets with that annoying elastic in it to secure them to our mattresses? They slip all the time – and if you have a deep mattress, good luck finding sheets that even fit. You're just screwed.

Consider the impact of solving this problem, of a bedsheet product that never slips, that always stays secure on the mattress. This translates to better sleep, to less grogginess in the morning, to feeling more upbeat in the morning. This translates into less morning arguments between husbands and wives that spiral into divorces, child custody battles, and decades of trauma for the children.

Not only is this a big problem – it's a big opportunity. We've done extensive market research and discovered that our target market is the entire human population. At 7 billion people and an estimated average sale of $20 per sheet, this is at least a $140,000,000,000 opportunity.

We are going to solve this problem using modern, 21st century techniques and take bedsheets out of the Stone Age and into the future. We are going to make it possible to attach your sheets to your bed, completely solving the problem.

Solving this problem in a practical and cost-efficient way is not easy and will require significant engineering breakthroughs. If you're a world-class, rock star by day and ninja by night engineer who's as passionate about bedsheets as I am, please get in touch. I'd love to talk to you.

Bedsheets have been my true passion since I was a child. I'm excited to finally be focused on what I really care about, and I can't wait until the day when untucked sheets are a curious relic of the past.


Leaving Twitter

Yesterday was my last day at Twitter. I left to start my own company. What I'll be working on is very exciting (though I'm keeping it secret for now).

Leaving Twitter was a tough decision. I worked with a whole bunch of great people on fascinating problems with some of the most interesting data in the world. Ultimately though, I felt that if I didn't make this move, I would regret it for the rest of my life. So I put in my papers about a month ago and then spent a month transitioning my team for my departure.

This ends an eventful three years that started with me joining BackType in January of 2010. So much has happened in these past three years. I open-sourced Cascalog, ElephantDB, and Storm, started writing a book, gave a lot of talks, and in July of 2011 experienced the thrill of being acquired. My projects spread beyond BackType and Twitter to be relied on by dozens and dozens of companies. Through all this, I learned an enormous amount about entrepreneurship, product development, marketing, recruiting, and project management.

Stay tuned.


Storm's 1st birthday

Storm was open-sourced exactly one year ago today. It's been an action-packed year for Storm, to say the least. Here's some of the exciting stuff that's happened over the past year:

  • 27 companies have publicized that they're using Storm in production. I know of at least a few more companies using it that haven't published anything yet.
  • O'Reilly published a book on Storm.
  • The Storm mailing list has over 1300 members, with over 500 messages per month.
  • The @stormprocessor account has over 1200 followers.
  • More than 4000 people have starred the project on Github.
  • There's a regular Storm meetup in the Bay Area with over 230 members. I've also seen lots of Storm-focused meetups happen all over the world over the past year.
  • 29 people all over the world have contributed to the codebase
  • We released Trident, a high level abstraction for realtime computation, that is a major leap forward in what's possible in realtime.
  • Libraries have been released integrating Storm with Kestrel, Kafka, JMS, Cassandra, Memcached, and many more systems. For many, Storm is becoming the system of choice for connecting these systems together.
  • Storm's performance has been increased by over 10x. I've benchmarked it at 1M messages per second per node on an internal Twitter cluster.

What I overwhelmingly hear from people is that they like Storm because it's simple to understand, flexible, and extremely robust in production. These have always been some of the core design goals of Storm, so I'm glad that we were able to succeed on these points.

We've got lots of exciting stuff planned over the next year. We have a new metrics system in development which will let you get deep insight into what's happening throughout your topology in realtime. And we have big plans for improving Trident and integrating it with more datastores and input sources.

Happy birthday Storm!


Suffering-oriented programming

Someone asked me an interesting question the other day: "How did you justify taking such a huge risk on building Storm while working on a startup?" (Storm is a realtime computation system). I can see how from an outsider's perspective investing in such a massive project seems extremely risky for a startup. From my perspective, though, building Storm wasn't risky at all. It was challenging, but not risky.

I follow a style of development that greatly reduces the risk of big projects like Storm. I call this style "suffering-oriented programming." Suffering-oriented programming can be summarized like so: don't build technology unless you feel the pain of not having it. It applies to the big, architectural decisions as well as the smaller everyday programming decisions. Suffering-oriented programming greatly reduces risk by ensuring that you're always working on something important, and it ensures that you are well-versed in a problem space before attempting a large investment.

I have a mantra for suffering-oriented programming: "First make it possible. Then make it beautiful. Then make it fast."

First make it possible

When encountering a problem domain with which you're unfamiliar, it's a mistake to try to build a "general" or "extensible" solution right off the bat. You just don't understand the problem domain well enough to anticipate what your needs will be in the future. You'll make things generic that needn't be, adding complexity and wasting time.

It's better to just "hack things out" and be very direct about solving the problems you have at hand. This allows you to get done what you need to get done and avoid wasted work. As you're hacking things out, you'll learn more and more about the intricacies of the problem space.

The "make it possible" phase for Storm was one year of hacking out a stream processing system using queues and workers. We learned about guaranteeing data processing using an "ack" protocol. We learned to scale our realtime computations with clusters of queues and workers. We learned that sometimes you need to partition a message stream in different ways, sometimes randomly and sometimes using a hash/mod technique that makes sure the same entity always goes to the same worker.

We didn't even know we were in the "make it possible" phase. We were just focused on building our products. The pain of the queues and workers system became acute very quickly though. Scaling the queues and workers system was tedious, and the fault-tolerance was nowhere near what we wanted. It was evident that the queues and workers paradigm was not at the right level of abstraction, as most of our code had to do with routing messages and serialization and not the actual business logic we cared about.

At the same time, developing our product drove us to discover new use cases in the "realtime computation" problem space. We built a feature for our product that would compute the reach of a URL on Twitter. Reach is the number of unique people exposed to a URL on Twitter. It's a difficult computation that can require hundreds of database calls and tens of millions of impressions to distinct just for one computation. Our original implementation that ran on a single machine would take over a minute for hard URLs, and it was clear that we needed a distributed system of some sort to parallelize the computation to make it fast.

One of the key realizations that sparked Storm was that the "reach problem" and the "stream processing" problem could be unified by a simple abstraction.

Then make it beautiful

You develop a "map" of the problem space as you explore it by hacking things out. Over time, you acquire more and more use cases within the problem domain and develop a deep understanding of the intricacies of building these systems. This deep understanding can guide the creation of "beautiful" technology to replace your existing systems, alleviate your suffering, and enable new systems/features that were too hard to build before.

The key to developing the "beautiful" solution is figuring out the simplest set of abstractions that solve the concrete use cases you already have. It's a mistake to try to anticipate use cases you don't actually have or else you'll end up overengineering your solution. As a rule of thumb, the bigger the investment you're trying to make, the deeper you need to understand the problem domain and the more diverse your use cases need to be. Otherwise you risk the second-system effect.

"Making it beautiful" is where you use your design and abstraction skills to distill the problem space into simple abstractions that can be composed together. I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you're looking for the simplest curve that fits those points (a set of abstractions).

The more use cases you have, the better you'll be able to find the right curve to fit those points. If you don't have enough points, you're likely to either overfit or underfit the graph, leading to wasted work and overengineering.

A big part of making it beautiful is understanding the performance and resource characteristics of the problem space. This is one of the intricacies you learn in the "making it possible" phase, and you should take advantage of that learning when designing your beautiful solution.

With Storm, I distilled the realtime computation problem domain into a small set of abstractions: streams, spouts, bolts, and topologies. I devised a new algorithm for guaranteeing data processing that eliminated the need for intermediate message brokers, the part of our system that caused the most complexity and suffering. That both stream processing and reach, two very different problems on the surface, mapped so elegantly to Storm was a strong indicator that I was onto something big.

I took additional steps to acquire more use cases for Storm and validate my designs. I canvassed other engineers to learn about the particulars of the realtime problems they were dealing with. I didn't just ask people I knew. I also tweeted out that I was working on a new realtime system and wanted to learn about other people's use cases. This led to a lot of interesting discussions that educated me more on the problem domain and validated my design ideas.

Then make it fast

Once you've built out your beautiful design, you can safely invest time in profiling and optimization. Doing optimization too early will just waste time, because you still might rethink the design. This is called premature optimization.

"Making it fast" isn't about the high level performance characteristics of a system. The understanding of those issues should have been acquired in the "make it possible" phase and designed for in the "make it beautiful" phase. "Making it fast" is about micro-optimizations and tightening up the code to be more resource efficient. So you might worry about things like asymptotic complexity in the "make it beautiful" phase and focus on the constant-time factors in the "make it fast" phase.

Rinse and repeat

Suffering-oriented programming is a continuous process. The beautiful systems you build give you new capabilities, which allow you to "make it possible" in new and deeper areas of the problem space. This feeds learning back to the technology. You often have to tweak or add to the abstractions you've already come up with to handle more and more use cases.

Storm has gone through many iterations like this. When we first started using Storm, we discovered that we needed the capability to emit multiple, independent streams from a single component. We discovered that the addition of a special kind of stream called the "direct stream" would allow Storm to process batches of tuples as a concrete unit. Recently I developed "transactional topologies" which go beyond Storm's at-least-once processing guarantee and allow exactly-once messaging semantics to be achieved for nearly arbitrary realtime computation.

By its nature, hacking things out in a problem domain you don't understand so well and constantly iterating can lead to some sloppy code. The most important characteristic of a suffering-oriented programmer is a relentless focus on refactoring. This is critical to prevent accidental complexity from sabotaging the codebase.


Use cases are everything in suffering-oriented programming. They're worth their weight in gold. The only way to acquire use cases is through gaining experience through hacking.

There's a certain evolution most programmers go through. You start off struggling to get things to work and have absolutely no structure to your code. Code is sloppy and copy/pasting is prevalent. Eventually you learn about the benefits of structured programming and sharing logic as much as possible. Then you learn about making generic abstractions and using encapsulation to make it easier to reason about systems. Then you become obsessed with making all your code generic, with making things extensible to future-proof your programs.

Suffering-oriented programming rejects that you can effectively anticipate needs you don't currently have. It recognizes that attempts to make things generic without a deep understanding of the problem domain will lead to complexity and waste. Designs must always be driven by real, tangible use cases.

You should follow me on Twitter here.


Early access edition of my book is available

The early access edition of my book Big Data: principles and best practices of scalable realtime data systems is now available from Manning! I've been working on this book for quite some time, and I'm excited to have it out there and start getting some feedback.

The interest in the book has already been overwhelming, and I've been answering questions about it on Hacker News.


How to beat the CAP theorem

The CAP theorem states a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can't sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn't available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn't in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you've ever used a product that told you to "try again later", you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is "eventual consistency". If you use an eventually consistent database, then sometimes you'll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called "read repair").

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I'm saying that you're damned if you do and damned if you don't. The CAP theorem is a fact of nature, so what alternative can there possibly be?

There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems. The complexity caused by the CAP theorem is a symptom of fundamental problems in how we approach building data systems. Two problems stand out in particular: the use of mutable state in databases and the use of incremental algorithms to update that state. It is the interaction between these problems and the CAP theorem that causes complexity.

In this post I'll show the design of a system that beats the CAP theorem by preventing the complexity it normally causes. But I won't stop there. The CAP theorem is a result about the degree to which data systems can be fault-tolerant to machine failure. Yet there's a form of fault-tolerance that's much more important than machine fault-tolerance: human fault-tolerance. If there's any certainty in software development, it's that developers aren't perfect and bugs will inevitably reach production. Our data systems must be resilient to buggy programs that write bad data, and the system I'm going to show is as human fault-tolerant as you can get.

This post is going to challenge your basic assumptions on how data systems should be built. But by breaking down our current ways of thinking and re-imagining how data systems should be built, what emerges is an architecture more elegant, scalable, and robust than you ever thought possible.

Click to read more ...


My talks at POSSCON

Last week I went to POSSCON in Columbia, South Carolina. It was an interesting experience and a good reminder that not everyone in the world thinks like we do in Silicon Valley.

I gave two talks at the conference. One was a technical talk about how to build realtime Big Data systems, and the other was a non-technical talk about the things we do at BackType to be a super-productive team. Both slide decks are embedded below.


Inglourious Software Patents

Most articles arguing for the abolishment of software patents focus on how so many software patents don't meet the "non-obvious and non-trivial" guidelines for patents. The problem with this approach is that the same argument could be used to advocate for reform in how software patents are evaluated rather than the abolishment of software patents altogether.

Software patents should be abolished though, and I'm going to show this with an economic analysis. We'll see that even non-obvious and non-trivial software patents should never be granted as they can only cause economic loss.

Why do patents exist in the first place?

The patent system exists to provide an incentive for innovation where that incentive would not have existed otherwise.

Imagine you're an individual living in the 19th century. Let's say the patent system does not exist and you have an idea to make a radically better kind of sewing machine. If you invested the time to develop your idea into a working invention, the existing sewing machine companies would just steal your design and crush you in the marketplace. They have massive distribution and production advantages that you wouldn't be able to compete with. You wouldn't be able to monetize the initial investment you made into developing that invention. Therefore, you wouldn't have invented the radically better sewing machine in the first place.

From this perspective, patents are actually a rather clever hack on society to encourage innovation. By excluding others from using your invention for a fixed amount of time, you get a temporary monopoly on your invention. This lets you monetize your invention which makes your initial investment worthwhile. This in turn benefits society as a whole, as now society has inventions that it wouldn't have had otherwise.

The patent system does not exist to protect intellectual property as a goal unto itself. If the incentive to create the innovation was there without the patent system, then the patent system is serving no purpose.

After all, there is a cost to the patent system. There's no hard and fast way to determine whether an invention required the promise of a patent for its creation, so inevitably some patents will be awarded to inventions that would have been created anyway. The patent system creates monopolies out of these inventions that would have existed in a competitive marketplace otherwise. These are "accidental monopolies" in the sense that they are unintended consequences of a patent system trying to encourage innovation that wouldn't have occurred without the patent system.

Accidental monopolies are the cost of the patent system. For the patent system to be worthwhile, the amount of benefit from inventions that wouldn't have existed otherwise should exceed the cost of accidental monopolies. The purpose of the "non-obvious and non-trivial" guidelines is to try to minimize the number of patents awarded that create accidental monopolies.

Innovation in software

Patents are not necessary for innovation to occur in software. You'll have a hard time finding many examples of software innovations that wouldn't have been made without the promise of a patent. This means that every software patent creates an accidental monopoly.

A good place to look at the importance of patents to software innovation is startups. Startups must innovate if they want to become sustainable businesses. The question is -- do patents encourage innovation in startups by protecting them from having their ideas stolen?

Quite the opposite. Software startups are thriving nowadays in spite of software patents rather than because of them. Instead of helping startups get off the ground, patents are a cost. Startups must build "defensive patent portfolios" and worry about getting sued by patent trolls or businesses trying to entrench their position. Instead of patents being a protective shield for a startup, they're instead a weapon that causes economic waste.

It's hard for a big company to just steal a software idea. Being big just isn't the advantage in the software industry as it was in our sewing machine example. They don't have the same production and distribution advantages since the internet makes it cost practically nothing to distribute software. Furthermore, it's not that easy to just copy a software product. Look at what happened with Google Buzz.

At my company BackType, we're doing a lot of innovative things with Big Data systems. Rather than try to patent our ideas and achieve an exclusive monopoly on what we invent, we're doing the opposite. We're sharing these inventions with the world by open-sourcing them. We do this because it helps so much with recruiting: it establishes us as a serious technology company, and programmers want to work at companies where they can contribute to open source.

Vivek Wadwha has a good post showing the stats on how counter-productive patents are in the software industry.

Since massive amounts of software innovation occurs in spite of the patent system, software patents are irrational.

The openness argument

Another purported reason for the existence of patents is that it encourages inventions to be shared rather than be kept secret. This ensures the invention enters the public domain and prevents the invention from ever being lost.

This argument doesn't hold with software. Just look at the facts:

  • The academic community publishes their innovations to the public.
  • There is a massive and rapidly growing amount of innovative open source software.
  • Companies have strong incentives to participate in open source.

When I'm looking for innovative software approaches, I search the Internet or I look at research papers. I never look at software patents, and I don't know anyone in the software industry who would.


The fashion industry is an excellent example of an industry that has no patents and thrives.

Even non-obvious and non-trivial software ideas should not be patentable, because the promise of a patent is not necessary for innovation in software. The economics are clear: software patents should be abolished.

You should follow me on Twitter here.


Cascalog workshop

I'll be teaching a Cascalog workshop on February 19th at BackType HQ in Union Square. You can sign up at Early bird tickets are available until January 31st.

I'm very excited to be teaching this workshop. Cascalog's tight integration with Clojure opens up a world of techniques that no other data processing tool is able to do. Even though I created Cascalog, I've been discovering many of these techniques as I've made use of Cascalog for more and more varied tasks. Along the way, I've tweaked Cascalog so that making use of these techniques would be cleaner and more idiomatic. At this point, after nine months of iteration, Cascalog is a joy to use for even the most complex tasks. I'm excited to impart this knowledge upon others in this workshop.


Analysis of the #LessAmbitiousMovies Twitter Meme

We did a fun post on the BackType blog today analyzing a meme that took off on Twitter this week. A person with about 500 followers started the meme that eventually reached more than 27 million people. Check out our analysis here, and you can check out TechCrunch coverage of our analysis here.

Doing the analysis was relatively simple. We extracted an 80 MB dataset of the tweets involved in the meme from our 25 TB social dataset. We downloaded that data to a local computer and ran queries on the data from a Clojure REPL using Cascalog. Doing the data analysis only took us a couple hours.