Break into Silicon Valley with a blog

I know a lot of non-technical people who would love to work in the venture-funded startup world, from consultants to finance people to other business types for which I'm not really sure exactly what it is they do. They hit obstacles trying to get into the startup world, finding that their skills are either irrelevant or hard to explain. My advice to all these people is the same:

Write a blog.

A blog can improve your life in enormous ways. Or to put it in business-speak: a blog has one of the highest ROIs of anything you can do.

Put yourself in the shoes of startups looking for talent. First off – startups are desperate for talent. The problem is that it's very difficult to identify great people – startups search through loads and loads of candidates.

Resumes and interviews only tell you so much about a person. It's really hard to stand out in a resume – you're not the only one putting over-inflated impressive-looking numbers and bullet points on your resume. And interviews are notorious for labeling bad people as good and good people as bad. So to maximize your odds of making it through the funnel, you need to show that you're awesome independent of the randomness of the normal process.

One thing you can do is write an insightful blog. This makes you look a lot more compelling. Now the reaction from startups will be "Hey, this person's really smart. We don't want to miss out on a potentially great hire, so let's put in a lot of effort to determine if there's a good fit."

A new dimension of opportunity

There's another huge advantage to having a blog besides being a mechanism to show that you're smart and insightful. A blog opens up a whole new dimension of opportunity for you. Instead of relying purely on outbound opportunities that you explicitly seek out yourself, you also will get inbound opportunities where people reach out to you with opportunities you never expected or dreamed of.

With an outbound opportunity you know exactly what you're seeking, whether it's landing a job or speaking at a conference or something else. Inbound opportunities, on the other hand, are highly uncertain. They come to you out of the blue. In my personal experience, many of the most awesome things I've done started as inbound opportunities: a book deal, flying all around the world for free to speak at conferences, a keynote at a major conference, and connecting with hundreds of awesome people who have reached out to me because of something I did or wrote publicly.

When you write a blog, you greatly increase the likelihood of getting awesome inbound opportunities. When it comes to breaking into Silicon Valley – instead of everything being on your shoulders to seek interesting companies, those companies will be reaching out to you.

A great phrase I've heard for this is "increasing your luck surface area". By providing value to people publicly, like writing insightful posts on a blog, you open yourself up to serendipitous, "lucky" opportunities.

Getting readers

Besides writing smart posts, you also need people to read your writing. Here are a few tips for accomplishing that.

First off, the title of a blog post is incredibly important. In very few words, you need to sell your potential reader that your blog post is going to be worth their time. I've found the best titles are relevant to the potential reader, somewhat mysterious, and non-generic. Titles are definitely an art form, so you should think hard about how you'll name your posts. Sometimes I wait days to publish a post because I haven't thought of a compelling enough title.

Second, I highly recommend using Twitter as a distribution platform for your blog posts. The combination of Twitter and blogging leads to a beautifully virtuous cycle: your blog increases your Twitter following, and as your Twitter following grows you increase the reach of your blog. I consider Twitter to be the greatest professional networking tool ever devised – I follow people who tweet/blog interesting things and they follow me for the same reason. Then when I go to conferences I seek out the people who I know and respect from their online presence. When we meet, we already know a lot about each other and have a lot to talk about.

Lastly, you should embrace the online communities who will care about your blog. In Silicon Valley, the most important community is Hacker News. Hacker News is widely read in Silicon Valley by programmers, entrepreneurs, and investors. It can drive a lot of readers to your blog in a short amount of time.

Initially, it may be hard for you to get readers. Getting your posts on Hacker News is very much a crapshoot, and initially you'll have too small a Twitter following to get much distribution. But occasionally you'll write something smart that gets on Hacker News and gets shared around. Over time, as your writing and distribution improve, getting readers gets easier and easier.

What to write about

If you don't think you have anything to write about, then let me ask you a question. Do you really have that low of an opinion of yourself? Do you really think you have nothing interesting that you can share with the world? There's tons of stuff that you can write about that you don't even know to share. You have a ton of knowledge that you don't realize other people don't know because you spend all your time in your own head. Tell stories of times that you hustled. Write about the dynamics of big companies. Write case studies of anything related to running a business. Analyze the market for interesting new technologies (e.g. 3D printing, Bitcoin, etc). There's so much that you can write about.

Once you start blogging, you'll become attuned to random ideas you have throughout the day that would make good blog posts. Most of my blog ideas start off as email reminders to myself.

Final thoughts

If you haven't blogged before, you're going to suck at first. Being accurate, precise, and insightful is not enough. You have to learn how to hook people into your posts and keep the post engaging. You'll learn about the glorious world of internet commenting, where people constantly misinterpret what you say and apply very fallacious reasoning to your posts. You'll see people trash your ideas on Hacker News even though it's clear they didn't read your entire post. Sometimes they comment having only read the title! You'll learn over time different ways to structure the same information in order to minimize misinterpretation. You'll learn to anticipate fallacious reasoning and preemptively address those fallacies.

With writing, practice most definitely makes perfect. I sucked at writing at first, but I quickly improved.

A lot of people say they "don't have time to write." To be blunt, I think comments like this are the result of laziness and self-delusion. Writing a blog is really not that much work. You really can't find a couple hours to pump out a blog post? Just occasionally, instead of going out to the bar or seeing a movie or going surfing or doing whatever it is you do for fun, try writing. The potential benefits relative to the investment are MASSIVE. I haven't even discussed all the other benefits which on their own make blogging worthwhile.

Of course, writing isn't the only thing you can do to help yourself break into Silicon Valley. But it's an enormously easy way to make yourself stand out and open yourself to opportunities you never expected.

You should follow me on Twitter here.


Principles of Software Engineering, Part 1

This is the first in a series of posts on the principles of software engineering. There's far more to software engineering than just "making computers do stuff" – while that phrase is accurate, it does not come close to describing what's involved in making robust, reliable software. I will use my experience building large-scale systems to inform a first-principles approach to defining what it is we do – or should be doing – as software engineers. I'm not interested in tired debates like dynamic vs. static languages – instead, I intend to explore the really core aspects of software engineering.

The first order of business is to define what software engineering even is in the first place. Software engineering is the construction of software that produces some desired output for some range of inputs. The inputs to software are more than just method parameters: they include the hardware on which it's running, the rate at which it receives data, and anything else that influences the operation of the software. Likewise, the output of software is more than just the data it emits and includes performance metrics like latency.

I think there's a distinction between programming a computer and software engineering. Programming is a deterministic undertaking: I give a computer a set of instructions and it executes those instructions. Software engineering is different. One of the most important realizations I've had is that while software is deterministic, you can't treat it as deterministic in any sort of practical sense if you want to build robust software.

Here's an anecdote that, while simple, hits on a lot of what software engineering is really about. At Twitter my team operated a Storm cluster used by many teams throughout the company for production workloads. Storm depends on Zookeeper to store various pieces of state relating to Storm's operation. One of the pieces of state stored is information about recent errors in application workers. This information feeds a Storm UI which users look at to see if their applications have any errors in them (the UI also showed other things such as statistics of running applications). Whenever an error bubbles out of application code in a worker, Storm automatically reports that error into Zookeeper. If a user is suppressing the error in their application code, they can call a "reportError" method to manually add that error information into Zookeeper.

There was a serious omission in this design: we did not properly consider how that reportError method might be abused. One day we suddenly received a flood of alerts for the Storm cluster. The cluster was having serious problems and no one's application was running properly. Workers were constantly crashing and restarting.

All the errors were Zookeeper related. I looked at the metrics for Zookeeper and saw it was completely overloaded with traffic. It was very strange and I didn't know what could possibly be overloading it like that. I took a look at which Zookeeper nodes were receiving the most API calls, and it turned out almost all the traffic was coming to the set of nodes used to store errors for one particular application running on the cluster. I shut that application down and the cluster immediately went back to normal.

The question now was why that application was reporting so many errors. I took a closer look at the application and discovered that all the errors being reported were null pointer exceptions – a user had submitted an application with a bug in it causing it to throw that exception for every input tuple. In addition, the application was catching every exception, suppressing it, and manually calling reportError. This was causing reportError to be called at the same rate at which tuples were being received – which was a lot.

An unfortunate interaction between two mistakes led to a major failure of a production system. First, a user deployed buggy, sloppy code to the cluster. Second, the reportError method had an assumption in it that errors were rare and thereby the amount of traffic to that method would be inconsequential. The user's buggy code broke that assumption, overloading Zookeeper and causing a cascading failure that took down every other application on the cluster. We fixed the problem by throttling the rate at which errors could be reported to Zookeeper: errors reported beyond that rate would be logged locally but not written to Zookeeper. This made reportError robust to high traffic and eliminated the possibility for cascading failure due to abuse of that functionality.
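The throttling fix can be illustrated with a minimal sketch. This is not Storm's actual code: the class name, the fixed-window policy, the default limits, and the `remote_write` callable (a stand-in for the write to Zookeeper) are all assumptions made for illustration. Errors within the limit go to the remote store; errors beyond it are only recorded locally.

```python
import time


class ThrottledErrorReporter:
    """Forward at most `max_per_interval` errors per time window to a remote store.

    Errors beyond the limit are kept in a local log and never sent remotely.
    All names and defaults here are hypothetical, not Storm's real API.
    """

    def __init__(self, remote_write, max_per_interval=5, interval_secs=10.0,
                 clock=time.monotonic):
        self._remote_write = remote_write  # hypothetical stand-in for a Zookeeper write
        self._max = max_per_interval
        self._interval = interval_secs
        self._clock = clock                # injectable so the sketch can be tested
        self._window_start = clock()
        self._sent_in_window = 0
        self.local_log = []                # stand-in for a local log file

    def report_error(self, error):
        """Return True if the error was sent remotely, False if only logged locally."""
        now = self._clock()
        # Start a fresh window once the current one has elapsed.
        if now - self._window_start >= self._interval:
            self._window_start = now
            self._sent_in_window = 0

        if self._sent_in_window < self._max:
            self._sent_in_window += 1
            self._remote_write(error)
            return True

        # Over the limit: record locally and leave the remote store alone.
        self.local_log.append(error)
        return False


if __name__ == "__main__":
    remote = []
    reporter = ThrottledErrorReporter(remote.append, max_per_interval=2)
    for i in range(1000):
        reporter.report_error(f"NullPointerException on tuple {i}")
    print(len(remote), "errors sent remotely;", len(reporter.local_log), "logged locally")
```

The point of the design is that the caller's error rate no longer determines the load on the shared service: however fast the buggy application calls `report_error`, the remote store sees a bounded number of writes per window. A real implementation would also need to be thread-safe, which this sketch is not.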

As this story illustrates, there's a lot of uncertainty in software engineering. You think your code is correct – yet it still has bugs in it. How your software is actually used differs from the model in your head when you wrote the code. You made all sorts of assumptions while writing the software, some of which are broken in practice. Your dependencies, which you use as a black box, randomly fail due to a misunderstanding of their functional input range. The most salient feature of software engineering is the degree to which uncertainty permeates every aspect of the construction of software, from designing it to implementing it to operating it in a production environment.

Learning from other fields of engineering

It's useful to look at other forms of engineering to learn more about software engineering. Take bridge engineering, for example. The output of a bridge is a stable platform for crossing a chasm. Even though a bridge is a static structure, there are many inputs: the weight of the vehicles crossing, wind, rain, snow, the occasional earthquake, and so on. A bridge is engineered to operate correctly under certain ranges of those inputs. There's always some magnitude of an earthquake for which a bridge will not survive, and that's deemed okay because that's such a low probability event. Likewise, most bridges won't survive being hit by a missile.

Software is similar. Software operates correctly only within a certain range of inputs. Outside those inputs, it won't operate correctly, whether it's failure, security holes, or just poor performance. In my Zookeeper example, the Zookeeper cluster was hit with more traffic than it could handle, leading to application failure. Similarly, a distributed database can only handle so many hardware failures in a short amount of time before failing in some respect, like losing data. That's fine though, because you tune the replication factor until the probability of such an event is low enough.
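As a back-of-the-envelope illustration of tuning a replication factor, here is a small sketch. It rests on a strong simplifying assumption that is mine, not the post's: each replica fails independently with the same probability within the window of interest, and data is lost only if every replica fails. Real failures are often correlated (shared racks, power, or bugs), so treat the numbers as optimistic.

```python
def loss_probability(p_node_failure, replication_factor):
    """Probability that all replicas fail, assuming independent failures."""
    if not 0.0 < p_node_failure < 1.0:
        raise ValueError("p_node_failure must be strictly between 0 and 1")
    return p_node_failure ** replication_factor


def min_replication_factor(p_node_failure, target, max_factor=10):
    """Smallest replication factor whose loss probability is at or below `target`."""
    for factor in range(1, max_factor + 1):
        if loss_probability(p_node_failure, factor) <= target:
            return factor
    raise ValueError("target not reachable within max_factor replicas")


if __name__ == "__main__":
    # Hypothetical numbers: 1% chance a node fails in the window,
    # and we tolerate at most a 1-in-100,000 chance of losing the data.
    print(min_replication_factor(0.01, 1e-5))
```

Under these assumptions each extra replica multiplies the loss probability by the per-node failure probability, which is why a small increase in replication factor buys a large reduction in risk.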

Another useful field to look at is rocket engineering. It takes a lot of failure and iteration to build a rocket that works. SpaceX, for example, had three failed rocket launch attempts before they finally reached orbit. The cause of failure was always something unexpected, some series of inputs that the engineers didn't account for. Rockets are filled to the brim with telemetry so that failures can be understood, learned from, and fixed. Each failure lets the engineers understand the input ranges to the rocket a little better and better engineer the rocket to handle a greater and greater part of the input space. A rocket is never finished – you never know when there will be some low probability series of inputs you haven't experienced yet that will lead to failure. STS-107 was the 113th launch of the Space Shuttle, yet it ended in disaster.

Software is very similar. Making software robust is an iterative process: you build and test it as best you can, but inevitably in production you'll discover new areas of the input space that lead to failure. Like rockets, it's crucial to have excellent monitoring in place so that these issues can be diagnosed. Over time, the uncertainty in the input space goes down, and software gets "hardened". SQL injection attacks and viruses are great examples of things that take advantage of software that operates incorrectly for part of its input space.

There's always going to be some part of the input space for which software fails – as an engineer you have to balance the probabilities and cost tradeoffs to determine where to draw that line. For all of your dependencies, you had better understand the input ranges within which they operate to spec, and design your software accordingly.

Sources of uncertainty in software

There are many sources of uncertainty in software. The biggest is that we just don't know how to make perfect software: bugs can and will be deployed to production. No matter how much testing you do, bugs will slip through. Because of this fact of software development, all software must be viewed as probabilistic. The code you write only has some probability of being correct for all inputs. Sometimes seemingly minor failures will interact in ways that lead to much greater failures like in my Zookeeper example.

Another source of uncertainty is the fact that humans are involved in running software in production. Humans make mistakes – almost every software engineer has accidentally deleted data from a database at some point. I've also experienced many episodes where an engineer accidentally launched a program that overloaded a critical internal service, leading to cascading failures.

Another source of uncertainty is what functionality your software should even have – very rarely are the specs fully understood and fleshed out from the get go. Instead you have to learn as you go, iterate, and refine. This has huge implications on how software should be constructed and creates tension between the desire to create reusable components and the desire to avoid wasted work.

There's uncertainty in all the dependencies you're using. Your dependencies will have bugs in them or will have unexpected behavior for certain inputs. The first time I hit a file handle limit error on Linux is an example of not understanding the limits of a dependency.

Finally, another big source of uncertainty is not understanding the range of inputs your software will see in production. This leads to anything from incorrect functionality to poor performance to security holes like injection or denial of service attacks.

This is by no means an exhaustive overview of sources of uncertainty in software, but it's clear that uncertainty permeates all of the software engineering process.

Engineering for uncertainty

You can do a much better job building robust software by being cognizant of the uncertain nature of software. I've learned many techniques over the years on how to design software better given the inherent uncertainties. I think these techniques should be part of the bread and butter skills for any software engineer, but I think too many engineers fall into the "software is deterministic" reasoning trap and fail to account for the implications of unexpected events happening in production.

Minimize dependencies

One technique for making software more robust is to minimize what your software depends on – the less that can go wrong, the less that will go wrong. Minimizing dependencies is more nuanced than just not depending on System X or System Y: it also includes minimizing dependencies on features of the systems you are using.

Storm's usage of Zookeeper is a good example of this. The location of all workers throughout the cluster is stored in Zookeeper. When a worker gets reassigned, other workers must discover the new location as quickly as possible so that they can send messages to the correct place. There are two ways for workers to do this discovery, either via the pull method or the push method. In the pull method, workers periodically poll Zookeeper to get the updated worker locations. In the push method, a Zookeeper feature called "watches" is used for Zookeeper to send the information to all workers whenever the locations change. The push method immediately propagates the information, making it faster than the pull method, but it introduces a dependency on another feature of Zookeeper.

Storm uses both methods to propagate the worker location information. Every few seconds, Storm polls for updated worker information. In addition to this, Storm uses Zookeeper watches as an optimization to try to get the location information as fast as possible. This design ensures that even if the Zookeeper watch feature fails to work, a worker will still get the correct location information (albeit a bit slower in that particular instance). So Storm is able to take advantage of the watch feature without being fundamentally dependent on it. Most of the time the watch feature will work correctly and information will propagate quickly, but in the case that watches fail Storm will still work. This design turned out to be farsighted, as there was a serious bug in watches that would have affected Storm.
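A rough sketch of this "poll for correctness, push for speed" design is below. It deliberately avoids any real Zookeeper client API: `fetch_locations` is a hypothetical callable that returns the current worker locations, and `on_watch_fired` is a hypothetical hook that a watch notification would invoke. It is not how Storm is actually implemented.

```python
import threading


class WorkerLocationCache:
    """Local view of worker locations, kept fresh by polling plus optional pushes."""

    def __init__(self, fetch_locations):
        # Hypothetical callable returning a {worker_id: address} mapping.
        self._fetch = fetch_locations
        self.locations = {}
        self.refresh()

    def refresh(self):
        """Pull path: re-read the full location map. Correctness depends only on this."""
        self.locations = dict(self._fetch())

    def on_watch_fired(self, _event=None):
        """Push path: invoked when a change notification arrives.

        Purely an optimization. If notifications are lost, the periodic
        poll still brings the cache up to date.
        """
        self.refresh()


def run_poll_loop(cache, interval_secs, stop_event):
    """Refresh the cache every `interval_secs` until `stop_event` is set."""
    # Event.wait returns False on timeout, so the loop body runs once per interval.
    while not stop_event.wait(interval_secs):
        cache.refresh()


if __name__ == "__main__":
    store = {"worker-1": "host-a:6700"}        # stand-in for the coordination service
    cache = WorkerLocationCache(lambda: store)
    stop = threading.Event()
    poller = threading.Thread(target=run_poll_loop, args=(cache, 0.05, stop))
    poller.start()
    store["worker-1"] = "host-b:6700"          # reassignment with no watch delivered
    stop.wait(0.2)                             # give the poller time to catch up
    stop.set()
    poller.join()
    print(cache.locations)
```

Both paths end in the same `refresh` call, so the extra code needed to be independent of the push feature is only the small polling loop.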

There's always a tradeoff between minimizing dependencies and minimizing the amount of code you need to produce to implement your application. In this case, taking the dual approach to location propagation was a good choice because it was a very small amount of code to achieve independence from that feature. On the other hand, removing Zookeeper as a dependency completely would not have been a good idea, as replicating that functionality would have been a huge amount of work and less reliable than just using a widely-used open-source project.

Lessen probability of cascading failures

A cascading failure is one of the worst things that can happen in production – when it happens it feels like the world is falling apart. In my experience, one of the most common causes of cascading failures is an accidental denial of service attack like the one in my reportError example. The ultimate cause in these cases is a failure to respect the functional input range for components in your system. You can greatly reduce cascading failures by making interactions between components in your system explicitly respect those input ranges, using self-throttling to avoid accidental DoS'ing. This is the approach I used in my reportError example.

Another great technique for avoiding cascading failures is to isolate your components as much as possible and take away the ability for different components to affect each other. This is often easier said than done, but when possible it is a very useful technique.

Measure and monitor

When something unexpected happens in production, it's critical to have thorough monitoring in place so that you can figure out what happened. As software hardens more and more, unexpected events will get more and more infrequent and reproducing those events will become harder and harder. So when one of those unexpected events happens, you want as much data about the event as possible.

Software should be designed from the start to be monitored. I consider the monitoring aspects of software just as important as the functionality of the software itself. And everything should be measured – latencies, throughput stats, buffer sizes, and anything else relevant to the application. Monitoring is the most important defense against software's inherent uncertainty.

In the same vein, it's important to do measurements of all your components to gain an understanding of their functional input ranges. What throughputs can each component handle? How is latency affected by more traffic? How can you break those components? Doing this measurement work isn't glamorous but is essential to solid engineering.
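As one way to start answering those questions, here is a minimal measurement sketch. The function names, the nearest-rank percentile method, and the choice of summary statistics are my assumptions, and the throughput figure only describes calls made one after another from a single thread, not what the component can sustain under concurrent load.

```python
import time


def measure_latencies(operation, num_calls):
    """Call `operation` repeatedly and return the latency of each call in seconds."""
    latencies = []
    for _ in range(num_calls):
        start = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Summarize latencies with approximate (nearest-rank style) percentiles."""
    if not latencies:
        raise ValueError("need at least one measurement")
    ordered = sorted(latencies)

    def percentile(p):
        index = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[index]

    total = sum(ordered)
    return {
        "count": len(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
        # Serial throughput: calls completed per second of time spent in calls.
        "serial_throughput_per_sec": len(ordered) / total if total > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Hypothetical component under test: sorting a fixed list.
    data = list(range(10_000, 0, -1))
    print(summarize(measure_latencies(lambda: sorted(data), 200)))
```

Repeating the measurement at increasing request rates or input sizes, and watching where the tail latencies start to climb, gives a first estimate of a component's functional input range.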


Software engineering is a constant battle against uncertainty – uncertainty about your specs, uncertainty about your implementation, uncertainty about your dependencies, and uncertainty about your inputs. Recognizing and planning for these uncertainties will make your software more reliable – and make you a better engineer.

You should follow me on Twitter here.


My new startup

There's been a lot of speculation about what my new startup is doing, so I've decided to set the record straight and reveal all. We are working on one of the biggest problems on Earth, a problem that affects nearly every person on this planet. Our products will significantly improve the quality of life for billions of people.

We are going to revolutionize the bedsheet industry.

Think about it. There's been almost no innovation in bedsheets in thousands of years. There's nothing worse than waking up to discover one of the corners of your Egyptian cotton fitted sheets has slipped off the mattress. How is this not a solved problem yet? Why are we still using sheets with that annoying elastic in them to secure them to our mattresses? They slip all the time – and if you have a deep mattress, good luck finding sheets that even fit. You're just screwed.

Consider the impact of solving this problem, of a bedsheet product that never slips, that always stays secure on the mattress. This translates to better sleep, to less grogginess in the morning, to feeling more upbeat in the morning. This translates into fewer morning arguments between husbands and wives that spiral into divorces, child custody battles, and decades of trauma for the children.

Not only is this a big problem – it's a big opportunity. We've done extensive market research and discovered that our target market is the entire human population. At 7 billion people and an estimated average sale of $20 per sheet, this is at least a $140,000,000,000 opportunity.

We are going to solve this problem using modern, 21st century techniques and take bedsheets out of the Stone Age and into the future. We are going to make it possible to attach your sheets to your bed, completely solving the problem.

Solving this problem in a practical and cost-efficient way is not easy and will require significant engineering breakthroughs. If you're a world-class, rock star by day and ninja by night engineer who's as passionate about bedsheets as I am, please get in touch. I'd love to talk to you.

Bedsheets have been my true passion since I was a child. I'm excited to finally be focused on what I really care about, and I can't wait until the day when untucked sheets are a curious relic of the past.


Leaving Twitter

Yesterday was my last day at Twitter. I left to start my own company. What I'll be working on is very exciting (though I'm keeping it secret for now).

Leaving Twitter was a tough decision. I worked with a whole bunch of great people on fascinating problems with some of the most interesting data in the world. Ultimately though, I felt that if I didn't make this move, I would regret it for the rest of my life. So I put in my papers about a month ago and then spent a month transitioning my team for my departure.

This ends an eventful three years that started with me joining BackType in January of 2010. So much has happened in these past three years. I open-sourced Cascalog, ElephantDB, and Storm, started writing a book, gave a lot of talks, and in July of 2011 experienced the thrill of being acquired. My projects spread beyond BackType and Twitter to be relied on by dozens and dozens of companies. Through all this, I learned an enormous amount about entrepreneurship, product development, marketing, recruiting, and project management.

Stay tuned.


Storm's 1st birthday

Storm was open-sourced exactly one year ago today. It's been an action-packed year for Storm, to say the least. Here's some of the exciting stuff that's happened over the past year:

  • 27 companies have publicized that they're using Storm in production. I know of at least a few more companies using it that haven't published anything yet.
  • O'Reilly published a book on Storm.
  • The Storm mailing list has over 1300 members, with over 500 messages per month.
  • The @stormprocessor account has over 1200 followers.
  • More than 4000 people have starred the project on GitHub.
  • There's a regular Storm meetup in the Bay Area with over 230 members. I've also seen lots of Storm-focused meetups happen all over the world over the past year.
  • 29 people all over the world have contributed to the codebase.
  • We released Trident, a high level abstraction for realtime computation, that is a major leap forward in what's possible in realtime.
  • Libraries have been released integrating Storm with Kestrel, Kafka, JMS, Cassandra, Memcached, and many more systems. For many, Storm is becoming the system of choice for connecting these systems together.
  • Storm's performance has been increased by over 10x. I've benchmarked it at 1M messages per second per node on an internal Twitter cluster.

What I overwhelmingly hear from people is that they like Storm because it's simple to understand, flexible, and extremely robust in production. These have always been some of the core design goals of Storm, so I'm glad that we were able to succeed on these points.

We've got lots of exciting stuff planned over the next year. We have a new metrics system in development which will let you get deep insight into what's happening throughout your topology in realtime. And we have big plans for improving Trident and integrating it with more datastores and input sources.

Happy birthday Storm!


Suffering-oriented programming

Someone asked me an interesting question the other day: "How did you justify taking such a huge risk on building Storm while working on a startup?" (Storm is a realtime computation system.) I can see how from an outsider's perspective investing in such a massive project seems extremely risky for a startup. From my perspective, though, building Storm wasn't risky at all. It was challenging, but not risky.

I follow a style of development that greatly reduces the risk of big projects like Storm. I call this style "suffering-oriented programming." Suffering-oriented programming can be summarized like so: don't build technology unless you feel the pain of not having it. It applies to the big, architectural decisions as well as the smaller everyday programming decisions. Suffering-oriented programming greatly reduces risk by ensuring that you're always working on something important, and it ensures that you are well-versed in a problem space before attempting a large investment.

I have a mantra for suffering-oriented programming: "First make it possible. Then make it beautiful. Then make it fast."

First make it possible

When encountering a problem domain with which you're unfamiliar, it's a mistake to try to build a "general" or "extensible" solution right off the bat. You just don't understand the problem domain well enough to anticipate what your needs will be in the future. You'll make things generic that needn't be, adding complexity and wasting time.

It's better to just "hack things out" and be very direct about solving the problems you have at hand. This allows you to get done what you need to get done and avoid wasted work. As you're hacking things out, you'll learn more and more about the intricacies of the problem space.

The "make it possible" phase for Storm was one year of hacking out a stream processing system using queues and workers. We learned about guaranteeing data processing using an "ack" protocol. We learned to scale our realtime computations with clusters of queues and workers. We learned that sometimes you need to partition a message stream in different ways, sometimes randomly and sometimes using a hash/mod technique that makes sure the same entity always goes to the same worker.
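The two partitioning styles mentioned above can be sketched in a few lines. The function names are mine, and the use of CRC32 as the hash is an assumption chosen because it is stable across processes (Python's built-in `hash()` for strings is randomized per process by default, which would break the "same entity, same worker" guarantee between machines).

```python
import random
import zlib


def random_partition(num_workers, rng=random):
    """Pick any worker: spreads load evenly but gives no per-entity guarantee."""
    return rng.randrange(num_workers)


def hash_mod_partition(key, num_workers):
    """Route by hash(key) mod N so the same entity always lands on the same worker."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


if __name__ == "__main__":
    workers = 4
    for url in ["http://a.example/1", "http://b.example/2", "http://a.example/1"]:
        # The repeated URL is routed to the same worker both times.
        print(url, "->", hash_mod_partition(url, workers))
```

Random partitioning suits stateless work, while hash/mod partitioning is what makes per-entity state (for example, a running count per URL) possible on a single worker. Note that changing the number of workers changes the mapping, so any such state has to be rebuilt or moved.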

We didn't even know we were in the "make it possible" phase. We were just focused on building our products. The pain of the queues and workers system became acute very quickly though. Scaling the queues and workers system was tedious, and the fault-tolerance was nowhere near what we wanted. It was evident that the queues and workers paradigm was not at the right level of abstraction, as most of our code had to do with routing messages and serialization and not the actual business logic we cared about.

At the same time, developing our product drove us to discover new use cases in the "realtime computation" problem space. We built a feature for our product that would compute the reach of a URL on Twitter. Reach is the number of unique people exposed to a URL on Twitter. It's a difficult computation that can require hundreds of database calls and tens of millions of impressions to deduplicate just for one computation. Our original implementation that ran on a single machine would take over a minute for hard URLs, and it was clear that we needed a distributed system of some sort to parallelize the computation to make it fast.
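To make the shape of the reach computation concrete, here is a single-machine sketch. It assumes that "exposed" means following someone who tweeted the URL, and the two lookup callables (`tweeters_of`, `followers_of`) are hypothetical stand-ins for database calls; neither detail is spelled out in this post.

```python
def reach(url, tweeters_of, followers_of):
    """Count unique people exposed to `url`.

    `tweeters_of(url)` and `followers_of(user)` are hypothetical lookups,
    each standing in for one database call.
    """
    exposed = set()
    for tweeter in tweeters_of(url):
        # One lookup per tweeter; the set removes people who follow several tweeters.
        exposed.update(followers_of(tweeter))
    return len(exposed)


if __name__ == "__main__":
    tweeters = {"http://example.com": ["alice", "bob"]}
    followers = {
        "alice": ["carol", "dave", "erin"],
        "bob": ["dave", "erin", "frank"],
    }
    print(reach("http://example.com",
                lambda u: tweeters.get(u, []),
                lambda user: followers.get(user, [])))
```

On one machine, both the per-tweeter lookups and the deduplicating set grow with the popularity of the URL, which is where the cost described above comes from. The follower lookups are independent of each other, so they are the natural part to parallelize.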

One of the key realizations that sparked Storm was that the "reach problem" and the "stream processing" problem could be unified by a simple abstraction.

Then make it beautiful

You develop a "map" of the problem space as you explore it by hacking things out. Over time, you acquire more and more use cases within the problem domain and develop a deep understanding of the intricacies of building these systems. This deep understanding can guide the creation of "beautiful" technology to replace your existing systems, alleviate your suffering, and enable new systems/features that were too hard to build before.

The key to developing the "beautiful" solution is figuring out the simplest set of abstractions that solve the concrete use cases you already have. It's a mistake to try to anticipate use cases you don't actually have; if you do, you'll end up overengineering your solution. As a rule of thumb, the bigger the investment you're trying to make, the more deeply you need to understand the problem domain and the more diverse your use cases need to be. Otherwise you risk the second-system effect.

"Making it beautiful" is where you use your design and abstraction skills to distill the problem space into simple abstractions that can be composed together. I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you're looking for the simplest curve that fits those points (a set of abstractions).

The more use cases you have, the better you'll be able to find the right curve to fit those points. If you don't have enough points, you're likely to either overfit or underfit the graph, leading to wasted work and overengineering.

A big part of making it beautiful is understanding the performance and resource characteristics of the problem space. This is one of the intricacies you learn in the "making it possible" phase, and you should take advantage of that learning when designing your beautiful solution.

With Storm, I distilled the realtime computation problem domain into a small set of abstractions: streams, spouts, bolts, and topologies. I devised a new algorithm for guaranteeing data processing that eliminated the need for intermediate message brokers, the part of our system that caused the most complexity and suffering. That both stream processing and reach, two very different problems on the surface, mapped so elegantly to Storm was a strong indicator that I was onto something big.

I took additional steps to acquire more use cases for Storm and validate my designs. I canvassed other engineers to learn about the particulars of the realtime problems they were dealing with. I didn't just ask people I knew. I also tweeted out that I was working on a new realtime system and wanted to learn about other people's use cases. This led to a lot of interesting discussions that educated me more on the problem domain and validated my design ideas.

Then make it fast

Once you've built out your beautiful design, you can safely invest time in profiling and optimization. Optimizing any earlier just wastes time, because you might still rethink the design. This is called premature optimization.

"Making it fast" isn't about the high-level performance characteristics of a system. The understanding of those issues should have been acquired in the "make it possible" phase and designed for in the "make it beautiful" phase. "Making it fast" is about micro-optimizations and tightening up the code to be more resource efficient. So you might worry about things like asymptotic complexity in the "make it beautiful" phase and focus on the constant factors in the "make it fast" phase.

Rinse and repeat

Suffering-oriented programming is a continuous process. The beautiful systems you build give you new capabilities, which allow you to "make it possible" in new and deeper areas of the problem space. This feeds learning back to the technology. You often have to tweak or add to the abstractions you've already come up with to handle more and more use cases.

Storm has gone through many iterations like this. When we first started using Storm, we discovered that we needed the capability to emit multiple, independent streams from a single component. We discovered that the addition of a special kind of stream called the "direct stream" would allow Storm to process batches of tuples as a concrete unit. Recently I developed "transactional topologies" which go beyond Storm's at-least-once processing guarantee and allow exactly-once messaging semantics to be achieved for nearly arbitrary realtime computation.
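The gap between at-least-once and exactly-once can be shown with a toy counter. This is not Storm's API or a description of its implementation. It is a sketch of one common technique, under the assumptions that batches arrive in order, that a replayed batch carries the same transaction id and contents as the original, and that the count and the id are stored together atomically. The class names are made up.

```python
class AtLeastOnceCounter:
    """Counts every delivery, so a replayed batch is counted twice."""

    def __init__(self):
        self.count = 0

    def apply(self, batch):
        self.count += len(batch)


class TransactionalCounter:
    """Stores the last applied transaction id next to the count."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply(self, txid, batch):
        # A replay carries the txid that was already applied: skip it.
        if txid == self.last_txid:
            return
        self.count += len(batch)
        self.last_txid = txid


# Batch 2 is delivered twice, as can happen after a failure and retry.
deliveries = [(1, ["a", "b"]), (2, ["c"]), (2, ["c"]), (3, ["d", "e"])]

naive = AtLeastOnceCounter()
transactional = TransactionalCounter()
for txid, batch in deliveries:
    naive.apply(batch)
    transactional.apply(txid, batch)
```

Five distinct tuples were sent. The naive counter ends at six because of the replay, while the transactional counter ends at five.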

By its nature, hacking things out in a problem domain you don't understand so well and constantly iterating can lead to some sloppy code. The most important characteristic of a suffering-oriented programmer is a relentless focus on refactoring. This is critical to prevent accidental complexity from sabotaging the codebase.


Use cases are everything in suffering-oriented programming. They're worth their weight in gold. The only way to acquire use cases is by gaining experience through hacking.

There's a certain evolution most programmers go through. You start off struggling to get things to work and have absolutely no structure to your code. Code is sloppy and copy/pasting is prevalent. Eventually you learn about the benefits of structured programming and sharing logic as much as possible. Then you learn about making generic abstractions and using encapsulation to make it easier to reason about systems. Then you become obsessed with making all your code generic, with making things extensible to future-proof your programs.

Suffering-oriented programming rejects the idea that you can effectively anticipate needs you don't currently have. It recognizes that attempts to make things generic without a deep understanding of the problem domain will lead to complexity and waste. Designs must always be driven by real, tangible use cases.

You should follow me on Twitter here.


Early access edition of my book is available

The early access edition of my book Big Data: principles and best practices of scalable realtime data systems is now available from Manning! I've been working on this book for quite some time, and I'm excited to have it out there and start getting some feedback.

The interest in the book has already been overwhelming, and I've been answering questions about it on Hacker News.


How to beat the CAP theorem

The CAP theorem states that a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can't sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn't available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn't in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you've ever used a product that told you to "try again later", you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is "eventual consistency". If you use an eventually consistent database, then sometimes you'll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called "read repair").
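To give a feel for what that repair code involves, here is a minimal sketch of comparing vector clocks and merging two diverged replicas. It is a simplification with assumed names: clocks are plain dictionaries from node name to counter, the values are sets merged by union, and a real system would also bump the coordinating node's counter when it writes the merged value back.

```python
def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a is an ancestor of b
    if b_le_a:
        return "after"       # b is an ancestor of a
    return "concurrent"      # neither saw the other's update


def merge_clocks(a, b):
    """Element-wise maximum of two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def read_repair(replica_a, replica_b, merge_values):
    """Pick the newer replica, or merge the two if they have diverged."""
    (value_a, clock_a), (value_b, clock_b) = replica_a, replica_b
    order = compare(clock_a, clock_b)
    if order in ("after", "equal"):
        return replica_a
    if order == "before":
        return replica_b
    # Diverged: the application-supplied merge decides what the value means.
    return merge_values(value_a, value_b), merge_clocks(clock_a, clock_b)


# Two replicas that each received an update the other missed.
replica_a = ({"x", "y"}, {"n1": 2, "n2": 0})
replica_b = ({"x", "z"}, {"n1": 1, "n2": 1})
repaired = read_repair(replica_a, replica_b, lambda a, b: a | b)
```

Here the clocks are concurrent, so the result is the union `{"x", "y", "z"}` with clock `{"n1": 2, "n2": 1}`. Union happens to be a safe merge for sets that only grow. For most data types the merge function is where the application-specific, error-prone logic lives.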

I believe that maintaining eventual consistency in the application layer is too heavy a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I'm saying that you're damned if you do and damned if you don't. The CAP theorem is a fact of nature, so what alternative can there possibly be?

There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems. The complexity caused by the CAP theorem is a symptom of fundamental problems in how we approach building data systems. Two problems stand out in particular: the use of mutable state in databases and the use of incremental algorithms to update that state. It is the interaction between these problems and the CAP theorem that causes complexity.

In this post I'll show the design of a system that beats the CAP theorem by preventing the complexity it normally causes. But I won't stop there. The CAP theorem is a result about the degree to which data systems can be fault-tolerant to machine failure. Yet there's a form of fault-tolerance that's much more important than machine fault-tolerance: human fault-tolerance. If there's any certainty in software development, it's that developers aren't perfect and bugs will inevitably reach production. Our data systems must be resilient to buggy programs that write bad data, and the system I'm going to show is as human fault-tolerant as you can get.

This post is going to challenge your basic assumptions on how data systems should be built. But by breaking down our current ways of thinking and re-imagining how data systems should be built, what emerges is an architecture more elegant, scalable, and robust than you ever thought possible.



My talks at POSSCON

Last week I went to POSSCON in Columbia, South Carolina. It was an interesting experience and a good reminder that not everyone in the world thinks like we do in Silicon Valley.

I gave two talks at the conference. One was a technical talk about how to build realtime Big Data systems, and the other was a non-technical talk about the things we do at BackType to be a super-productive team. Both slide decks are embedded below.


Inglourious Software Patents

Most articles arguing for the abolishment of software patents focus on how so many software patents don't meet the "non-obvious and non-trivial" guidelines for patents. The problem with this approach is that the same argument could be used to advocate for reform in how software patents are evaluated rather than the abolishment of software patents altogether.

Software patents should be abolished though, and I'm going to show this with an economic analysis. We'll see that even non-obvious and non-trivial software patents should never be granted as they can only cause economic loss.

Why do patents exist in the first place?

The patent system exists to provide an incentive for innovation where that incentive would not have existed otherwise.

Imagine you're an individual living in the 19th century. Let's say the patent system does not exist and you have an idea to make a radically better kind of sewing machine. If you invested the time to develop your idea into a working invention, the existing sewing machine companies would just steal your design and crush you in the marketplace. They have massive distribution and production advantages that you wouldn't be able to compete with. You wouldn't be able to monetize the initial investment you made into developing that invention. Therefore, you wouldn't have invented the radically better sewing machine in the first place.

From this perspective, patents are actually a rather clever hack on society to encourage innovation. By excluding others from using your invention for a fixed amount of time, you get a temporary monopoly on your invention. This lets you monetize your invention which makes your initial investment worthwhile. This in turn benefits society as a whole, as now society has inventions that it wouldn't have had otherwise.

The patent system does not exist to protect intellectual property as a goal unto itself. If the incentive to create the innovation was there without the patent system, then the patent system is serving no purpose.

After all, there is a cost to the patent system. There's no hard and fast way to determine whether an invention required the promise of a patent for its creation, so inevitably some patents will be awarded to inventions that would have been created anyway. The patent system creates monopolies out of these inventions that would have existed in a competitive marketplace otherwise. These are "accidental monopolies" in the sense that they are unintended consequences of a patent system trying to encourage innovation that wouldn't have occurred without the patent system.

Accidental monopolies are the cost of the patent system. For the patent system to be worthwhile, the amount of benefit from inventions that wouldn't have existed otherwise should exceed the cost of accidental monopolies. The purpose of the "non-obvious and non-trivial" guidelines is to try to minimize the number of patents awarded that create accidental monopolies.

Innovation in software

Patents are not necessary for innovation to occur in software. You'll have a hard time finding many examples of software innovations that wouldn't have been made without the promise of a patent. This means that every software patent creates an accidental monopoly.

A good place to look at the importance of patents to software innovation is startups. Startups must innovate if they want to become sustainable businesses. The question is -- do patents encourage innovation in startups by protecting them from having their ideas stolen?

Quite the opposite. Software startups are thriving nowadays in spite of software patents rather than because of them. Instead of helping startups get off the ground, patents are a cost. Startups must build "defensive patent portfolios" and worry about getting sued by patent trolls or businesses trying to entrench their position. Rather than being a protective shield for a startup, patents are a weapon that causes economic waste.

It's hard for a big company to just steal a software idea. Being big just isn't the advantage in the software industry that it was in our sewing machine example. Big companies don't have the same production and distribution advantages, since the internet makes it cost practically nothing to distribute software. Furthermore, it's not that easy to just copy a software product. Look at what happened with Google Buzz.

At my company BackType, we're doing a lot of innovative things with Big Data systems. Rather than try to patent our ideas and achieve an exclusive monopoly on what we invent, we're doing the opposite. We're sharing these inventions with the world by open-sourcing them. We do this because it helps so much with recruiting: it establishes us as a serious technology company, and programmers want to work at companies where they can contribute to open source.

Vivek Wadhwa has a good post showing the stats on how counter-productive patents are in the software industry.

Since massive amounts of software innovation occur in spite of the patent system, software patents are irrational.

The openness argument

Another purported reason for the existence of patents is that they encourage inventions to be shared rather than kept secret. This ensures the invention enters the public domain and prevents the invention from ever being lost.

This argument doesn't hold with software. Just look at the facts:

  • The academic community publishes its innovations to the public.
  • There is a massive and rapidly growing amount of innovative open source software.
  • Companies have strong incentives to participate in open source.

When I'm looking for innovative software approaches, I search the Internet or I look at research papers. I never look at software patents, and I don't know anyone in the software industry who would.


The fashion industry is an excellent example of an industry that has no patents and thrives.

Even non-obvious and non-trivial software ideas should not be patentable, because the promise of a patent is not necessary for innovation in software. The economics are clear: software patents should be abolished.
