Early access edition of my book is available

The early access edition of my book Big Data: principles and best practices of scalable realtime data systems is now available from Manning! I've been working on this book for quite some time, and I'm excited to have it out there and start getting some feedback.

The interest in the book has already been overwhelming, and I've been answering questions about it on Hacker News.


How to beat the CAP theorem

The CAP theorem states that a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can't sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn't available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn't in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you've ever used a product that told you to "try again later", you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is "eventual consistency". If you use an eventually consistent database, then sometimes you'll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called "read repair").
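To make the repair step concrete, here is a minimal sketch of what application-level read repair can look like. Everything in it is an assumption for illustration: the stored value is a set, a vector clock is a plain dict of per-node counters, and the names `descends`, `merge_clocks`, and `read_repair` are hypothetical rather than the API of any particular database.

```python
from typing import Dict, Set, Tuple

VectorClock = Dict[str, int]           # node name -> number of writes seen from that node
Replica = Tuple[Set[str], VectorClock]  # (value, clock); the value is assumed to be a set


def descends(a: VectorClock, b: VectorClock) -> bool:
    """Return True if clock `a` has seen every write that clock `b` has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge_clocks(a: VectorClock, b: VectorClock) -> VectorClock:
    """Take the per-node maximum of two clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def read_repair(r1: Replica, r2: Replica) -> Replica:
    """Pick the newer replica, or merge the two if they diverged."""
    (v1, c1), (v2, c2) = r1, r2
    if descends(c1, c2):
        return r1  # r1 already includes everything in r2
    if descends(c2, c1):
        return r2  # r2 already includes everything in r1
    # Concurrent updates: the application has to decide how to merge.
    # Set union is assumed here; it is only correct if nothing is ever removed.
    return (v1 | v2, merge_clocks(c1, c2))


# Two replicas that each received an update the other one missed.
replica_a: Replica = ({"alice", "bob"}, {"n1": 2})
replica_b: Replica = ({"alice", "carol"}, {"n1": 1, "n2": 1})
repaired_value, repaired_clock = read_repair(replica_a, replica_b)
```

Even this toy version hides a trap: merging by union silently brings back any element that one replica had deliberately removed, and once the merged value is written back there is nothing left to tell you it happened.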

I believe that maintaining eventual consistency in the application layer is too heavy a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications on. Yet these are the only two options, so it seems like I'm saying that you're damned if you do and damned if you don't. The CAP theorem is a fact of nature, so what alternative can there possibly be?

There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems. The complexity caused by the CAP theorem is a symptom of fundamental problems in how we approach building data systems. Two problems stand out in particular: the use of mutable state in databases and the use of incremental algorithms to update that state. It is the interaction between these problems and the CAP theorem that causes complexity.
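A rough sketch of what those two problems look like in code, and of one possible contrast. This only illustrates the problems named above and should not be read as the design itself; the `Event` shape, the page-view counting task, and the names `IncrementalCounts` and `recompute_counts` are all made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[str, str]  # (user, url); a hypothetical record shape


class IncrementalCounts:
    """Mutable state that an incremental algorithm updates in place."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, event: Event, buggy: bool = False) -> None:
        _, url = event
        # A bug here (simulated by `buggy`) double-counts the event. The old
        # value is overwritten, so the mistake cannot be undone afterwards.
        self.counts[url] += 2 if buggy else 1


def recompute_counts(log: Iterable[Event]) -> Counter:
    """Derive the counts from scratch from an append-only log of events."""
    return Counter(url for _, url in log)


events: List[Event] = [("alice", "/a"), ("bob", "/a"), ("alice", "/b")]

# Incremental path: the second update goes through the buggy code.
incremental = IncrementalCounts()
for i, event in enumerate(events):
    incremental.record(event, buggy=(i == 1))

# Recompute path: the raw events were never modified, so the correct
# answer can always be rebuilt from them.
recomputed = recompute_counts(events)
```

After the bug, the incremental counter holds a wrong total for `/a` with no record of how it got there, while the count derived from the untouched log is still correct.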

In this post I'll show the design of a system that beats the CAP theorem by preventing the complexity it normally causes. But I won't stop there. The CAP theorem is a result about the degree to which data systems can be fault-tolerant to machine failure. Yet there's a form of fault-tolerance that's much more important than machine fault-tolerance: human fault-tolerance. If there's any certainty in software development, it's that developers aren't perfect and bugs will inevitably reach production. Our data systems must be resilient to buggy programs that write bad data, and the system I'm going to show is as human fault-tolerant as you can get.

This post is going to challenge your basic assumptions about how data systems should be built. But by breaking down our current ways of thinking and re-imagining how data systems should be built, what emerges is an architecture more elegant, scalable, and robust than you ever thought possible.

Click to read more ...


My talks at POSSCON

Last week I went to POSSCON in Columbia, South Carolina. It was an interesting experience and a good reminder that not everyone in the world thinks like we do in Silicon Valley.

I gave two talks at the conference. One was a technical talk about how to build realtime Big Data systems, and the other was a non-technical talk about the things we do at BackType to be a super-productive team. Both slide decks are embedded below.


Inglourious Software Patents

Most articles arguing for the abolishment of software patents focus on how so many software patents don't meet the "non-obvious and non-trivial" guidelines for patents. The problem with this approach is that the same argument could be used to advocate for reform in how software patents are evaluated rather than the abolishment of software patents altogether.

Software patents should be abolished though, and I'm going to show this with an economic analysis. We'll see that even non-obvious and non-trivial software patents should never be granted as they can only cause economic loss.

Why do patents exist in the first place?

The patent system exists to provide an incentive for innovation where that incentive would not have existed otherwise.

Imagine you're an individual living in the 19th century. Let's say the patent system does not exist and you have an idea to make a radically better kind of sewing machine. If you invested the time to develop your idea into a working invention, the existing sewing machine companies would just steal your design and crush you in the marketplace. They have massive distribution and production advantages that you wouldn't be able to compete with. You wouldn't be able to monetize the initial investment you made into developing that invention. Therefore, you wouldn't have invented the radically better sewing machine in the first place.

From this perspective, patents are actually a rather clever hack on society to encourage innovation. By excluding others from using your invention for a fixed amount of time, you get a temporary monopoly on your invention. This lets you monetize your invention which makes your initial investment worthwhile. This in turn benefits society as a whole, as now society has inventions that it wouldn't have had otherwise.

The patent system does not exist to protect intellectual property as a goal unto itself. If the incentive to create the innovation was there without the patent system, then the patent system is serving no purpose.

After all, there is a cost to the patent system. There's no hard and fast way to determine whether an invention required the promise of a patent for its creation, so inevitably some patents will be awarded to inventions that would have been created anyway. The patent system creates monopolies out of these inventions that would have existed in a competitive marketplace otherwise. These are "accidental monopolies" in the sense that they are unintended consequences of a patent system trying to encourage innovation that wouldn't have occurred without the patent system.

Accidental monopolies are the cost of the patent system. For the patent system to be worthwhile, the amount of benefit from inventions that wouldn't have existed otherwise should exceed the cost of accidental monopolies. The purpose of the "non-obvious and non-trivial" guidelines is to try to minimize the number of patents awarded that create accidental monopolies.

Innovation in software

Patents are not necessary for innovation to occur in software. You'll have a hard time finding many examples of software innovations that wouldn't have been made without the promise of a patent. This means that every software patent creates an accidental monopoly.

A good place to look at the importance of patents to software innovation is startups. Startups must innovate if they want to become sustainable businesses. The question is -- do patents encourage innovation in startups by protecting them from having their ideas stolen?

Quite the opposite. Software startups are thriving nowadays in spite of software patents rather than because of them. Instead of helping startups get off the ground, patents are a cost. Startups must build "defensive patent portfolios" and worry about getting sued by patent trolls or businesses trying to entrench their position. Rather than being a protective shield for a startup, patents are a weapon that causes economic waste.

It's hard for a big company to just steal a software idea. Being big just isn't the advantage in the software industry that it was in our sewing machine example. Big companies don't have the same production and distribution advantages, since the internet makes it cost practically nothing to distribute software. Furthermore, it's not that easy to just copy a software product. Look at what happened with Google Buzz.

At my company BackType, we're doing a lot of innovative things with Big Data systems. Rather than try to patent our ideas and achieve an exclusive monopoly on what we invent, we're doing the opposite. We're sharing these inventions with the world by open-sourcing them. We do this because it helps so much with recruiting: it establishes us as a serious technology company, and programmers want to work at companies where they can contribute to open source.

Vivek Wadhwa has a good post showing the stats on how counter-productive patents are in the software industry.

Since massive amounts of software innovation occur in spite of the patent system, software patents are irrational.

The openness argument

Another purported reason for the existence of patents is that they encourage inventions to be shared rather than kept secret. This ensures the invention enters the public domain and prevents the invention from ever being lost.

This argument doesn't hold with software. Just look at the facts:

  • The academic community publishes its innovations to the public.
  • There is a massive and rapidly growing amount of innovative open source software.
  • Companies have strong incentives to participate in open source.

When I'm looking for innovative software approaches, I search the Internet or I look at research papers. I never look at software patents, and I don't know anyone in the software industry who would.


The fashion industry is an excellent example of an industry that has no patents and thrives.

Even non-obvious and non-trivial software ideas should not be patentable, because the promise of a patent is not necessary for innovation in software. The economics are clear: software patents should be abolished.

You should follow me on Twitter here.


Cascalog workshop

I'll be teaching a Cascalog workshop on February 19th at BackType HQ in Union Square. You can sign up now; early bird tickets are available until January 31st.

I'm very excited to be teaching this workshop. Cascalog's tight integration with Clojure opens up a world of techniques that no other data processing tool supports. Even though I created Cascalog, I've been discovering many of these techniques as I've made use of Cascalog for more and more varied tasks. Along the way, I've tweaked Cascalog so that making use of these techniques would be cleaner and more idiomatic. At this point, after nine months of iteration, Cascalog is a joy to use for even the most complex tasks. I'm excited to impart this knowledge to others in this workshop.


Analysis of the #LessAmbitiousMovies Twitter Meme

We did a fun post on the BackType blog today analyzing a meme that took off on Twitter this week. A person with about 500 followers started the meme that eventually reached more than 27 million people. Check out our analysis here, and you can check out TechCrunch coverage of our analysis here.

Doing the analysis was relatively simple. We extracted an 80 MB dataset of the tweets involved in the meme from our 25 TB social dataset. We downloaded that data to a local computer and ran queries on the data from a Clojure REPL using Cascalog. Doing the data analysis only took us a couple hours.


How to reject a job candidate without being an asshole

I used to send a job candidate an email like the following after the person failed a phone interview:

Hi [Candidate],

Thanks a lot for taking the time to interview with us. However, we've decided not to move forward in the process with you. I wish you the best of luck on your future projects.


Rejection emails like that are cold, impersonal, and hollow. It doesn't feel good to send an email like that, and it sure doesn't feel good to receive an email like that.

You want a candidate to feel good about your company even after failing the interview process. You don't want the candidate to discourage their friends from interviewing with your company, and ideally you want the candidate to refer their friends to your company.

Now I tell candidates on the spot whether they pass or fail at the end of the phone interview. I give them feedback on what they did well and what they did poorly. I'm very candid with them.

The first time I tried this, I had butterflies in my stomach. I thought to myself, "What if he gets angry? What if he freaks out on me and starts screaming obscenities?" It turns out my fears were completely unfounded.

I've found that candidates really appreciate my candor about their rejection. They appreciate being told why they failed, especially because everyone else is so impersonal and non-specific about rejections.

When I reject a candidate, I lead with what I liked about the person. For example, I might tell the candidate how I thought the project he described was interesting. I find that leading with an honest compliment softens the blow of the impending criticism. I haven't yet been at a loss for something nice to say to a candidate, and it would take a rare candidate for that to happen.

Next, I tell the candidate why we can't move forward in the process with them. I tell the candidate what we're looking for and why I think they're not qualified. Let's face it: candidates know when they didn't do a good job. They know if they bombed the coding question you gave them. I tell the candidate that I know it's possible that we're making a mistake, that coding questions aren't always the most accurate indicators, but that it's the decision we have to make based on the results of the interview.

Finally, I give candidates a chance to give me feedback. I ask what they thought about our interview process, and I ask what I could do better as an interviewer. I've gotten some valuable feedback this way, and I'm becoming a better interviewer because of the feedback. Plus, giving candidates the opportunity to criticize me makes the process feel a lot less one-sided.

I'm really happy about my decision to reject candidates on the spot at the end of phone interviews rather than with impersonal emails. I don't feel like an asshole anymore, and candidates seem to appreciate the feedback.



You Are a Product

I had a revelation the other day. I realized that the terms "programmer" and "employee" are inadequate to describe what I am. What I am is a product, and you are one too. If you want to develop your career, you need to approach your career as a product development problem.

You sell yourself for various things: money, status, the opportunity to work on interesting problems, good coworkers, etc. In this post I'll be referring to this as "getting paid", but please keep in mind that "getting paid" means more than just money.

Supply and Demand

Like any product, you have supply and demand. Your supply is what you can do for a company that hires you. It's your ability to make beautiful websites. It's your ability to scale a database. It's your ability to get the best work out of others. Your supply is the actual value you will provide to a company that hires you.

Your demand is what companies think you can do for them. Your demand is your perceived value by others. At the end of the day, you will be paid according to how you're perceived, not according to the actual value you can produce. This is why so many 10x engineers don't actually get paid 10x -- they're not publicly perceived as 10x engineers, so normal market forces are unable to bid up their value.

I see way too many people say to themselves "As long as I just put out quality work, I'll be taken care of." This is bullshit. This way of thinking prevents you from reaching your potential. It prevents you from being paid what you should be getting paid, and it prevents you from bettering your status. You cannot just focus on your supply. Supply is only 50% of the equation. You could be the greatest programmer to ever live, but if no one knows that, it won't help you. You are a product, and if you want to get paid appropriately, you have to work on your demand.

Personal branding

Influencing your demand is called "personal branding." It's marketing. Your actual value -- your supply -- is important to the extent that you can use it to raise your perceived value -- your demand.

Personal branding is inherently a public activity. Market forces rely on information being public. You want lots of people to believe that you can provide them with lots of value. This will lead to opportunities for you. Many of these opportunities will be out of the blue and unexpected.

There are lots of things you can do to increase your demand. Start a blog and promote it through Twitter and social news sites. Speak at conferences. Build social proof by building up your Twitter follower list. Participate in open source projects and write blog posts about the work you're doing on the projects.

I think open source is the best activity a programmer can engage in. It makes public your actual ability to solve problems and write code. You should strongly prefer to work at companies with a culture of making and contributing to open source projects, as that gives you the opportunity to market yourself.

I think the best personal branding activities are rooted in the actual value you can provide to others. There are other activities that can increase your demand, like taking credit for the work of others, that are flat out unethical. Don't be a product that's just smoke and mirrors.

Marketing yourself takes work, but it's something that gets easier with practice. It would be stupid to release a product to market without promoting and marketing it. Likewise, you should treat yourself as a product and market yourself as such. When you do, you can watch the forces of supply and demand work their magic.



The time I hacked my high school

When I was in high school, I started the Chess Club. I needed money to buy chess sets and chess clocks to get the club going, but at first I had some difficulty raising cash.

Then I hacked the system, and Chess Club became a cash generating machine.

Before the hack

Clubs made money by reselling burritos or pizza from nearby restaurants during lunch. Each of these lunch sales typically made about $100 in profit. Since I didn't want to charge dues for the club, I needed lunch sales to raise money for Chess Club.

Unfortunately, the rules around lunch sales were restrictive. Only one club could sell per week, and other clubs like the Science Club had a much stronger precedent for needing lunch sales. Without a precedent for needing money, I was unable to acquire enough lunch sale dates.

The hack

I studied the rules for operating clubs on campus and found the loophole I needed: clubs were allowed to go into debt to the student government for $200. I figured that if I were in debt to the student government, I'd have more leverage in getting lunch sale dates.

I immediately spent $200 on chess boards, chess clocks, and books. I bought more than we needed because I wanted to maximize our debt. Then I went to the student government treasurer, gave him the receipt, and was reimbursed for the expense.

The student government wasn't too happy about the situation. They wanted me to pay them back as they were on a tight budget. I told them I couldn't raise money because they wouldn't give me lunch sale dates.


They relented and started giving me lunch sale dates so that I could pay them back. Even though we made $100 per lunch sale, I only paid them back $50 at a time to maximize the time we were in debt. Soon afterwards, the student government relaxed the rules to let clubs have lunch sales more days per week.

Since Chess Club now had a precedent for needing money, I was able to get plenty of slots for lunch sales. The student government clearly didn't think about why we needed so much money (we didn't need so much money). They relied on the "precedent for needing money" heuristic and may have been afraid Chess Club would go into debt again.

After the hack

Chess Club was swimming in money. I stocked the school's library with chess sets. I upgraded the club's chess sets to a mixture of glass and wooden sets. I bought computerized chess sets and expanded our collection of chess books. I started holding school-wide chess tournaments with hundreds of dollars in prizes.

I couldn't spend money faster than we were bringing it in.

When I graduated, Chess Club had $500 in the bank. I considered holding one last tournament with massive prizes, but ultimately decided to leave the money for future generations of Chess Club.



Fastest Viable Product: Investing in Speed at a Startup

A startup is like a rat in a maze searching for a piece of cheese. The cheese in the startup's case is product-market fit, that pivotal point when the startup can scale and monetize the business.

In the maze, the startup has a dazzling number of choices of where to go. Should we build this new feature? Should we try this new idea we have for a product? Should we backtrack and completely change our idea?

Lean startups use a strategy called "Minimum Viable Products" to help navigate the maze. The idea is that a startup formulates hypotheses about what users want or do not want; each of these hypotheses is a "turn" in the maze. A "Minimum Viable Product" is the smallest test that will let the startup know whether their "turn" was a good one. A startup wants to stop going the wrong direction as early as possible.

A "Minimum Viable Product" can be anything from a working application to an SEO'd survey that will gauge interest in an idea. "Minimum Viable Products" have been written about extensively.

The term "Minimum Viable Product" is a misnomer

However, the term "Minimum Viable Product" is a misnomer. The real goal is to test hypotheses as fast as possible, and being minimal is just a side effect of being fast. "Fastest Viable Product" is a more appropriate name.

Click to read more ...