Wednesday
Feb 12, 2014

Interview with "Programmer Magazine"

I was recently interviewed for "Programmer Magazine", a Chinese magazine. The interview was published in Chinese, but a lot of people told me they'd like to see the English version of the interview. Due to the Google translation being, ahem, a little iffy, I decided to just publish the original English version on my blog. Hope you enjoy!

What drew you to programming and what was the first interesting program you wrote?

I started programming when I was 10 years old on my TI-82 graphing calculator. Initially I started programming because I wanted to make games on my calculator – and also because I was bored in math class :D. The first interesting game I made on my calculator was an archery game where you'd shoot arrows at moving targets. You'd get points for hitting more targets or completing all the targets faster. A couple years later I graduated to programming the TI-89 which was a huge upgrade in power. I remember how the TI-82 only let you have 26 variables (for the characters 'a' through 'z') and thinking how incredible it was that the TI-89 let you have as many variables as you want.

What do you do to improve your skills as a programmer?

I get better by doing a lot of programming and trying new things. One of the best ways to become a better programmer is to learn new programming languages. By learn I mean more than just learning the syntax of the language, I mean understanding the language's idioms and writing something substantial in it. For me, learning Clojure made me a much better programmer in all languages.

Could you talk about your experience before joining BackType?

I got my bachelor's and master's in Computer Science at Stanford University with a focus on software theory. So I did a lot of algorithms and proofs and so on. Probably the best thing I did at Stanford was choose classes not so much by the subject material but by the professor. When I found a professor who was a great teacher I would take as many classes with that professor as possible. For example, one of the greatest teachers I've ever had is Professor Tim Roughgarden. I took a bunch of "algorithmic game theory" classes with him – algorithmic game theory is basically the intersection of economics and computer science. I took the classes not so much for the material but to improve my problem solving skills. Professor Roughgarden had an incredibly coherent and disciplined way of breaking down extremely difficult problems and making them easy to understand. Learning those skills has made me a much better problem solver in all scenarios, as well as being a much better communicator of difficult concepts.

You once said that leaving Twitter was a tough decision. Could you please tell us why you decided to start your own company? What objective do you want to achieve?

I had a pretty great situation at Twitter, having my own team and working full-time on a project I started. But when I thought of the idea for my company, it was so compelling I just couldn't stop thinking about it. So I felt that if I didn't start this company, I would regret it for the rest of my life.

What are the main lessons you learned in the last few years of your professional career?

Feedback is everything. Most of the time you're wrong, and feedback is the only way to realize your mistakes and help you become less wrong. This applies to everything. In product development, get your product out there as soon as possible so you can get feedback and see what works and what doesn't. In many cases you don't even need to build anything – a link to a "feature" that actually goes to a survey page can give you the feedback you need to test your idea.

In managing a team, it's really important to have feedback on all the processes you do. At BackType we'd have a once-a-month meeting to discuss our processes and whether they were effective or too restrictive. This caused us to introduce standups, and then remove standups when we didn't feel they were that useful to us. We used that process to go from monthly meetings to biweekly meetings to weekly meetings, then back to biweekly meetings.

In your blog you said, "I'm always happy to give advice or connect with people doing interesting work." What interesting projects have you seen, and what suggestions did you provide?

The founder of Insight Data Science approached me when he was starting it, and I think it's an absolutely terrific program. They provide a 6 week bootcamp to help math/science/physics PhD's learn programming skills so they can start a career in data science. Basically the program recognizes that there is a surplus of very smart people who don't necessarily have the most interesting job prospects, while there is a booming tech industry with a huge talent shortage of data scientists. So they bridge that gap. I was able to help them out with a couple things and I think their execution has been very impressive.

What prompted you to write the book Big Data, and what problems do you want to solve? Writing a book is a long process; what have you learned during it?

I had developed a lot of theory and best practices about architecting big data systems that no one else was talking about. People were focused on very specific use cases, whereas I had developed rigorous, holistic approaches. A lot of the things I talk about, like being resilient to human error (something I consider to be absolutely non-negotiable) are ignored by the vast majority of industry. I think the industry will be much better off by building these systems more rigorously and less haphazardly, and I felt that this book was the right way to effect that change.

I knew that writing a book would be a lot of work, but it turned out to be significantly more work than I expected. I think my book is especially challenging because it's such a huge subject. At one point I had half the book written, but I realized I was taking the wrong approach to communicating the material so I scrapped everything and started over. It was definitely worth it though, because based on the feedback I get from readers, they love the material and really get what I'm trying to communicate.

My editors have been absolutely invaluable in the writing process and have helped me become a much better writer. I've learned that the way I was taught in school to write is actually the complete opposite of effective communication. I was taught to make your general "thesis" statement up front, and then drill down into that general statement with supporting points and eventually specific details. It turns out that this forces the reader to do a lot of work to synthesize everything you're saying. They won't grasp the thesis up front – because they haven't read the supporting points yet. So after drilling down the reader now has to drill back "up" to connect everything. It's a convoluted way to achieve an understanding of something. A much better way to communicate is to tell a story – start with a situation the reader already understands, and then connect step by step to the ultimate general statement you want your reader to understand. Specific to general is always better than general to specific.

You have contributed a lot of open source projects. What makes you believe in open source?

Open source benefits so many people in so many ways. When you're a startup, you're highly resource-constrained, so being able to take advantage of the work other people have done is a godsend. Lowering the cost of doing startups, of course, is highly beneficial to society. When you benefit that much from open source, you do feel obligated to give back as well. On top of that, when you open source software as a company, you benefit from other people trying it out, finding issues, and improving your software "for free".

On a personal level, open source has given me an opportunity to interact with an entire world of developers, rather than just those in whatever company I happened to be at. This has been hugely beneficial to my career, allowing me to get to know tons of awesome people and travel the world to speak at conferences.

Which person has influenced you the most?

Philosophically I'd have to say the most influential person to me is Carl Sagan. I've read most of his books and find them hugely inspirational. I think he was one of the greatest communicators of all time, and what impresses me most about him is his extreme empathy towards his audience. For example, he has quite a bit of writing about science vs. religion – but as a scientist he is not hostile towards religion or anything like that. He understands why people are religious and the value they get out of it. So when he communicates the value of science and skepticism to religious people he starts with religion = valuable as a starting point. That degree of empathy is really rare, and it's something that I'm continuously trying to improve at. He taught me that empathy is the basis of good communication.

What were the key points John McCarthy told you about his life and perspective? How have his words affected your life, and what is your own perspective?

I talked with John McCarthy for two hours when I was a sophomore in college. The most striking thing he told me was when I asked about the history of Lisp. He told me he needed a better programming language for doing AI research, so he invented Lisp for that purpose. He really didn't seem to care that much about programming languages – his real passion was AI. It struck me as exactly like how Isaac Newton invented calculus because he needed it for his physics work. The pebbles of giants really are big boulders.

When designing a software system, what process do you use? (the first step, the second step...)

I think designing a software system is entirely about learning what to build as you go. I use a technique which I call "suffering-oriented programming" in order to maximize learning and minimize wasted work. I detailed this approach on my blog. The general idea is to avoid making "general" or "extensible" solutions until you have a very deep understanding of the problem domain. Instead you should hack things out very directly to get a working prototype as fast as possible. Then you iterate and evolve and learn more about the problem domain. Once you have a good understanding of the intricacies of the problem domain, then you can redesign your solution to make it more general, extensible, etc. Finally, at the end, you wrap things up by tightening up the code and making performance optimizations. The sequence is "First make it possible. Then make it beautiful. Then make it fast."

Do you have any principles when you program?

I believe strongly in immutability and referentially transparent functions as ways to vastly simplify software. Mutability creates a web of dependencies in your code – things that can change other things which then change other things – which gets hard to wrap your head around. Code is all about being able to understand what's going on, so anything you do to make that easier is a good thing. Immutability is one such technique to reduce what you need to understand about a particular piece of code to grasp it. Additionally, referentially transparent functions only depend on their arguments (and not on any other state), so they are also easier to understand.
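To make that concrete, here's a minimal sketch in Python (illustrative only, not code from any of my projects) contrasting a function that leans on mutable state with a referentially transparent one:

    # A function that depends on hidden, mutable state: its output depends on
    # everything that has happened to `running_total` before the call.
    running_total = 0

    def add_to_total(x):
        global running_total
        running_total += x
        return running_total

    # A referentially transparent alternative: the result depends only on the
    # arguments, so each call can be understood and tested in isolation.
    def add(total, x):
        return total + x

    # Instead of mutating shared state, each step produces a new value.
    total = 0
    for x in [1, 2, 3]:
        total = add(total, x)
    print(total)  # 6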

Another important principle I live by is "my code is wrong". I think it's pretty clear that we don't know how to make perfect software – all the code I've ever used or written has had bugs in it. So I assume my code is wrong and design it to work anyway (with higher probability, at least). I've detailed techniques to accomplish this on my blog and in my conference talks this year.

Compared with your early years, what is the biggest change in how you program today?

Since I started off programming graphing calculators, I'd say the biggest change is using full-fledged keyboards to program instead of those tiny keypads :)

Storm was developed in a very short time, by a few developers, under a limited budget and urgent requirements. What is your secret to being so efficient?

Storm is the result of following that "suffering-oriented programming" methodology. We didn't jump into Storm out of the blue – we had been doing realtime computation at BackType for a long time by stringing workers together manually with queues. So we had a really solid understanding of our needs for realtime processing. Storm at its core is just a simple set of abstractions with a clever algorithm behind the scenes to guarantee data processing. When I thought long and hard about the realtime problems we were dealing with at the time, the design of Storm was obvious. Additionally, I had a ton of experience with Hadoop and knew some of the mistakes made in the design of that system, so I applied that learning into Storm to make it more robust.

You have interviewed a lot of programmers. What do you think the best programmers have in common?

The best programmers are obsessed with improving as programmers. They love exploring new programming languages and new ideas. Another key trait of great programmers is a "getting stuff done" mentality. It's far more important to get something working than to make the perfect design. Plus a great programmer recognizes that you can't make a perfect design without first having something working that you can learn from.

Are there any myths (things laymen believe are right, but experts do not) or traps in data systems and Big Data?

Probably the biggest misconception I see is people placing the relational database, and associated concepts like CRUD, on a pedestal. People treat the RDBMS as if it's the ultimate in database technology, and everyone seems to be trying to recreate the RDBMS to work in Big Data land. But this ignores massive problems that have always existed with the RDBMS: they're based on mutability so are extremely susceptible to corruption whenever there's a bug or a human error, and they force you into a horrible situation of needing to either normalize your schema and take performance hits, or denormalize your schema and create a maintenance nightmare (among other problems). When you actually look at data systems from first principles, as I do in my book, you see that there are different ways of architecting data systems that have none of these complexities.

What problem were you trying to solve at BackType that led you to decide to design Storm?

There were two problems. The first was how to keep our databases containing social media analytics stats up to date in realtime in a reliable way. The second was the "reach problem" – how to compute the "reach" of a URL on Twitter very quickly. The "reach" is the unique count of all the followers of all the people who tweeted a URL. It's very computationally intensive and hard to precompute. Storm turned out to be a simple abstraction which unified these seemingly unrelated use cases.
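As a rough illustration of why reach is so expensive (a toy sketch with made-up in-memory data, not BackType's actual code), the core of the computation is a giant distinct-count over follower sets:

    # Hypothetical in-memory stand-ins for the real data stores. In production,
    # these lookups are hundreds of database calls over huge follower lists.
    TWEETERS = {"http://example.com": ["alice", "bob"]}
    FOLLOWERS = {"alice": {"carol", "dave"}, "bob": {"dave", "erin"}}

    def reach(url):
        # Union all the follower sets, then count distinct users. For a popular
        # URL this set can contain tens of millions of entries, which is why a
        # single-machine implementation is too slow.
        exposed = set()
        for user in TWEETERS.get(url, []):
            exposed.update(FOLLOWERS.get(user, set()))
        return len(exposed)

    print(reach("http://example.com"))  # 3, since dave is only counted once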

What reason or experience made you sure that you would successfully build Storm?

The key was that we had tons of experience with realtime computation so knew the problem domain very well. So there was really no question in my mind that Storm would be successful because I had already learned a majority of the little gotchas.

Why did you choose Clojure as the development language for Storm? Could you talk about your practical experience with the language (its advantages and disadvantages)? Which features would be missing from Storm if you had not used Clojure?

Clojure is the best language I've ever used, by far. I use it because it makes me vastly more productive by allowing me to easily use techniques like immutability and functional programming. Its dynamic, Lisp-based nature ensures that I can always mold Clojure as necessary to formulate the best possible abstractions. Storm would not be any different if I hadn't used Clojure; it just would have been far more painful to build.

From your blog, I saw that you advocate a lot for writing. Could you share with us what you do to improve your writing skills?

The only way to improve at writing is to write a lot. When other people read my writing and give me feedback, like by commenting on my blog, I carefully think about where that comment came from. If they misunderstood something then that means I'm not communicating correctly – either I'm not clear or I'm not properly anticipating reader objections (whether or not those objections are fallacious is irrelevant). By understanding why my message doesn't get through, I'm able to do a better job the next time.

I also read a lot and try to learn from great writers. As I've mentioned, Carl Sagan is one of my favorite writers and I've learned tons from reading him – and I continue to learn tons from him every time I read his work.

You started using Emacs in recent years. Could you talk about the programming tools you choose and how they impact you?

I started using Emacs because I found it to be the best environment for programming Clojure (due to its Lisp background). I've been really impressed with how powerful of a tool it is and how much it can be customized to my needs. On top of that, since it was originally written so long ago, it has an incredibly small resource footprint. That's something I really enjoy because modern IDE's tend to be such resource hogs.

Other than that, I think my setup is pretty simplistic. I use a live REPL in my Emacs for exploratory development and interactive testing. I also have tons of text files on my computer with design notes and ideas. For my todo list I literally just use a text file.

Friday
Apr 12, 2013

Break into Silicon Valley with a blog

I know a lot of non-technical people who would love to work in the venture-funded startup world, from consultants to finance people to other business types for which I'm not really sure exactly what it is they do. They hit obstacles trying to get into the startup world, finding that their skills are either irrelevant or hard to explain. My advice to all these people is the same:

Write a blog.

A blog can improve your life in enormous ways. Or to put it in business-speak: a blog has one of the highest ROI's of anything you can do.

Put yourself in the shoes of startups looking for talent. First off – startups are desperate for talent. The problem is that it's very difficult to identify great people – startups search through loads and loads of candidates.

Resumes and interviews only tell you so much about a person. It's really hard to stand out in a resume – you're not the only one putting over-inflated impressive-looking numbers and bullet points on your resume. And interviews are notorious for labeling bad people as good and good people as bad. So to maximize your odds of making it through the funnel, you need to show that you're awesome independent from the randomness of the normal process.

One thing you can do is write an insightful blog. This makes you look a lot more compelling. Now the reaction from startups will be "Hey, this person's really smart. We don't want to miss out on a potentially great hire, so let's put in a lot of effort to determine if there's a good fit."

A new dimension of opportunity

There's another huge advantage to having a blog besides being a mechanism to show that you're smart and insightful. A blog opens up a whole new dimension of opportunity for you. Instead of relying purely on outbound opportunities that you explicitly seek out yourself, you also will get inbound opportunities where people reach out to you with opportunities you never expected or dreamed of.

With an outbound opportunity you know exactly what you're seeking, whether it's landing a job or speaking at a conference or something else. Inbound opportunities, on the other hand, are highly uncertain. They come to you out of the blue. In my personal experience, many of the most awesome things I've done started as inbound opportunities: a book deal, flying all around the world for free to speak at conferences, a keynote at a major conference, and connecting with hundreds of awesome people who have reached out to me because of something I did or wrote publicly.

When you write a blog, you greatly increase the likelihood of getting awesome inbound opportunities. When it comes to breaking into Silicon Valley – instead of everything being on your shoulders to seek interesting companies, those companies will be reaching out to you.

A great phrase I've heard for this is "increasing your luck surface area". By providing value to people publicly, like writing insightful posts on a blog, you open yourself up to serendipitous, "lucky" opportunities.

Getting readers

Besides writing smart posts, you also need people to read your writing. Here's a few tips for accomplishing that.

First off, the title of a blog post is incredibly important. In very few words, you need to sell your potential reader on the idea that your blog post is going to be worth their time. I've found the best titles are relevant to the potential reader, somewhat mysterious, and non-generic. Titles are definitely an art form, so you should think hard about how you'll name your posts. Sometimes I wait days to publish a post because I haven't thought of a compelling enough title.

Second, I highly recommend using Twitter as a distribution platform for your blog posts. The combination of Twitter and blogging leads to a beautifully virtuous cycle: your blog increases your Twitter following, and as your Twitter following grows you increase the reach of your blog. I consider Twitter to be the greatest professional networking tool ever devised – I follow people who tweet/blog interesting things and they follow me for the same reason. Then when I go to conferences I seek out the people who I know and respect from their online presence. When we meet, we already know a lot about each other and have a lot to talk about.

Lastly, you should embrace the online communities who will care about your blog. In Silicon Valley, the most important community is Hacker News. Hacker News is widely read in Silicon Valley by programmers, entrepreneurs, and investors. It can drive a lot of readers to your blog in a short amount of time.

Initially, it may be hard for you to get readers. Getting your posts on Hacker News is very much a crapshoot, and initially you'll have too small of a Twitter following to get that much distribution. But occasionally you'll write something smart that gets on Hacker News and gets shared around. Over time as your writing and distribution improves getting readers gets easier and easier.

What to write about

If you don't think you have anything to write about, then let me ask you a question. Do you really have that low of an opinion of yourself? Do you really think you have nothing interesting that you can share with the world? There's tons of stuff that you can write about that you don't even know to share. You have a ton of knowledge that you don't realize other people don't know because you spend all your time in your own head. Tell stories of times that you hustled. Write about the dynamics of big companies. Write case studies of anything related to running a business. Analyze the market for interesting new technologies (e.g. 3D printing, Bitcoin, etc). There's so much that you can write about.

Once you start blogging, you'll become attuned to random ideas you have throughout the day that would make good blog posts. Most of my blog ideas start off as email reminders to myself.

Final thoughts

If you haven't blogged before, you're going to suck at first. Being accurate, precise, and insightful is not enough. You have to learn how to hook people into your posts and keep the post engaging. You'll learn about the glorious world of internet commenting, where people constantly misinterpret what you say and apply very fallacious reasoning to your posts. You'll see people trash your ideas on Hacker News even though it's clear they didn't read your entire post. Sometimes they comment having only read the title! You'll learn over time different ways to structure the same information in order to minimize misinterpretation. You'll learn to anticipate fallacious reasoning and preemptively address those fallacies.

With writing, practice most definitely makes perfect. I sucked at writing at first, but I quickly improved.

A lot of people say they "don't have time to write." To be blunt, I think comments like this are the result of laziness and self-delusion. Writing a blog is really not that much work. You really can't find a couple hours to pump out a blog post? Just occasionally, instead of going out to the bar or seeing a movie or going surfing or doing whatever it is you do for fun, try writing. The potential benefits relative to the investment are MASSIVE. I haven't even discussed all the other benefits which on their own make blogging worthwhile.

Of course, writing isn't the only thing you can do to help yourself break into Silicon Valley. But it's an enormously easy way to make yourself stand out and open yourself to opportunities you never expected.

You should follow me on Twitter here.

Tuesday
Apr 2, 2013

Principles of Software Engineering, Part 1

This is the first in a series of posts on the principles of software engineering. There's far more to software engineering than just "making computers do stuff" – while that phrase is accurate, it does not come close to describing what's involved in making robust, reliable software. I will use my experience building large scale systems to inform a first principles approach to defining what it is we do – or should be doing – as software engineers. I'm not interested in tired debates like dynamic vs. static languages – instead, I intend to explore the really core aspects of software engineering.

The first order of business is to define what software engineering even is in the first place. Software engineering is the construction of software that produces some desired output for some range of inputs. The inputs to software are more than just method parameters: they include the hardware on which it's running, the rate at which it receives data, and anything else that influences the operation of the software. Likewise, the output of software is more than just the data it emits and includes performance metrics like latency.

I think there's a distinction between programming a computer and software engineering. Programming is a deterministic undertaking: I give a computer a set of instructions and it executes those instructions. Software engineering is different. One of the most important realizations I've had is that while software is deterministic, you can't treat it as deterministic in any sort of practical sense if you want to build robust software.

Here's an anecdote that, while simple, hits on a lot of what software engineering is really about. At Twitter my team operated a Storm cluster used by many teams throughout the company for production workloads. Storm depends on Zookeeper to store various pieces of state relating to Storm's operation. One of the pieces of state stored is information about recent errors in application workers. This information feeds a Storm UI which users look at to see if their applications have any errors in them (the UI also showed other things such as statistics of running applications). Whenever an error bubbles out of application code in a worker, Storm automatically reports that error into Zookeeper. If a user is suppressing the error in their application code, they can call a "reportError" method to manually add that error information into Zookeeper.

There was a serious omission in this design: we did not properly consider how that reportError method might be abused. One day we suddenly received a flood of alerts for the Storm cluster. The cluster was having serious problems and no one's application was running properly. Workers were constantly crashing and restarting.

All the errors were Zookeeper related. I looked at the metrics for Zookeeper and saw it was completely overloaded with traffic. It was very strange and I didn't know what could possibly be overloading it like that. I took a look at which Zookeeper nodes were receiving the most API calls, and it turned out almost all the traffic was coming to the set of nodes used to store errors for one particular application running on the cluster. I shut that application down and the cluster immediately went back to normal.

The question now was why that application was reporting so many errors. I took a closer look at the application and discovered that all the errors being reported were null pointer exceptions – a user had submitted an application with a bug in it causing it to throw that exception for every input tuple. In addition, the application was catching every exception, suppressing it, and manually calling reportError. This was causing reportError to be called at the same rate at which tuples were being received – which was a lot.

An unfortunate interaction between two mistakes led to a major failure of a production system. First, a user deployed buggy, sloppy code to the cluster. Second, the reportError method had an assumption in it that errors were rare and thereby the amount of traffic to that method would be inconsequential. The user's buggy code broke that assumption, overloading Zookeeper and causing a cascading failure that took down every other application on the cluster. We fixed the problem by throttling the rate at which errors could be reported to Zookeeper: errors reported beyond that rate would be logged locally but not written to Zookeeper. This made reportError robust to high traffic and eliminated the possibility for cascading failure due to abuse of that functionality.
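Here's a minimal sketch (in Python, illustrative only and not the actual Storm fix) of the kind of self-throttling described above: errors over the limit are logged locally instead of being written to the central store.

    import time

    class ThrottledErrorReporter:
        """Write at most `max_per_window` errors to the central store (e.g.
        Zookeeper) per time window; anything beyond that is only logged
        locally, so a buggy caller can't overload shared infrastructure."""

        def __init__(self, max_per_window=20, window_secs=10.0):
            self.max_per_window = max_per_window
            self.window_secs = window_secs
            self.window_start = time.monotonic()
            self.count = 0

        def report_error(self, error):
            now = time.monotonic()
            if now - self.window_start > self.window_secs:
                self.window_start = now
                self.count = 0
            self.count += 1
            if self.count <= self.max_per_window:
                self._write_to_central_store(error)
            else:
                self._log_locally(error)

        def _write_to_central_store(self, error):
            print("reported to central store:", error)  # stand-in for the real write

        def _log_locally(self, error):
            print("logged locally only:", error)  # cheap, can't overload anything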

As this story illustrates, there's a lot of uncertainty in software engineering. You think your code is correct – yet it still has bugs in it. How your software is actually used differs from the model in your head when you wrote the code. You made all sorts of assumptions while writing the software, some of which are broken in practice. Your dependencies, which you use as a black box, randomly fail due to a misunderstanding of their functional input range. The most salient feature of software engineering is the degree to which uncertainty permeates every aspect of the construction of software, from designing it to implementing it to operating it in a production environment.

Learning from other fields of engineering

It's useful to look at other forms of engineering to learn more about software engineering. Take bridge engineering, for example. The output of a bridge is a stable platform for crossing a chasm. Even though a bridge is a static structure, there are many inputs: the weight of the vehicles crossing, wind, rain, snow, the occasional earthquake, and so on. A bridge is engineered to operate correctly under certain ranges of those inputs. There's always some magnitude of an earthquake for which a bridge will not survive, and that's deemed okay because that's such a low probability event. Likewise, most bridges won't survive being hit by a missile.

Software is similar. Software operates correctly only within a certain range of inputs. Outside those inputs, it won't operate correctly, whether it's failure, security holes, or just poor performance. In my Zookeeper example, the Zookeeper cluster was hit with more traffic than it could handle, leading to application failure. Similarly, a distributed database can only handle so many hardware failures in a short amount of time before failing in some respect, like losing data. That's fine though, because you tune the replication factor until the probability of such an event is low enough.
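As a back-of-the-envelope illustration (the numbers here are made up): if each replica fails independently with probability p during the window it takes to repair a failure, losing data requires all R replicas to fail in that window, which happens with probability roughly p^R, so raising the replication factor drives that probability down very quickly:

    # Illustrative only: assumed per-replica failure probability during the
    # repair window, and the resulting probability of losing every replica.
    p = 0.01

    for r in [1, 2, 3]:
        print(f"replication factor {r}: P(lose all replicas) ~ {p ** r:.0e}")
    # replication factor 1: P(lose all replicas) ~ 1e-02
    # replication factor 2: P(lose all replicas) ~ 1e-04
    # replication factor 3: P(lose all replicas) ~ 1e-06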

Another useful field to look at is rocket engineering. It takes a lot of failure and iteration to build a rocket that works. SpaceX, for example, had three failed rocket launch attempts before they finally reached orbit. The cause of failure was always something unexpected, some series of inputs that the engineers didn't account for. Rockets are filled to the brim with telemetry so that failures can be understood, learned from, and fixed. Each failure lets the engineers understand the input ranges to the rocket a little better and better engineer the rocket to handle a greater and greater part of the input space. A rocket is never finished – you never know when there will be some low probability series of inputs you haven't experienced yet that will lead to failure. STS-107 was the 113th launch of the Space Shuttle, yet it ended in disaster.

Software is very similar. Making software robust is an iterative process: you build and test it as best you can, but inevitably in production you'll discover new areas of the input space that lead to failure. Like rockets, it's crucial to have excellent monitoring in place so that these issues can be diagnosed. Over time, the uncertainty in the input space goes down, and software gets "hardened". SQL injection attacks and viruses are great examples of things that take advantage of software that operates incorrectly for part of its input space.

There's always going to be some part of the input space for which software fails – as an engineer you have to balance the probabilities and cost tradeoffs to determine where to draw that line. For all of your dependencies, you had better understand the input ranges for which the dependencies operate within spec, and design your software accordingly.

Sources of uncertainty in software

There are many sources of uncertainty in software. The biggest is that we just don't know how to make perfect software: bugs can and will be deployed to production. No matter how much testing you do, bugs will slip through. Because of this fact of software development, all software must be viewed as probabilistic. The code you write only has some probability of being correct for all inputs. Sometimes seemingly minor failures will interact in ways that lead to much greater failures like in my Zookeeper example.

Another source of uncertainty is the fact that humans are involved in running software in production. Humans make mistakes – almost every software engineer has accidentally deleted data from a database at some point. I've also experienced many episodes where an engineer accidentally launched a program that overloaded a critical internal service, leading to cascading failures.

Another source of uncertainty is what functionality your software should even have – very rarely are the specs fully understood and fleshed out from the get go. Instead you have to learn as you go, iterate, and refine. This has huge implications on how software should be constructed and creates tension between the desire to create reusable components and the desire to avoid wasted work.

There's uncertainty in all the dependencies you're using. Your dependencies will have bugs in them or will have unexpected behavior for certain inputs. The first time I hit a file handle limit error on Linux is an example of not understanding the limits of a dependency.

Finally, another big source of uncertainty is not understanding the range of inputs your software will see in production. This leads to anything from incorrect functionality to poor performance to security holes like injection or denial of service attacks.

This is by no means an exhaustive overview of sources of uncertainty in software, but it's clear that uncertainty permeates all of the software engineering process.

Engineering for uncertainty

You can do a much better job building robust software by being cognizant of the uncertain nature of software. I've learned many techniques over the years on how to design software better given the inherent uncertainties. I think these techniques should be part of the bread and butter skills for any software engineer, but I think too many engineers fall under the "software is deterministic" reasoning trap and fail to account for the implications of unexpected events happening in production.

Minimize dependencies

One technique for making software more robust is to minimize what your software depends on – the less that can go wrong, the less that will go wrong. Minimizing dependencies is more nuanced than just not depending on System X or System Y, but also includes minimizing dependencies on features of systems you are using.

Storm's usage of Zookeeper is a good example of this. The location of all workers throughout the cluster is stored in Zookeeper. When a worker gets reassigned, other workers must discover the new location as quickly as possible so that they can send messages to the correct place. There are two ways for workers to do this discovery, either via the pull method or the push method. In the pull method, workers periodically poll Zookeeper to get the updated worker locations. In the push method, a Zookeeper feature called "watches" is used for Zookeeper to send the information to all workers whenever the locations change. The push method immediately propagates the information, making it faster than the pull method, but it introduces a dependency on another feature of Zookeeper.

Storm uses both methods to propagate the worker location information. Every few seconds, Storm polls for updated worker information. In addition to this, Storm uses Zookeeper watches as an optimization to try to get the location information as fast as possible. This design ensures that even if the Zookeeper watch feature fails to work, a worker will still get the correct location information (albeit a bit slower in that particular instance). So Storm is able to take advantage of the watch feature without being fundamentally dependent on it. Most of the time the watch feature will work correctly and information will propagate quickly, but in the case that watches fail Storm will still work. This design turned out to be farsighted, as there was a serious bug in watches that would have affected Storm.
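A sketch of that dual structure (in Python, with a hypothetical `zk` client object standing in for the real Zookeeper API, since this isn't Storm's actual code):

    import threading

    POLL_INTERVAL_SECS = 5

    class LocationTracker:
        """Track worker locations using both periodic polling (the guaranteed
        path) and change notifications (the fast path). `zk` is a hypothetical
        client exposing read_locations() and on_locations_changed(callback)."""

        def __init__(self, zk):
            self.zk = zk
            self.locations = {}
            # Fast path: refresh immediately when notified. If notifications
            # break, correctness is unaffected; only latency suffers.
            self.zk.on_locations_changed(self.refresh)
            self._poll()

        def refresh(self):
            self.locations = self.zk.read_locations()

        def _poll(self):
            # Guaranteed path: re-read every few seconds no matter what.
            self.refresh()
            timer = threading.Timer(POLL_INTERVAL_SECS, self._poll)
            timer.daemon = True
            timer.start()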

There's always a tradeoff between minimizing dependencies and minimizing the amount of code you need to produce to implement your application. In this case, the dual approach to location propagation was a good one because it was a very small amount of code to achieve independence from that feature. On the other hand, removing Zookeeper as a dependency completely would not have been a good idea, as replicating that functionality would have been a huge amount of work and less reliable than just using a widely-used open-source project.

Lessen probability of cascading failures

A cascading failure is one of the worst things that can happen in production – when it happens it feels like the world is falling apart. One of the most common causes of cascading failures in my experience is an accidental denial-of-service attack like in my reportError example. The ultimate cause in these cases is a failure to respect the functional input range for components in your system. You can greatly reduce cascading failures by making interactions between components in your system explicitly respect those input ranges, using self-throttling to avoid accidental DoS'ing. This is the approach I used in my reportError example.

Another great technique for avoiding cascading failures is to isolate your components as much as possible and take away the ability for different components to affect each other. This is often easier said than done, but when possible it is a very useful technique.

Measure and monitor

When something unexpected happens in production, it's critical to have thorough monitoring in place so that you can figure out what happened. As software hardens more and more, unexpected events will get more and more infrequent and reproducing those events will become harder and harder. So when one of those unexpected events happens, you want as much data about the event as possible.

Software should be designed from the start to be monitored. I consider the monitoring aspects of software just as important as the functionality of the software itself. And everything should be measured – latencies, throughput stats, buffer sizes, and anything else relevant to the application. Monitoring is the most important defense against software's inherent uncertainty.

In the same vein, it's important to do measurements of all your components to gain an understanding of their functional input ranges. What throughputs can each component handle? How is latency affected by more traffic? How can you break those components? Doing this measurement work isn't glamorous but is essential to solid engineering.

Conclusion

Software engineering is a constant battle against uncertainty – uncertainty about your specs, uncertainty about your implementation, uncertainty about your dependencies, and uncertainty about your inputs. Recognizing and planning for these uncertainties will make your software more reliable – and make you a better engineer.

You should follow me on Twitter here.

Monday
Apr 1, 2013

My new startup

There's been a lot of speculation about what my new startup is doing, so I've decided to set the record straight and reveal all. We are working on one of the biggest problems on Earth, a problem that affects nearly every person on this planet. Our products will significantly improve the quality of life for billions of people.

We are going to revolutionize the bedsheet industry.

Think about it. There's been almost no innovation in bedsheets in thousands of years. There's nothing worse than waking up to discover one of the corners of your Egyptian cotton fitted sheets has slipped off the mattress. How is this not a solved problem yet? Why are we still using sheets with that annoying elastic in it to secure them to our mattresses? They slip all the time – and if you have a deep mattress, good luck finding sheets that even fit. You're just screwed.

Consider the impact of solving this problem, of a bedsheet product that never slips, that always stays secure on the mattress. This translates to better sleep, to less grogginess in the morning, to feeling more upbeat in the morning. This translates into fewer morning arguments between husbands and wives that spiral into divorces, child custody battles, and decades of trauma for the children.

Not only is this a big problem – it's a big opportunity. We've done extensive market research and discovered that our target market is the entire human population. At 7 billion people and an estimated average sale of $20 per sheet, this is at least a $140,000,000,000 opportunity.

We are going to solve this problem using modern, 21st century techniques and take bedsheets out of the Stone Age and into the future. We are going to make it possible to attach your sheets to your bed, completely solving the problem.

Solving this problem in a practical and cost-efficient way is not easy and will require significant engineering breakthroughs. If you're a world-class, rock star by day and ninja by night engineer who's as passionate about bedsheets as I am, please get in touch. I'd love to talk to you.

Bedsheets have been my true passion since I was a child. I'm excited to finally be focused on what I really care about, and I can't wait until the day when untucked sheets are a curious relic of the past.

Saturday
Mar 16, 2013

Leaving Twitter

Yesterday was my last day at Twitter. I left to start my own company. What I'll be working on is very exciting (though I'm keeping it secret for now).

Leaving Twitter was a tough decision. I worked with a whole bunch of great people on fascinating problems with some of the most interesting data in the world. Ultimately though, I felt that if I didn't make this move, I would regret it for the rest of my life. So I put in my papers about a month ago and then spent a month transitioning my team for my departure.

This ends an eventful three years that started with me joining BackType in January of 2010. So much has happened in these past three years. I open-sourced Cascalog, ElephantDB, and Storm, started writing a book, gave a lot of talks, and in July of 2011 experienced the thrill of being acquired. My projects spread beyond BackType and Twitter to be relied on by dozens and dozens of companies. Through all this, I learned an enormous amount about entrepreneurship, product development, marketing, recruiting, and project management.

Stay tuned.

Wednesday
Sep 19, 2012

Storm's 1st birthday

Storm was open-sourced exactly one year ago today. It's been an action-packed year for Storm, to say the least. Here's some of the exciting stuff that's happened over the past year:

  • 27 companies have publicized that they're using Storm in production. I know of at least a few more companies using it that haven't published anything yet.
  • O'Reilly published a book on Storm.
  • The Storm mailing list has over 1300 members, with over 500 messages per month.
  • The @stormprocessor account has over 1200 followers.
  • More than 4000 people have starred the project on Github.
  • There's a regular Storm meetup in the Bay Area with over 230 members. I've also seen lots of Storm-focused meetups happen all over the world over the past year.
  • 29 people all over the world have contributed to the codebase.
  • We released Trident, a high level abstraction for realtime computation, that is a major leap forward in what's possible in realtime.
  • Libraries have been released integrating Storm with Kestrel, Kafka, JMS, Cassandra, Memcached, and many more systems. For many, Storm is becoming the system of choice for connecting these systems together.
  • Storm's performance has been increased by over 10x. I've benchmarked it at 1M messages per second per node on an internal Twitter cluster.

What I overwhelmingly hear from people is that they like Storm because it's simple to understand, flexible, and extremely robust in production. These have always been some of the core design goals of Storm, so I'm glad that we were able to succeed on these points.

We've got lots of exciting stuff planned over the next year. We have a new metrics system in development which will let you get deep insight into what's happening throughout your topology in realtime. And we have big plans for improving Trident and integrating it with more datastores and input sources.

Happy birthday Storm!

Monday
Feb 6, 2012

Suffering-oriented programming

Someone asked me an interesting question the other day: "How did you justify taking such a huge risk on building Storm while working on a startup?" (Storm is a realtime computation system). I can see how from an outsider's perspective investing in such a massive project seems extremely risky for a startup. From my perspective, though, building Storm wasn't risky at all. It was challenging, but not risky.

I follow a style of development that greatly reduces the risk of big projects like Storm. I call this style "suffering-oriented programming." Suffering-oriented programming can be summarized like so: don't build technology unless you feel the pain of not having it. It applies to the big, architectural decisions as well as the smaller everyday programming decisions. Suffering-oriented programming greatly reduces risk by ensuring that you're always working on something important, and it ensures that you are well-versed in a problem space before attempting a large investment.

I have a mantra for suffering-oriented programming: "First make it possible. Then make it beautiful. Then make it fast."

First make it possible

When encountering a problem domain with which you're unfamiliar, it's a mistake to try to build a "general" or "extensible" solution right off the bat. You just don't understand the problem domain well enough to anticipate what your needs will be in the future. You'll make things generic that needn't be, adding complexity and wasting time.

It's better to just "hack things out" and be very direct about solving the problems you have at hand. This allows you to get done what you need to get done and avoid wasted work. As you're hacking things out, you'll learn more and more about the intricacies of the problem space.

The "make it possible" phase for Storm was one year of hacking out a stream processing system using queues and workers. We learned about guaranteeing data processing using an "ack" protocol. We learned to scale our realtime computations with clusters of queues and workers. We learned that sometimes you need to partition a message stream in different ways, sometimes randomly and sometimes using a hash/mod technique that makes sure the same entity always goes to the same worker.

We didn't even know we were in the "make it possible" phase. We were just focused on building our products. The pain of the queues and workers system became acute very quickly though. Scaling the queues and workers system was tedious, and the fault-tolerance was nowhere near what we wanted. It was evident that the queues and workers paradigm was not at the right level of abstraction, as most of our code had to do with routing messages and serialization and not the actual business logic we cared about.

At the same time, developing our product drove us to discover new use cases in the "realtime computation" problem space. We built a feature for our product that would compute the reach of a URL on Twitter. Reach is the number of unique people exposed to a URL on Twitter. It's a difficult computation that can require hundreds of database calls and deduplicating tens of millions of impressions just for one computation. Our original implementation that ran on a single machine would take over a minute for hard URLs, and it was clear that we needed a distributed system of some sort to parallelize the computation to make it fast.

One of the key realizations that sparked Storm was that the "reach problem" and the "stream processing" problem could be unified by a simple abstraction.

Then make it beautiful

You develop a "map" of the problem space as you explore it by hacking things out. Over time, you acquire more and more use cases within the problem domain and develop a deep understanding of the intricacies of building these systems. This deep understanding can guide the creation of "beautiful" technology to replace your existing systems, alleviate your suffering, and enable new systems/features that were too hard to build before.

The key to developing the "beautiful" solution is figuring out the simplest set of abstractions that solve the concrete use cases you already have. It's a mistake to try to anticipate use cases you don't actually have or else you'll end up overengineering your solution. As a rule of thumb, the bigger the investment you're trying to make, the deeper you need to understand the problem domain and the more diverse your use cases need to be. Otherwise you risk the second-system effect.

"Making it beautiful" is where you use your design and abstraction skills to distill the problem space into simple abstractions that can be composed together. I view the development of beautiful abstractions as similar to statistical regression: you have a set of points on a graph (your use cases) and you're looking for the simplest curve that fits those points (a set of abstractions).

The more use cases you have, the better you'll be able to find the right curve to fit those points. If you don't have enough points, you're likely to either overfit or underfit the graph, leading to wasted work and overengineering.

A big part of making it beautiful is understanding the performance and resource characteristics of the problem space. This is one of the intricacies you learn in the "making it possible" phase, and you should take advantage of that learning when designing your beautiful solution.

With Storm, I distilled the realtime computation problem domain into a small set of abstractions: streams, spouts, bolts, and topologies. I devised a new algorithm for guaranteeing data processing that eliminated the need for intermediate message brokers, the part of our system that caused the most complexity and suffering. That both stream processing and reach, two very different problems on the surface, mapped so elegantly to Storm was a strong indicator that I was onto something big.

I took additional steps to acquire more use cases for Storm and validate my designs. I canvassed other engineers to learn about the particulars of the realtime problems they were dealing with. I didn't just ask people I knew. I also tweeted out that I was working on a new realtime system and wanted to learn about other people's use cases. This led to a lot of interesting discussions that educated me more on the problem domain and validated my design ideas.

Then make it fast

Once you've built out your beautiful design, you can safely invest time in profiling and optimization. Doing optimization too early will just waste time, because you still might rethink the design. This is called premature optimization.

"Making it fast" isn't about the high level performance characteristics of a system. The understanding of those issues should have been acquired in the "make it possible" phase and designed for in the "make it beautiful" phase. "Making it fast" is about micro-optimizations and tightening up the code to be more resource efficient. So you might worry about things like asymptotic complexity in the "make it beautiful" phase and focus on the constant-time factors in the "make it fast" phase.

Rinse and repeat

Suffering-oriented programming is a continuous process. The beautiful systems you build give you new capabilities, which allow you to "make it possible" in new and deeper areas of the problem space. This feeds learning back to the technology. You often have to tweak or add to the abstractions you've already come up with to handle more and more use cases.

Storm has gone through many iterations like this. When we first started using Storm, we discovered that we needed the capability to emit multiple, independent streams from a single component. We discovered that the addition of a special kind of stream called the "direct stream" would allow Storm to process batches of tuples as a concrete unit. Recently I developed "transactional topologies" which go beyond Storm's at-least-once processing guarantee and allow exactly-once messaging semantics to be achieved for nearly arbitrary realtime computation.

By its nature, hacking things out in a problem domain you don't understand so well and constantly iterating can lead to some sloppy code. The most important characteristic of a suffering-oriented programmer is a relentless focus on refactoring. This is critical to prevent accidental complexity from sabotaging the codebase.

Conclusion

Use cases are everything in suffering-oriented programming. They're worth their weight in gold. The only way to acquire use cases is through gaining experience through hacking.

There's a certain evolution most programmers go through. You start off struggling to get things to work and have absolutely no structure to your code. Code is sloppy and copy/pasting is prevalent. Eventually you learn about the benefits of structured programming and sharing logic as much as possible. Then you learn about making generic abstractions and using encapsulation to make it easier to reason about systems. Then you become obsessed with making all your code generic, with making things extensible to future-proof your programs.

Suffering-oriented programming rejects that you can effectively anticipate needs you don't currently have. It recognizes that attempts to make things generic without a deep understanding of the problem domain will lead to complexity and waste. Designs must always be driven by real, tangible use cases.

You should follow me on Twitter here.

Monday
Jan 9, 2012

Early access edition of my book is available

The early access edition of my book Big Data: principles and best practices of scalable realtime data systems is now available from Manning! I've been working on this book for quite some time, and I'm excited to have it out there and start getting some feedback.

The interest in the book has already been overwhelming, and I've been answering questions about it on Hacker News.

Thursday
Oct 13, 2011

How to beat the CAP theorem

The CAP theorem states that a database cannot guarantee consistency, availability, and partition-tolerance at the same time. But you can't sacrifice partition-tolerance (see here and here), so you must make a tradeoff between availability and consistency. Managing this tradeoff is a central focus of the NoSQL movement.

Consistency means that after you do a successful write, future reads will always take that write into account. Availability means that you can always read and write to the system. During a partition, you can only have one of these properties.

Systems that choose consistency over availability have to deal with some awkward issues. What do you do when the database isn't available? You can try buffering writes for later, but you risk losing those writes if you lose the machine with the buffer. Also, buffering writes can be a form of inconsistency because a client thinks a write has succeeded but the write isn't in the database yet. Alternatively, you can return errors back to the client when the database is unavailable. But if you've ever used a product that told you to "try again later", you know how aggravating this can be.

The other option is choosing availability over consistency. The best consistency guarantee these systems can provide is "eventual consistency". If you use an eventually consistent database, then sometimes you'll read a different result than you just wrote. Sometimes multiple readers reading the same key at the same time will get different results. Updates may not propagate to all replicas of a value, so you end up with some replicas getting some updates and other replicas getting different updates. It is up to you to repair the value once you detect that the values have diverged. This requires tracing back the history using vector clocks and merging the updates together (called "read repair").
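To make that burden concrete, here's a minimal sketch (mine, not any particular database's implementation) of what read-repair looks like when it lands in application code: detect whether one version of a value descends from the others, and if not, hand the diverged siblings to an application-supplied merge function.

    def descends(vc_a, vc_b):
        # True if vector clock vc_a has seen everything that vc_b has seen.
        return all(vc_a.get(node, 0) >= count for node, count in vc_b.items())

    def read_repair(replica_values, merge):
        # replica_values: list of (vector_clock, value) pairs read from replicas.
        # If one version descends from all the others, it simply wins.
        for vc, val in replica_values:
            if all(descends(vc, other_vc) for other_vc, _ in replica_values):
                return vc, val
        # Otherwise the values have diverged and the application's merge
        # function must reconcile them: the error-prone part.
        merged_clock = {}
        for vc, _ in replica_values:
            for node, count in vc.items():
                merged_clock[node] = max(merged_clock.get(node, 0), count)
        return merged_clock, merge([val for _, val in replica_values])

    # Two replicas diverged; this application merges by set union.
    a = ({"n1": 2, "n2": 1}, {"alice"})
    b = ({"n1": 1, "n2": 2}, {"bob"})
    print(read_repair([a, b], merge=lambda vals: set().union(*vals)))
    # -> ({'n1': 2, 'n2': 2}, {'alice', 'bob'})  (set order may vary)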

I believe that maintaining eventual consistency in the application layer is too heavy of a burden for developers. Read-repair code is extremely susceptible to developer error; if and when you make a mistake, faulty read-repairs will introduce irreversible corruption into the database.

So sacrificing availability is problematic and eventual consistency is too complex to reasonably build applications. Yet these are the only two options, so it seems like I'm saying that you're damned if you do and damned if you don't. The CAP theorem is a fact of nature, so what alternative can there possibly be?

There is another way. You can't avoid the CAP theorem, but you can isolate its complexity and prevent it from sabotaging your ability to reason about your systems. The complexity caused by the CAP theorem is a symptom of fundamental problems in how we approach building data systems. Two problems stand out in particular: the use of mutable state in databases and the use of incremental algorithms to update that state. It is the interaction between these problems and the CAP theorem that causes complexity.

In this post I'll show the design of a system that beats the CAP theorem by preventing the complexity it normally causes. But I won't stop there. The CAP theorem is a result about the degree to which data systems can be fault-tolerant to machine failure. Yet there's a form of fault-tolerance that's much more important than machine fault-tolerance: human fault-tolerance. If there's any certainty in software development, it's that developers aren't perfect and bugs will inevitably reach production. Our data systems must be resilient to buggy programs that write bad data, and the system I'm going to show is as human fault-tolerant as you can get.

This post is going to challenge your basic assumptions on how data systems should be built. But by breaking down our current ways of thinking and re-imagining how data systems should be built, what emerges is an architecture more elegant, scalable, and robust than you ever thought possible.


Tuesday
Mar 29, 2011

My talks at POSSCON

Last week I went to POSSCON in Columbia, South Carolina. It was an interesting experience and a good reminder that not everyone in the world thinks like we do in Silicon Valley.

I gave two talks at the conference. One was a technical talk about how to build realtime Big Data systems, and the other was a non-technical talk about the things we do at BackType to be a super-productive team. Both slide decks are embedded below.