Entries from April 1, 2013 - April 30, 2013

Friday
Apr122013

Break into Silicon Valley with a blog

I know a lot of non-technical people who would love to work in the venture-funded startup world, from consultants to finance people to other business types for which I'm not really sure exactly what it is they do. They hit obstacles trying to get into the startup world, finding that their skills are either irrelevant or hard to explain. My advice to all these people is the same:

Write a blog.

A blog can improve your life in enormous ways. Or to put it in business-speak: a blog has one of the highest ROI's of anything you can do.

Put yourself in the shoes of startups looking for talent. First off – startups are desperate for talent. The problem is that it's very difficult to identify great people – startups search through loads and loads of candidates.

Resumes and interviews only tell you so much about a person. It's really hard to stand out in a resume – you're not the only one putting over-inflated impressive-looking numbers and bullet points on your resume. And interviews are notorious for labeling bad people as good and good people as bad. So to maximize your odds of making it through the funnel, you need to show that you're awesome independent from the randomness of the normal process.

One thing you can do is write an insightful blog. This makes you look a lot more compelling. Now the reaction from startups will be "Hey, this person's really smart. We don't want to miss out on a potentially great hire, so let's put in a lot of effort to determine if there's a good fit."

A new dimension of opportunity

There's another huge advantage to having a blog besides being a mechanism to show that you're smart and insightful. A blog opens up a whole new dimension of opportunity for you. Instead of relying purely on outbound opportunities that you explicitly seek out yourself, you also will get inbound opportunities where people reach out to you with opportunities you never expected or dreamed of.

With an outbound opportunity you know exactly what you're seeking, whether it's landing a job or speaking at a conference or something else. Inbound opportunities, on the other hand, are highly uncertain. They come to you out of the blue. In my personal experience, many of the most awesome things I've done started as inbound opportunities: a book deal, flying all around the world for free to speak at conferences, a keynote at a major conference, and connecting with hundreds of awesome people who have reached out to me because of something I did or wrote publicly.

When you write a blog, you greatly increase the likelihood of getting awesome inbound opportunities. When it comes to breaking into Silicon Valley – instead of everything being on your shoulders to seek interesting companies, those companies will be reaching out to you.

A great phrase I've heard for this is "increasing your luck surface area". By providing value to people publicly, like writing insightful posts on a blog, you open yourself up to serendipitous, "lucky" opportunities.

Getting readers

Besides writing smart posts, you also need people to read your writing. Here's a few tips for accomplishing that.

First off, the title of a blog post is incredibly important. In very few words, you need to sell your potential reader that your blog post is going to be worth their time. I've found the the best titles are relevant to the potential reader, somewhat mysterious, and non-generic. Titles are definitely an art form, so you should think hard about how you'll name your posts. Sometimes I wait days to publish a post because I haven't thought of a compelling enough title.

Second, I highly recommend using Twitter as a distribution platform for your blog posts. The combination of Twitter and blogging leads to a beautifully virtuous cycle: your blog increases your Twitter following, and as your Twitter following grows you increase the reach of your blog. I consider Twitter to be the greatest professional networking tool ever devised – I follow people who tweet/blog interesting things and they follow me for the same reason. Then when I go to conferences I seek out the people who I know and respect from their online presence. When we meet, we already know a lot about each other and have a lot to talk about.

Lastly, you should embrace the online communities who will care about your blog. In Silicon Valley, the most important community is Hacker News. Hacker News is widely read in Silicon Valley by programmers, entrepreneurs, and investors. It can drive a lot of readers to your blog in a short amount of time.

Initially, it may be hard for you to get readers. Getting your posts on Hacker News is very much a crapshoot, and initially you'll have too small of a Twitter following to get that much distribution. But occasionally you'll write something smart that gets on Hacker News and gets shared around. Over time as your writing and distribution improves getting readers gets easier and easier.

What to write about

If you don't think you have anything to write about, then let me ask you a question. Do you really have that low of an opinion of yourself? Do you really think you have nothing interesting that you can share with the world? There's tons of stuff that you can write about that you don't even know to share. You have a ton of knowledge that you don't realize other people don't know because you spend all your time in your own head. Tell stories of times that you hustled. Write about the dynamics of big companies. Write case studies of anything related to running a business. Analyze the market for interesting new technologies (e.g. 3D printing, Bitcoin, etc). There's so much that you can write about.

Once you start blogging, you'll become attuned to random ideas you have throughout the day that would make good blog posts. Most of my blog ideas start off as email reminders to myself.

Final thoughts

If you haven't blogged before, you're going to suck at first. Being accurate, precise, and insightful is not enough. You have to learn how to hook people into your posts and keep the post engaging. You'll learn about the glorious world of internet commenting, where people constantly misinterpret what you say and apply very fallacious reasoning to your posts. You'll see people trash your ideas on Hacker News even though it's clear they didn't read your entire post. Sometimes they comment having only read the title! You'll learn over time different ways to structure the same information in order to minimize misinterpretation. You'll learn to anticipate fallacious reasoning and preemptively address those fallacies.

With writing, practice most definitely makes perfect. I sucked at writing at first, but I quickly improved.

A lot of people say they "don't have time to write." To be blunt, I think comments like this are the result of laziness and self-delusion. Writing a blog is really not that much work. You really can't find a couple hours to pump out a blog post? Just occasionally, instead of going out to the bar or seeing a movie or going surfing or doing whatever it is you do for fun, try writing. The potential benefits relative to the investment are MASSIVE. I haven't even discussed all the other benefits which on their own make blogging worthwhile.

Of course, writing isn't the only thing you can do to help yourself break into Silicon Valley. But it's an enormously easy way to make yourself stand out and open yourself to opportunities you never expected.

You should follow me on Twitter here.

Tuesday
Apr022013

Principles of Software Engineering, Part 1

This is the first in a series of posts on the principles of software engineering. There's far more to software engineering than just "making computers do stuff" – while that phrase is accurate, it does not come close to describing what's involved in making robust, reliable software. I will use my experience building large scale systems to inform a first principles approach to defining what it is we do – or should be doing – as software engineers. I'm not interested in tired debates like dynamic vs. static languages – instead, I intend to explore the really core aspects of software engineering.

The first order of business is to define what software engineering even is in the first place. Software engineering is the construction of software that produces some desired output for some range of inputs. The inputs to software are more than just method parameters: they include the hardware on which it's running, the rate at which it receives data, and anything else that influences the operation of the software. Likewise, the output of software is more than just the data it emits and includes performance metrics like latency.

I think there's a distinction between programming a computer and software engineering. Programming is a deterministic undertaking: I give a computer a set of instructions and it executes those instructions. Software engineering is different. One of the most important realizations I've had is that while software is deterministic, you can't treat it as deterministic in any sort of practical sense if you want to build robust software.

Here's an anectode that, while simple, hits on a lot of what software engineering is really about. At Twitter my team operated a Storm cluster used by many teams throughout the company for production workloads. Storm depends on Zookeeper to store various pieces of state relating to Storm's operation. One of the pieces of state stored is information about recent errors in application workers. This information feeds a Storm UI which users look at to see if their applications have any errors in them (the UI also showed other things such as statistics of running applications). Whenever an error bubbles out of application code in a worker, Storm automatically reports that error into Zookeeper. If a user is suppressing the error in their application code, they can call a "reportError" method to manually add that error information into Zookeeper.

There was a serious omission in this design: we did not properly consider how that reportError method might be abused. One day we suddenly received a flood of alerts for the Storm cluster. The cluster was having serious problems and no one's application was running properly. Workers were constantly crashing and restarting.

All the errors were Zookeeper related. I looked at the metrics for Zookeeper and saw it was completely overloaded with traffic. It was very strange and I didn't know what could possibly be overloading it like that. I took a look at which Zookeeper nodes were receiving the most API calls, and it turned out almost all the traffic was coming to the set of nodes used to store errors for one particular application running on the cluster. I shut that application down and the cluster immediately went back to normal.

The question now was why that application was reporting so many errors. I took a closer look at the application and discovered that all the errors being reported were null pointer exceptions – a user had submitted an application with a bug in it causing it to throw that exception for every input tuple. In addition, the application was catching every exception, suppressing it, and manually calling reportError. This was causing reportError to be called at the same rate at which tuples were being received – which was a lot.

An unfortunate interaction between two mistakes led to a major failure of a production system. First, a user deployed buggy, sloppy code to the cluster. Second, the reportError method had an assumption in it that errors were rare and thereby the amount of traffic to that method would be inconsequential. The user's buggy code broke that assumption, overloading Zookeeper and causing a cascading failure that took down every other application on the cluster. We fixed the problem by throttling the rate at which errors could be reported to Zookeeper: errors reported beyond that rate would be logged locally but not written to Zookeeper. This made reportError robust to high traffic and eliminated the possibility for cascading failure due to abuse of that functionality.

As this story illustrates, there's a lot of uncertainty in software engineering. You think your code is correct – yet it still has bugs in it. How your software is actually used differs from the model in your head when you wrote the code. You made all sorts of assumptions while writing the software, some of which are broken in practice. Your dependencies, which you use as a black box, randomly fail due to a misunderstanding of their functional input range. The most salient feature of software engineering is the degree to which uncertainty permeates every aspect of the construction of software, from designing it to implementing it to operating it in a production environment.

Learning from other fields of engineering

It's useful to look at other forms of engineering to learn more about software engineering. Take bridge engineering, for example. The output of a bridge is a stable platform for crossing a chasm. Even though a bridge is a static structure, there are many inputs: the weight of the vehicles crossing, wind, rain, snow, the occasional earthquake, and so on. A bridge is engineered to operate correctly under certain ranges of those inputs. There's always some magnitude of an earthquake for which a bridge will not survive, and that's deemed okay because that's such a low probability event. Likewise, most bridges won't survive being hit by a missile.

Software is similar. Software operates correctly only within a certain range of inputs. Outside those inputs, it won't operate correctly, whether it's failure, security holes, or just poor performance. In my Zookeeper example, the Zookeeper cluster was hit with more traffic than it could handle, leading to application failure. Similarly, a distributed database can only handle so many hardware failures in a short amount of time before failing in some respect, like losing data. That's fine though, because you tune the replication factor until the probability of such an event is low enough.

Another useful field to look at is rocket engineering. It takes a lot of failure and iteration to build a rocket that works. SpaceX, for example, had three failed rocket launch attempts before they finally reached orbit. The cause of failure was always something unexpected, some series of inputs that the engineers didn't account for. Rockets are filled to the brim with telemetry so that failures can be understood, learned from, and fixed. Each failure lets the engineers understand the input ranges to the rocket a little better and better engineer the rocket to handle a greater and greater part of the input space. A rocket is never finished – you never know when there will be some low probability series of inputs you haven't experienced yet that will lead to failure. STS-107 was the 113th launch of the Space Shuttle, yet it ended in disaster.

Software is very similar. Making software robust is an iterative process: you build and test it as best you can, but inevitably in production you'll discover new areas of the input space that lead to failure. Like rockets, it's crucial to have excellent monitoring in place so that these issues can be diagnosed. Over time, the uncertainty in the input space goes down, and software gets "hardened". SQL injection attacks and viruses are great examples of things that take advantage of software that operates incorrectly for part of its input space.

There's always going to be some part of the input space for which software fails – as an engineer you have to balance the probabilities and cost tradeoffs to determine where to draw that line. For all of your dependencies, you better understand the input ranges for which the dependencies operate within spec and design your software accordingly.

Sources of uncertainty in software

There are many sources of uncertainty in software. The biggest is that we just don't know how to make perfect software: bugs can and will be deployed to production. No matter how much testing you do, bugs will slip through. Because of this fact of software development, all software must be viewed as probabilistic. The code you write only has some probability of being correct for all inputs. Sometimes seemingly minor failures will interact in ways that lead to much greater failures like in my Zookeeper example.

Another source of uncertainty is the fact that humans are involved in running software in production. Humans make mistakes – almost every software engineer has accidentally deleted data from a database at some point. I've also experienced many episodes where an engineer accidentally launched a program that overloaded a critical internal service, leading to cascading failures.

Another source of uncertainty is what functionality your software should even have – very rarely are the specs fully understood and fleshed out from the get go. Instead you have to learn as you go, iterate, and refine. This has huge implications on how software should be constructed and creates tension between the desire to create reusable components and the desire to avoid wasted work.

There's uncertainty in all the dependencies you're using. Your dependencies will have bugs in them or will have unexpected behavior for certain inputs. The first time I hit a file handle limit error on Linux is an example of not understanding the limits of a dependency.

Finally, another big source of uncertainty is not understanding the range of inputs your software will see in production. This leads to anything from incorrect functionality to poor performance to security holes like injection or denial of service attacks.

This is by no means an exhaustive overview of sources of uncertainty in software, but it's clear that uncertainty permeates all of the software engineering process.

Engineering for uncertainty

You can do a much better job building robust software by being cognizant of the uncertain nature of software. I've learned many techniques over the years on how to design software better given the inherent uncertainties. I think these techniques should be part of the bread and butter skills for any software engineer, but I think too many engineers fall under the "software is deterministic" reasoning trap and fail to account for the implications of unexpected events happening in production.

Minimize dependencies

One technique for making software more robust is to minimize what your software depends on – the less that can go wrong, the less that will go wrong. Minimizing dependencies is more nuanced than just not depending on System X or System Y, but also includes minimizing dependencies on features of systems you are using.

Storm's usage of Zookeeper is a good example of this. The location of all workers throughout the cluster is stored in Zookeeper. When a worker gets reassigned, other workers must discover the new location as quickly as possible so that they can send messages to the correct place. There are two ways for workers to do this discovery, either via the pull method or the push method. In the pull method, workers periodically poll Zookeeper to get the updated worker locations. In the push method, a Zookeeper feature called "watches" is used for Zookeeper to send the information to all workers whenever the locations change. The push method immediately propogates the information, making it faster than the pull method, but it introduces a dependency on another feature of Zookeeper.

Storm uses both methods to propogate the worker location information. Every few seconds, Storm polls for updated worker information. In addition to this, Storm uses Zookeeper watches as an optimization to try to get the location information as fast as possible. This design ensures that even if the Zookeeper watch feature fails to work, a worker will still get the correct location information (albeit a bit slower in that particular instance). So Storm is able to take advantage of the watch feature without being fundamentally dependent on it. Most of the time the watch feature will work correctly and information will propogate quickly, but in the case that watches fail Storm will still work. This design turned out to be farsighted, as there was a serious bug in watches that would have affected Storm.

There's always a tradeoff between minimizing dependencies and minimizing the amount of code you need to produce to implement your application. In this case, doing the dual approach to location propogation was a good approach because it was a very small amount of code to achieve independence from that feature. On the other hand, removing Zookeeper as a dependency completely would not have been a good idea, as replicating that functionality would have been a huge amount of work and less reliable than just using a widely-used open-source project.

Lessen probability of cascading failures

A cascading failure is one of the worst things that can happen in production – when it happens it feels like the world is falling apart. One of the most common causes of cascading failures in my experience are accidental denial of service attacks like in my reportError example. The ultimate cause in these cases is a failure to respect the functional input range for components in your system. You can greatly reduce cascading failures by making interactions between components in your system explicitly respect those input ranges by using self-throttling to avoid accidental DOS'ng. This is the approach I used in my reportError example.

Another great technique for avoiding cascading failures is to isolate your components as much as possible and take away the ability for different components to affect each other. This is often easier said than done, but when possible it is a very useful technique.

Measure and monitor

When something unexpected happens in production, it's critical to have thorough monitoring in place so that you can figure out what happened. As software hardens more and more, unexpected events will get more and more infrequent and reproducing those events will become harder and harder. So when one of those unexpected events happens, you want as much data about the event as possible.

Software should be designed from the start to be monitored. I consider the monitoring aspects of software just as important as the functionality of the software itself. And everything should be measured – latencies, throughput stats, buffer sizes, and anything else relevant to the application. Monitoring is the most important defense against software's inherent uncertainty.

In the same vein, it's important to do measurements of all your components to gain an understanding of their functional input ranges. What throughputs can each component handle? How is latency affected by more traffic? How can you break those components? Doing this measurement work isn't glamorous but is essential to solid engineering.

Conclusion

Software engineering is a constant battle against uncertainty – uncertainty about your specs, uncertainty about your implementation, uncertainty about your dependencies, and uncertainty about your inputs. Recognizing and planning for these uncertainties will make your software more reliable – and make you a better engineer.

You should follow me on Twitter here.

Monday
Apr012013

My new startup

There's been a lot of speculation about what my new startup is doing, so I've decided to set the record straight and reveal all. We are working on one of the biggest problems on Earth, a problem that affects nearly every person on this planet. Our products will significantly improve the quality of life for billions of people.

We are going to revolutionize the bedsheet industry.

Think about it. There's been almost no innovation in bedsheets in thousands of years. There's nothing worse than waking up to discover one of the corners of your Egyptian cotton fitted sheets has slipped off the mattress. How is this not a solved problem yet? Why are we still using sheets with that annoying elastic in it to secure them to our mattresses? They slip all the time – and if you have a deep mattress, good luck finding sheets that even fit. You're just screwed.

Consider the impact of solving this problem, of a bedsheet product that never slips, that always stays secure on the mattress. This translates to better sleep, to less grogginess in the morning, to feeling more upbeat in the morning. This translates into less morning arguments between husbands and wives that spiral into divorces, child custody battles, and decades of trauma for the children.

Not only is this a big problem – it's a big opportunity. We've done extensive market research and discovered that our target market is the entire human population. At 7 billion people and an estimated average sale of $20 per sheet, this is at least a $140,000,000,000 opportunity.

We are going to solve this problem using modern, 21st century techniques and take bedsheets out of the Stone Age and into the future. We are going to make it possible to attach your sheets to your bed, completely solving the problem.

Solving this problem in a practical and cost-efficient way is not easy and will require significant engineering breakthroughs. If you're a world-class, rock star by day and ninja by night engineer who's as passionate about bedsheets as I am, please get in touch. I'd love to talk to you.

Bedsheets have been my true passion since I was a child. I'm excited to finally be focused on what I really care about, and I can't wait until the day when untucked sheets are a curious relic of the past.