Thrift + Graphs = Strong, flexible schemas on Hadoop

There are a lot of misconceptions about what Hadoop is useful for and what kind of data you can put in it. A lot of people think that Hadoop is meant for unstructured data like log files. While Hadoop is great for log files, it's also fantastic for strongly typed, structured data.

In this post I'll discuss how you can use a tool like Thrift to store strongly typed data in Hadoop while retaining the flexibility to evolve your schema. We'll look at graph-based schemas and see why they are an ideal fit for many Hadoop-based applications.
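As a hedged illustration of what a graph-based Thrift schema can look like (these type names are made up for this sketch, not taken from the post): entities become nodes identified by union types, relationships become edge structs, and facts become properties, each of which can be added later without touching existing data.

```thrift
// Illustrative sketch of a graph-style schema. A person node is
// identified by a union, so new identifier kinds can be added later.
union PersonID {
  1: string cookie;
  2: i64 user_id;
}

// An edge relates two nodes.
struct FollowsEdge {
  1: required PersonID follower;
  2: required PersonID followee;
}

// A property attaches one fact to one node; new fact types are new
// union fields, which old data never needs to know about.
union PersonPropertyValue {
  1: string name;
  2: i32 age;
}

struct PersonProperty {
  1: required PersonID id;
  2: required PersonPropertyValue value;
}

// Every record in the dataset is one of these units.
union DataUnit {
  1: PersonProperty property;
  2: FollowsEdge follow;
}
```

Because each fact or edge is its own small, immutable record, evolving the schema means adding union fields rather than migrating old data.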



Follow-up to "The mathematics behind Hadoop-based systems"

In a previous post, I developed an equation modeling the stable runtime of an iterative, batch-oriented workflow. We saw how the equation explained a number of counter-intuitive behaviors of batch-oriented systems. In this post, we will learn how to measure the amount of overhead versus dynamic time in a workflow, which is the first step in applying the theory to optimize a workflow.
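As a sketch of what such a measurement could look like (the model, numbers, and names here are my own illustration, not the post's method): if each run's time is a fixed overhead plus time proportional to the data processed, a least-squares fit over logged runs separates the two components.

```python
# Hedged sketch: measured (GB processed, minutes) pairs for several runs.
# If runtime = overhead + dynamic_per_gb * data, an ordinary least-squares
# fit recovers both the overhead and the dynamic cost.
runs = [(10, 35.0), (40, 95.0), (25, 65.0), (60, 135.0)]

n = len(runs)
sx = sum(x for x, _ in runs)
sy = sum(y for _, y in runs)
sxx = sum(x * x for x, _ in runs)
sxy = sum(x * y for x, y in runs)

# Standard least-squares slope and intercept.
dynamic_per_gb = (n * sxy - sx * sy) / (n * sxx - sx * sx)
overhead = (sy - dynamic_per_gb * sx) / n

print(overhead, dynamic_per_gb)  # minutes of overhead, minutes per GB
```

With real logs the points won't sit exactly on a line, but the fit still gives a usable split between overhead and dynamic time.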



Introducing "Nanny" - a really simple dependency management tool

Dependency management in software projects is a pretty simple problem when you think about it. A tool to manage dependencies just needs to do three things:

  1. Provide a mechanism to specify a project's direct dependencies
  2. Download the transitive closure of a project's dependencies
  3. Publish packages that other projects can use as dependencies
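Step 2 is the only algorithmically interesting one. A minimal sketch, with made-up package names, is a breadth-first walk over each package's direct dependency lists:

```python
from collections import deque

def transitive_deps(direct, root):
    """All packages reachable from root's direct dependencies."""
    seen, queue = set(), deque(direct.get(root, []))
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue          # already resolved, avoids cycles
        seen.add(pkg)
        queue.extend(direct.get(pkg, []))
    return seen

# Illustrative dependency map: myapp needs web and json directly,
# and picks up http and sockets transitively.
deps = {
    "myapp": ["web", "json"],
    "web": ["http", "json"],
    "http": ["sockets"],
}
print(sorted(transitive_deps(deps, "myapp")))
# -> ['http', 'json', 'sockets', 'web']
```

Version conflict resolution is the part that makes real tools hairy, but the core download step really is this simple.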

Some languages have good dependency management systems, RubyGems for example. Others, like Java, have tools like Maven, which I would call a complex solution to a simple problem. You shouldn't need to buy a book to understand the solution to such a simple problem. Plus, these dependency management systems are all language-specific.

I've seen companies do crazy things to manage their dependencies. One company, to manage their jar files, would put all the jars that any project might need in a special "jars" project. You would then need to set up a JARS_HOME environment variable and be sure to update the jars project whenever you needed a new dependency. If you needed an older version of something - forget about it. Plus, it made deploys a huge pain, as each project had to ship with dependencies it didn't even use.

Enter Nanny.

Nanny makes it really easy to set up an internal repository for managing dependencies between projects. I spent a night hacking out Nanny, and we're finding it incredibly useful at BackType. We manage dependencies between all our Java/Clojure projects with it, we distribute custom builds of Hadoop and Cassandra with it, and we're starting to use it to manage dependencies between our Python projects.

Nanny is hosted on GitHub and comes with documentation to get you started in no time.

You should follow me on Twitter here.


Why so many research papers are so hard to understand

Wondering why so many research papers are so hard to read? I got some great words of wisdom from Professor Jean-Claude Latombe on the subject back when I was in his research group at Stanford. He described two strategies people employ to get their papers published in a journal. The first is to do some great research and write up the results clearly.

Sometimes, however, you invest time in research and end up with results that aren't particularly interesting. If you wrote up that insignificant research clearly, its weakness would be obvious and the paper would be rejected. The strategy for this scenario is to present the research in a complex, non-straightforward manner. Now no one will ever say your paper is great, but people will be less likely to say it's flat-out bad (after all, it sounds like you were researching something really complex!). Your paper lands in the middle of the pack, which may be good enough to get published.

For an incredibly vivid illustration of this strategy, check out SCIgen, an automatic CS paper generator that got a randomly generated paper accepted into a conference.



Stateless, fault-tolerant scheduling using randomness

I wrote a fun little post on the BackType tech blog: "Stateless, fault-tolerant scheduling using randomness".


My conversation with the great John McCarthy

Back when I was a sophomore at Stanford, I cold emailed John McCarthy (inventor of Lisp, one of the godfathers of AI) and asked if I could meet him and learn about his life and perspective. To my pleasant surprise, he was happy to meet with me. A week later I biked over to his house and had a very interesting two hour conversation with him.

McCarthy told me a lot about his career and his transition from MIT to Stanford. What struck me most, though, was what he had to say about Lisp. Lisp was a complete afterthought for McCarthy. He just needed a language for doing AI research, his true passion. Since the languages available at the time were difficult to use, he created one that he felt would make him the most productive in his research. This reminds me a lot of how Isaac Newton invented calculus to further his research on gravity. Calculus transformed mathematics in its own right, and the innovations of Lisp are still reverberating through the programming language world. Yet neither was an end in itself; both were means toward larger goals. Lisp, one of the most abstract languages you can use, was created with a purely practical motivation.

I consider it a privilege to have spoken with such a legendary person. I wish I had a transcript of our conversation, but alas I don't. I'm still surprised at the ease with which I was able to meet him, but ultimately I'm glad I took advantage of an opportunity most people don't realize is available to them.


Mimi Silbert: the greatest hacker in the world

Here's a problem for you: build an organization that transforms thousands of nasty, violent, dope fiend ex-cons into decent, productive members of society. OK, now do it with no money and no staff. And achieve a >90% success rate. While you're at it, make the organization double as a business that provides valuable services to the community. And make the whole thing self-sustaining.

Amazingly, Mimi Silbert accomplished this very feat. She's been at it for 35 years and her organization is called the Delancey Street Foundation. I've been scouring the web for every bit of information I could find about how Delancey Street operates, and simply put, it's the most spectacular and innovative organization I've ever come across.



Tips for Optimizing Cascading Flows

Here are a few tips for optimizing your Cascading flows. I also recommend checking out "7 Tips for Improving MapReduce Performance" for general MapReduce optimization tips.



The mathematics behind Hadoop-based systems

I wish I had known this a year ago. Now, with some simple mathematics I can finally answer:

  • Why doesn't the speed of my workflow double when I double the amount of processing power?
  • Why does a 10% failure rate cause my runtime to go up by 300%?
  • How does optimizing out 30% of my workflow runtime cause the runtime to decrease by 80%?
  • How many machines should I have in my cluster to be adequately performant and fault-tolerant?

All of these questions are neatly answered by one simple equation.
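As a hedged sketch of the kind of model that produces these effects (this is my own illustration, not necessarily the post's equation): suppose each run costs a fixed overhead plus dynamic time proportional to the data that accumulated during the previous run. The stable runtime is then the fixed point of T = overhead + d*T.

```python
# Assumed model: T = overhead + d * T, where d is the fraction of a run
# spent processing data that piled up during the previous run. Solving
# the fixed point gives T = overhead / (1 - d).

def stable_runtime(overhead, d):
    """Fixed point of T = overhead + d*T, valid for 0 <= d < 1."""
    assert 0 <= d < 1, "the workflow falls further behind every run if d >= 1"
    return overhead / (1 - d)

# Doubling the cluster roughly halves d (same incoming data, twice the
# processing power) but leaves the overhead fixed, so the speedup is
# not simply 2x:
slow = stable_runtime(overhead=60, d=0.8)  # minutes
fast = stable_runtime(overhead=60, d=0.4)
print(slow, fast)
# -> 300.0-ish and 100.0, a 3x change from a 2x hardware change
```

The 1/(1 - d) term is what makes small changes near saturation produce wildly disproportionate swings in runtime, which is exactly the flavor of the questions above.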
