Cascalog Presentation at Bay Area Clojure User Group

Here are the slides from my presentation about Cascalog at the Bay Area Clojure User Group last night:


New Cascalog features: outer joins, combiners, sorting, and more

In the first tutorial for Cascalog, I showed off many of Cascalog's powerful features: joins, aggregates, subqueries, custom operations, and more. Since Cascalog's release a couple of weeks ago, I've added a number of new features that seriously increase the expressiveness and performance of the language without compromising its simplicity or flexibility.


Introducing Cascalog: a Clojure-based query language for Hadoop

I'm very excited to be releasing Cascalog as open-source today. Cascalog is a Clojure-based query language for Hadoop inspired by Datalog.


  • Simple - Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.
  • Expressive - Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.
  • Interactive - Run queries from the Clojure REPL.
  • Scalable - Cascalog queries run as a series of MapReduce jobs.
  • Query anything - Query HDFS data, database data, and/or local data by making use of Cascading's "Tap" abstraction.
  • Careful handling of null values - Null values can make life difficult. Cascalog has a feature called "non-nullable variables" that makes dealing with nulls painless.
  • First-class interoperability with Cascading - Operations defined for Cascalog can be used in a Cascading flow and vice versa.
  • First-class interoperability with Clojure - You can use regular Clojure functions as operations or filters, and since Cascalog is a Clojure DSL, you can use it in other Clojure code.


Fun with equality in Clojure

I ran into some very non-intuitive behavior from Clojure recently. See if you can guess what "foo" is in the following examples:

Example 1:

user=> foo
1
user=> (= foo 1)
true
user=> (= [foo 2] [1 2])
true
user=> (= {foo 2} {1 2})
false

Example 2:

user=> foo
false
user=> (= foo false)
true
user=> (when foo (println "shouldn't print?"))
shouldn't print?

Yikes, huh? Here are the answers:

Example 1: (def foo (Long. "1"))

Example 2: (def foo (Boolean. false))

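For example 1, the map equality breaks down because map lookup goes through Java's hashCode and equals, and a Long and an Integer are not interchangeable there even when they hold the same numeric value: Long.equals never returns true for an Integer, and their hash codes differ for some values, such as negative numbers. In example 2, Clojure treats anything other than false or nil as true in a conditional, so a Boolean object constructed with false is truthy in a conditional even though it is equal to false.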

I would definitely consider #1 a bug, as part of the contract of equality is that two equal objects have the same hashcode. #2 is more debatable, but it seems more intuitive that the Boolean object false be considered false in conditionals as well.

You should follow me on Twitter here.


Migrating data from a SQL database to Hadoop

On the BackType tech blog, I wrote about the various options available for migrating data from a SQL database to Hadoop, the problems with existing solutions, and a new solution that we open-sourced. The tool we open-sourced is on GitHub here.


Proof that 1 = 0 using a common logical fallacy

A while ago I read a post by Daniel Levine that shows a formal proof of x*0 = 0. Here's a reprint of the proof:

  1. y = y (identity axiom)
  2. y - y = 0 (arithmetic)
  3. x*(y - y) = 0 (substitution)
  4. x*y - x*y = 0 (distributive)
  5. x*y = x*y (arithmetic)

The logic of this proof is that since we can reduce x*0 = 0 to the identity axiom, x*0 = 0 is true. Unfortunately, this is not logically sound.
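
To see why, it helps to apply the same style of argument to a claim that is obviously false. The following is an illustrative sketch of the general pattern, not necessarily the argument made in the full post: start from the claim 1 = 0 and reduce it to something true.

```latex
\begin{align*}
1 &= 0 && \text{(claim)} \\
1 \cdot 0 &= 0 \cdot 0 && \text{(multiply both sides by 0)} \\
0 &= 0 && \text{(arithmetic)}
\end{align*}
```

The last line is true, yet 1 = 0 is not. Deriving a true statement from a claim shows only that the claim implies something true; it does not establish the claim itself. A sound proof has to run in the other direction, from axioms to the claim, or use only reversible steps (multiplying both sides by 0 is not reversible).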


Thrift + Graphs = Strong, flexible schemas on Hadoop

There are a lot of misconceptions about what Hadoop is useful for and what kind of data you can put in it. A lot of people think that Hadoop is meant for unstructured data like log files. While Hadoop is great for log files, it's also fantastic for strongly typed, structured data.

In this post I'll discuss how you can use a tool like Thrift to store strongly typed data in Hadoop while retaining the flexibility to evolve your schema. We'll look at graph-based schemas and see why they are an ideal fit for many Hadoop-based applications.


Follow-up to "The mathematics behind Hadoop-based systems"

In a previous post, I developed an equation modeling the stable runtime of an iterative, batch-oriented workflow. We saw how the equation explained a number of counter-intuitive behaviors of batch-oriented systems. In this post, we will learn how to measure the amount of overhead versus dynamic time in a workflow, which is the first step in applying the theory to optimize a workflow.


Introducing "Nanny" - a really simple dependency management tool

Dependency management in software projects is a pretty simple problem when you think about it. A tool to manage dependencies just needs to do three things:

  1. Provide a mechanism to specify the direct dependencies of a project
  2. Download the transitive closure of a project's dependencies
  3. Publish packages that can be used as dependencies of other projects
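
To make step 2 concrete, here is a minimal Python sketch of computing the transitive closure of a project's dependencies. It is only an illustration of the idea, not how Nanny is implemented; the `transitive_closure` function, the `direct_deps` mapping, and all package names are hypothetical.

```python
from collections import deque


def transitive_closure(package, direct_deps):
    """Return the set of all packages reachable from `package`.

    `direct_deps` maps a package name to the list of packages it
    depends on directly. Packages missing from the mapping are
    treated as having no dependencies.
    """
    seen = set()
    queue = deque(direct_deps.get(package, ()))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; this also guards against cycles
        seen.add(dep)
        queue.extend(direct_deps.get(dep, ()))
    return seen


# Hypothetical dependency graph, for illustration only.
direct_deps = {
    "my-app": ["web-lib", "json-lib"],
    "web-lib": ["http-lib", "json-lib"],
    "http-lib": ["socket-lib"],
}

print(sorted(transitive_closure("my-app", direct_deps)))
# ['http-lib', 'json-lib', 'socket-lib', 'web-lib']
```

Only "web-lib" and "json-lib" are declared by "my-app" itself; the other two packages are pulled in because the tool follows dependencies of dependencies.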

Some languages have good dependency management systems, such as RubyGems. Others, like Java, have tools like Maven, which I would call a complex solution to a simple problem. You shouldn't need to buy a book to understand the solution to such a simple problem. Plus, these dependency management systems are all language-specific.

I've seen companies do crazy things to manage their dependencies. One company, to manage their jar files, would put all the jars that any project might need in a special "jars" project. You would then need to set up a JARS_HOME environment variable and be sure to update the jars project whenever you needed any of the dependencies. If you needed an older version of something, forget about it. Plus, it made deploys a huge pain, as each project had to ship with dependencies it didn't even use.

Enter Nanny.

Nanny makes it really easy to set up an internal repository to manage dependencies between projects. I spent a night hacking out Nanny and we're finding it incredibly useful at BackType. We manage dependencies between all our Java/Clojure projects using it, we distribute custom builds of Hadoop and Cassandra with it, and we're starting to use it to manage dependencies between our Python projects.

Nanny is hosted at GitHub and comes with documentation to get you started in no time.


Why so many research papers are so hard to understand

Wondering why so many research papers are so hard to read? I got some great words of wisdom from Professor Jean-Claude Latombe on the subject back when I was in his research group at Stanford. He described two strategies people employ for getting a paper published in a journal. The first is to do some great research and write the results up in a clear, well-written way.

Sometimes, however, you may invest time in research and end up with results that are not particularly interesting. If you wrote a paper describing your insignificant research in a clear manner, then your paper would clearly be bad and would be rejected. A strategy to employ in this scenario is to present your research in a complex, non-straightforward manner. Now no one will ever say your paper is great, but people will be less likely to say your paper is flat-out bad (after all, it sounds like you were researching something really complex!). So your paper will fall into the middle of the pack, which may be good enough to get published.

For an incredibly vivid illustration of this strategy, check out SCIgen, an automatic CS paper generator that got a randomly generated paper accepted into a conference.
