
Follow-up to "The mathematics behind Hadoop-based systems"

In a previous post, I developed an equation modeling the stable runtime of an iterative, batch-oriented workflow. We saw how the equation explained a number of counter-intuitive behaviors of batch-oriented systems. In this post, we will learn how to measure the amount of overhead versus dynamic time in a workflow, which is the first step in applying the theory to optimize a workflow.

Recall that we started with the equation for the runtime of a single iteration of a workflow:

Runtime = Overhead + {Time to process one hour of data} * {Hours of Data}
T = O + P * H
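As a concrete instance of this model, here is a tiny sketch with illustrative numbers (not taken from any real cluster):

```python
# Runtime model for a single iteration: T = O + P * H
# All values are illustrative, in hours.
O = 2.0   # fixed overhead (job setup, scheduling, etc.)
P = 0.5   # processing time per hour of accumulated data
H = 8.0   # hours of data to process

T = O + P * H
print(T)  # 6.0
```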

We ended with the equation for the stable runtime of a workflow that runs repeatedly:

{Stable Runtime} = Overhead / (1 - {Time to process one hour of data})
T = O / (1 - P)
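A quick way to see why the runtime settles at O / (1 - P) is to simulate the recurrence behind it: each run must process the data that accumulated during the previous run, so T_next = O + P * T. A minimal sketch, with illustrative values for O and P:

```python
# Simulate repeated runs of a workflow: each run processes the data
# that accumulated during the previous run, so T_next = O + P * T.
O = 2.0   # overhead in hours (illustrative value)
P = 0.5   # hours of processing per hour of accumulated data (must be < 1)

T = 0.0
for _ in range(30):
    T = O + P * T  # runtime of the next iteration

print(round(T, 6))   # converges to O / (1 - P)
print(O / (1 - P))   # = 4.0, the predicted stable runtime
```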

Measuring O and P

The first step towards utilizing this theory for optimizing your workflow will be to measure the values of O and P for your workflow. This can be difficult if the cluster is shared with lots of other jobs, as the P for each run will vary. Let's start with the case where your workflow is the only application on the cluster.

A flawed first attempt at measurement

One idea for measuring O and P would be to run the workflow without any data. Using the equation for the runtime of a single run of the workflow, we can do the following math:

T = O + P * H
T = O + P * zero
O = T

Running without any data sets H to zero, which gives us the value of O and lets us trivially determine P.

Unfortunately, although appealing, this approach will not work. A Hadoop job normally has more tasks than the cluster has task slots. It can take a few minutes for a job to "get going" and reach full velocity by utilizing all the available task slots on the cluster. The time it takes to "get going" is normally constant, so it is captured by our O variable. When we run a job with a tiny amount of data, the job finishes before it ever utilizes the whole cluster, skewing our measurement of O.

Better way to measure O and P

Since we have two variables and can't zero out the equation, we are going to need two measurements of the stable runtime to be able to solve for O and P. We already know that each measurement has to be done with a significant amount of data to avoid skewing the results.

The variable we can most easily control is the overhead O. By adding a "sleep(1 hour)" call at the beginning of the workflow, we effectively increase O by one hour. With this insight, we can use the following procedure to get an approximate measurement of O and P:

  1. Run the workflow repeatedly until the runtime stabilizes. This gives us our first measurement: T1 = O / (1 - P)
  2. Add a "sleep(1 hour)" call at the beginning of the workflow.
  3. Run the workflow repeatedly until the runtime stabilizes. This gives us our second measurement: T2 = (O + 1) / (1 - P)
  4. Solve the two equations for O and P.
  5. Don't forget to remove the "sleep(1 hour)" call from the workflow!

Here is the solution to those equations:

T1 = O / (1 - P)
T2 = (O + 1) / (1 - P)

T1 and T2 are known, O and P are unknown

T2 / T1 = (O + 1) / O
T2 * O = T1 * (O + 1)
T2 * O = T1 * O + T1
O * (T2 - T1) = T1
O = T1 / (T2 - T1)

T1 = (T1 / (T2 - T1)) / (1 - P)
1 - P = 1 / (T2 - T1)
P = 1 - 1 / (T2 - T1)

O = T1 / (T2 - T1)
P = 1 - 1 / (T2 - T1)
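In code, the solution above is a one-liner for each variable (all times in hours, with the sleep adding exactly one hour to O). A sketch with made-up sample measurements:

```python
# Solve for O and P from two stable-runtime measurements (hours).
# T1: stable runtime of the unmodified workflow
# T2: stable runtime after adding sleep(1 hour) to the workflow
T1, T2 = 4.0, 6.0   # illustrative measurements

O = T1 / (T2 - T1)        # overhead in hours
P = 1 - 1 / (T2 - T1)     # hours of processing per hour of data

print(O, P)  # 2.0 0.5

# Sanity check: both measurements satisfy the stable-runtime equation
assert abs(T1 - O / (1 - P)) < 1e-9
assert abs(T2 - (O + 1) / (1 - P)) < 1e-9
```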

Measuring O and P when other jobs are running on the cluster

Having other jobs run on the cluster alongside the workflow you are trying to measure will skew the results, as those jobs use up resources and change the P value for your workflow. Unless the other jobs' utilization of the cluster is constant, your P value will fluctuate from iteration to iteration, making O and P harder to measure.

What I recommend in this scenario is to measure the average runtime of the workflow over many iterations, both before and after increasing O with "sleep(1 hour)". There will definitely be more error with this technique, so it also makes sense to take a couple more measurements with "sleep(2 hours)" and "sleep(3 hours)" to reduce the variance.
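With several sleep values, each average stable runtime satisfies T = (O + s) / (1 - P), which is linear in the added sleep s: the slope is 1 / (1 - P) and the intercept is O / (1 - P). A least-squares line through the (sleep, runtime) measurements then recovers both values. A sketch with made-up noisy measurements, assuming NumPy is available:

```python
import numpy as np

# Average stable runtimes (hours) measured with different added sleeps.
# Illustrative noisy data for a workflow with roughly O = 2, P = 0.5.
sleeps   = np.array([0.0, 1.0, 2.0, 3.0])
runtimes = np.array([4.1, 5.9, 8.1, 9.9])

# T(s) = O/(1-P) + s/(1-P) is a line in s. Fit slope and intercept.
slope, intercept = np.polyfit(sleeps, runtimes, 1)

P = 1 - 1 / slope          # slope = 1 / (1 - P)
O = intercept * (1 - P)    # intercept = O / (1 - P)
print(round(O, 2), round(P, 2))  # 2.07 0.49
```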


Now you should be able to measure the values of O and P for your workflow. In my next follow-up, I'll discuss how you can use these O and P values to guide your optimization efforts.

You should follow me on Twitter here.
