
Migrating data from a SQL database to Hadoop

On the BackType tech blog, I wrote about the options available for migrating data from a SQL database to Hadoop, the problems with existing solutions, and a new solution that we open-sourced. The tool we open-sourced is on GitHub here.

Reader Comments (3)

Hi - cascading.dbmigrate lets users specify the number of 'chunks', which defaults to 4. My question is: what is the best practice for determining the right number?

My use case is that I need to incrementally export records every week from a table with millions of rows. Should I export the whole table as a single big file in HDFS, or are multiple files recommended?


July 7, 2010 | Unregistered Commenter Jack

Generally you want output files to be 64-128 MB, which usually corresponds to a few million records per chunk. So you should choose the number of chunks such that each chunk holds roughly that many records.

July 7, 2010 | Unregistered Commenter nathanmarz
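
The sizing rule of thumb in the reply above can be turned into a quick calculation. This is only a rough sketch and not part of cascading.dbmigrate: the function name `suggest_num_chunks`, the average row size, and the 128 MB default target are all assumptions chosen for illustration, and you would need to measure the real serialized record size for your own table.

```python
import math


def suggest_num_chunks(total_rows, avg_row_bytes, target_file_mb=128):
    """Estimate a chunk count so each output file is about target_file_mb.

    Hypothetical helper: avg_row_bytes is an assumed average size of one
    exported record on disk, which you would have to measure yourself.
    """
    if total_rows <= 0:
        return 1
    target_bytes = target_file_mb * 1024 * 1024
    total_bytes = total_rows * avg_row_bytes
    # Round up so no chunk exceeds the target size, and never go below 1.
    return max(1, math.ceil(total_bytes / target_bytes))


# Example: 10 million rows at an assumed ~50 bytes per record.
print(suggest_num_chunks(10_000_000, 50))  # prints 4
```

With these assumed numbers the estimate happens to match the default of 4 chunks; a wider table or a 64 MB target would call for more.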

Why do we want files to be 64-128 MB? What's the problem with having a 1 GB file in HDFS and letting Hadoop handle the splitting? Sorry if this is an obvious question. Thanks.

July 7, 2010 | Unregistered Commenter Jack
