On the BackType tech blog, I wrote about the various options available for migrating data from a SQL database to Hadoop, the problems with existing solutions, and a new solution that we open-sourced. The tool we open-sourced is on GitHub here.
Hi, cascading.dbmigrate allows users to specify the number of 'chunks', and by default it is set to 4. My question is: what is the best practice for determining the right number?
My use case is that I need to incrementally export records weekly from a table with millions of rows. Should I export the whole table as one single big file in HDFS, or are multiple files recommended?
Generally you want output files to be 64-128 MB, which usually corresponds to a few million records per chunk. So you should choose the number of chunks such that each chunk ends up with roughly that many records.
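The arithmetic behind that advice can be sketched out. This is just a back-of-envelope helper, not part of cascading.dbmigrate; the function name and the average record size are assumptions you would replace with measurements from your own table.

```python
# Back-of-envelope sketch: pick a chunk count so each output file lands
# in the recommended 64-128 MB range. Targets the midpoint (~96 MB).

def chunk_count(num_records, avg_record_bytes, target_chunk_bytes=96 * 1024 * 1024):
    """Return a chunk count so each chunk holds roughly target_chunk_bytes of data."""
    total_bytes = num_records * avg_record_bytes
    # Never go below one chunk, even for tiny tables.
    return max(1, round(total_bytes / target_chunk_bytes))

# Example: 5 million records at ~200 bytes each is about 1 GB of data,
# which works out to around 10 chunks of ~100 MB apiece.
print(chunk_count(5_000_000, 200))
```

For a weekly incremental export of a few million rows, this kind of estimate usually lands in the single digits to low tens of chunks, rather than the default of 4.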
Why is it that we want files to be 64-128 MB? What's the problem with having a 1 GB file in HDFS and letting Hadoop handle the splitting, etc.? Sorry if it is an obvious question. Thanks.