On the BackType tech blog, I wrote about the various options available for migrating data from a SQL database to Hadoop, the problems with existing solutions, and a new solution that we open-sourced. The tool we open-sourced is on GitHub here.
Hi, cascading.dbmigrate allows users to specify the number of 'chunks', and by default it is set to 4. My question is: what is the best practice for determining the right number?
My use case is that I need to incrementally export records weekly from a table with millions of rows. Should I export the whole table as one single big file in HDFS, or are multiple files recommended?
Generally you want output files to be 64-128 MB, which usually corresponds to a few million records per chunk. So you should choose the number of chunks such that each chunk ends up with roughly that many records.
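The arithmetic behind that advice can be sketched out. This is just a back-of-envelope helper, not part of cascading.dbmigrate; the function name and the average record size are assumptions you would replace with measurements from your own table.

```python
# Back-of-envelope sketch: pick a chunk count so each output file lands
# in the recommended 64-128 MB range. Targets the midpoint (~96 MB).

def chunk_count(num_records, avg_record_bytes, target_chunk_bytes=96 * 1024 * 1024):
    """Return a chunk count so each chunk holds roughly target_chunk_bytes of data."""
    total_bytes = num_records * avg_record_bytes
    # Never go below one chunk, even for tiny tables.
    return max(1, round(total_bytes / target_chunk_bytes))

# Example: 5 million records at ~200 bytes each is about 1 GB of data,
# which works out to around 10 chunks of ~100 MB apiece.
print(chunk_count(5_000_000, 200))
```

For a weekly incremental export of a few million rows, this kind of estimate usually lands in the single digits to low tens of chunks, rather than the default of 4.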
Why is it that we want files to be 64-128 MB? What's the problem with having a 1 GB file in HDFS and letting Hadoop handle the splitting, etc.? Sorry if it is an obvious question. Thanks.