hadoop-common-user mailing list archives

From "Richard K. Turner" <...@petersontechnology.com>
Subject File Per Column in Hadoop
Date Mon, 10 Mar 2008 18:01:38 GMT

I have found that storing each column in its own gzip file can really speed up processing
of arbitrary subsets of columns.  For example, suppose I have two CSV files called csv_file1.gz
and csv_file2.gz.  I can create a file for each column as follows (a rough sketch of the
splitting step appears after the listing):

   csv_file1/col1.gz
   csv_file1/col2.gz
   csv_file1/col3.gz
     .
     .
     .
   csv_file1/colN.gz
   csv_file2/col1.gz
   csv_file2/col2.gz
   csv_file2/col3.gz
     .
     .
     .
   csv_file2/colN.gz
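
The splitting itself needs nothing Hadoop-specific.  Here is a minimal sketch in plain
Java, assuming an uncompressed input CSV with no quoted or embedded commas; the class
name CsvColumnSplitter is made up for illustration:

   import java.io.BufferedReader;
   import java.io.File;
   import java.io.FileOutputStream;
   import java.io.FileReader;
   import java.io.IOException;
   import java.io.OutputStreamWriter;
   import java.io.Writer;
   import java.util.zip.GZIPOutputStream;

   public class CsvColumnSplitter {
       public static void main(String[] args) throws IOException {
           BufferedReader in = new BufferedReader(new FileReader(args[0])); // e.g. csv_file1.csv
           File outDir = new File(args[1]);                                 // e.g. csv_file1/
           outDir.mkdirs();

           String line = in.readLine();
           if (line == null) { in.close(); return; }
           int numCols = line.split(",", -1).length; // column count from the first row

           // One gzip stream per column: col1.gz ... colN.gz
           Writer[] cols = new Writer[numCols];
           for (int i = 0; i < numCols; i++) {
               cols[i] = new OutputStreamWriter(new GZIPOutputStream(
                   new FileOutputStream(new File(outDir, "col" + (i + 1) + ".gz"))));
           }

           // Write field i of every row to column file i, one value per line.
           do {
               String[] fields = line.split(",", -1);
               for (int i = 0; i < numCols; i++) {
                   cols[i].write(i < fields.length ? fields[i] : "");
                   cols[i].write('\n');
               }
           } while ((line = in.readLine()) != null);

           for (Writer w : cols) w.close();
           in.close();
       }
   }

Reading a subset of columns then only has to decompress the colN.gz files it actually
needs, which is where the speedup comes from.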


I would like to use this approach when writing MapReduce jobs in Hadoop.  In order to do
this, I think I would need to write an InputFormat, which I can look into (a rough sketch
of what I have in mind is below).  However, I want to avoid the situation where a map task
reads column files from different nodes.  To avoid this, all column files derived from the
same CSV file must be co-located on the same node (or nodes, if replication is enabled).
So for my example, I would like to ask HDFS to keep all files in dir csv_file1 together on
the same node(s), and likewise for dir csv_file2.  Does anyone know how to do this in Hadoop?
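
For what it's worth, here is a rough sketch of the InputFormat idea against the
org.apache.hadoop.mapred API: one split per column directory, so a single map task sees
all of that directory's colN.gz files.  The class name ColumnDirInputFormat is made up,
the record reader is stubbed out, and none of this answers the co-location question --
the host hints only help if HDFS has actually kept the column files together.

   import java.io.IOException;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.FileInputFormat;
   import org.apache.hadoop.mapred.FileSplit;
   import org.apache.hadoop.mapred.InputSplit;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.hadoop.mapred.RecordReader;
   import org.apache.hadoop.mapred.Reporter;

   public class ColumnDirInputFormat extends FileInputFormat<LongWritable, Text> {

       // One split per column directory (csv_file1, csv_file2, ...).
       public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
           Path[] dirs = getInputPaths(job);
           InputSplit[] splits = new InputSplit[dirs.length];
           for (int i = 0; i < dirs.length; i++) {
               // Length is a placeholder; real host hints would come from the
               // block locations of the files inside the directory, which is
               // only meaningful if those files are co-located to begin with.
               splits[i] = new FileSplit(dirs[i], 0, Long.MAX_VALUE, job);
           }
           return splits;
       }

       public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
               JobConf job, Reporter reporter) throws IOException {
           // A real reader would open each requested colN.gz under
           // ((FileSplit) split).getPath() and advance them in lockstep,
           // reassembling one record per row.  Omitted here.
           throw new UnsupportedOperationException("record reader not sketched");
       }
   }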

Thanks,

Keith
