hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: File Per Column in Hadoop
Date Tue, 11 Mar 2008 15:29:43 GMT
It would be interesting to integrate knowledge of columnar structure with compression. I wouldn't
approach it as an InputFormat problem (because of the near impossibility of colocating all
of these files) - but perhaps extend the compression libraries in Hadoop, so that the library
understood the structured nature of the underlying dataset.

One would still store all the columns together in a single row, but each block of (a compressed
SequenceFile) would actually be stored as a set of compressed sub-blocks (each sub-block
representing one column). This would give most of the benefits of columnar compression (not all,
because one would only be compressing a block at a time) while still being transparent to MapReduce.
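
A very rough sketch of what I mean (hypothetical - this is not an existing Hadoop codec, and the
class and method names are made up for illustration): each block carries a small header of
per-column compressed lengths, followed by one independently compressed sub-block per column,
so a reader that knows the layout can skip straight to the columns it needs.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/**
 * Hypothetical sketch: serialize one block of rows as a set of
 * independently compressed per-column sub-blocks. The block as a whole
 * still contains every column of every row it covers, so the file stays
 * row-oriented from the point of view of HDFS and MapReduce.
 */
public class ColumnarBlockWriter {

  /** rows.get(r)[c] is the value of column c in row r of this block. */
  public static byte[] writeBlock(List<String[]> rows, int numColumns)
      throws IOException {
    ByteArrayOutputStream block = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(block);

    // Compress each column's values separately so similar data sits together.
    byte[][] compressedColumns = new byte[numColumns][];
    for (int c = 0; c < numColumns; c++) {
      ByteArrayOutputStream colBytes = new ByteArrayOutputStream();
      try (DataOutputStream colOut =
               new DataOutputStream(new GZIPOutputStream(colBytes))) {
        for (String[] row : rows) {
          colOut.writeUTF(row[c]);
        }
      }
      compressedColumns[c] = colBytes.toByteArray();
    }

    // Header: row count, column count, then each column's compressed length,
    // so a reader can seek past the sub-blocks it does not need.
    out.writeInt(rows.size());
    out.writeInt(numColumns);
    for (byte[] col : compressedColumns) {
      out.writeInt(col.length);
    }
    for (byte[] col : compressedColumns) {
      out.write(col);
    }
    out.flush();
    return block.toByteArray();
  }
}

A matching reader would parse the header and decompress only the sub-blocks for the columns a
given job actually touches.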

So - doable, I would think, and very sexy - but I don't know how complex (the compression code
seems hairy, but that's probably just ignorance). We would also love to get to this stage
(we already have the metadata with each file) - but I think it would take us many months before
we got there.

Joydeep



-----Original Message-----
From: Richard K. Turner [mailto:rkt@petersontechnology.com]
Sent: Mon 3/10/2008 11:01 AM
To: core-user@hadoop.apache.org
Subject: File Per Column in Hadoop
 

I have found that storing each column in its own gzip file can really speed up processing
time on arbitrary subsets of columns. For example, suppose I have two CSV files called csv_file1.gz
and csv_file2.gz. I can create a file for each column as follows (a rough sketch of the
splitting step follows the listing):

   csv_file1/col1.gz
   csv_file1/col2.gz
   csv_file1/col3.gz
     .
     .
     .
   csv_file1/colN.gz
   csv_file2/col1.gz
   csv_file2/col2.gz
   csv_file2/col3.gz
     .
     .
     .
   csv_file2/colN.gz
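
Something like the following standalone sketch will produce the per-column files (assumptions:
comma-delimited input whose first line is a header used only to determine the column count, and
plain gzip via java.util.zip; the class name is made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

/** Sketch: split a CSV file into one gzip-compressed file per column. */
public class ColumnSplitter {

  public static void split(Path csvFile, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);
    try (BufferedReader reader = Files.newBufferedReader(csvFile)) {
      String header = reader.readLine();
      if (header == null) {
        return; // empty input file
      }
      int numColumns = header.split(",", -1).length;

      // One writer per column: outputDir/col1.gz ... outputDir/colN.gz
      PrintWriter[] writers = new PrintWriter[numColumns];
      for (int c = 0; c < numColumns; c++) {
        writers[c] = new PrintWriter(new GZIPOutputStream(
            Files.newOutputStream(outputDir.resolve("col" + (c + 1) + ".gz"))));
      }
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] fields = line.split(",", -1);
          for (int c = 0; c < numColumns; c++) {
            // Write one value per line; missing trailing fields become empty.
            writers[c].println(c < fields.length ? fields[c] : "");
          }
        }
      } finally {
        for (PrintWriter w : writers) {
          w.close();
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    split(Paths.get(args[0]), Paths.get(args[1]));
  }
}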


I would like to use this approach when writing MapReduce jobs in Hadoop. In order to do this
I think I would need to write an InputFormat, which I can look into. However, I want to
avoid the situation where a map task reads column files from different nodes. To avoid this
situation, all column files derived from the same CSV file must be co-located on the same
node (or nodes, if replication is enabled). So for my example I would like to ask HDFS to keep
all files in dir csv_file1 together on the same node(s). I would also do the same for dir
csv_file2. Does anyone know how to do this in Hadoop?

Thanks,

Keith

