hadoop-common-user mailing list archives

From "Richard K. Turner" <...@petersontechnology.com>
Subject RE: File Per Column in Hadoop
Date Tue, 11 Mar 2008 19:36:01 GMT
I have not seen that paper; I will have to take a look at it.  I have read a few papers on
this subject.  For anyone interested, I think the first two pages of "Column-Stores For Wide
and Sparse Data" by Daniel J. Abadi give an excellent overview of the pros and cons of
column-oriented datasets.


-----Original Message-----
From: Ashish Thusoo [mailto:athusoo@facebook.com]
Sent: Tue 3/11/2008 2:43 PM
To: core-user@hadoop.apache.org
Subject: RE: File Per Column in Hadoop
This is very interesting and very useful.

There was some work done in the database community on different block
organizations that boost cache and I/O performance, and essentially they
proposed a scheme similar to what you are talking about (although at the
database block level).

The link is below if you are interested; it may give some good insights
while you are doing this (in case you have not already seen


-----Original Message-----
From: Richard K. Turner [mailto:rkt@petersontechnology.com] 
Sent: Tuesday, March 11, 2008 9:34 AM
To: core-user@hadoop.apache.org
Subject: RE: File Per Column in Hadoop

This is somewhat similar to what I am doing.  I create a set of
compressed columns per input CSV file.  You are saying to take a fixed
number of rows and create compressed column blocks.  As long as you do
this with a large enough subset of rows, you will get a lot of the benefit
of compressing similar data.

In addition to compression, another benefit of a file per column is
drastically reduced I/O.  If I have 100 columns and I want to analyze 5,
then I do not even have to read, decompress, and throw away the other 95
columns.  This reduces I/O and CPU utilization and increases cache (CPU
and disk) utilization.  To get this benefit you would need metadata that
indicates where each column block starts within a record, so a reader can
seek to the beginning of the columns of interest.
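
To make the seeking concrete, here is a rough sketch of the kind of record
layout I have in mind: a small offset table in front of the compressed
column blocks.  The class and layout below are made up for illustration;
nothing like this exists in Hadoop today.

    import java.io.*;

    // Sketch of the per-record metadata idea: an offset table in front of the
    // compressed column blocks, so a reader can seek straight to the columns
    // it needs and skip the rest.  Layout and names are illustrative only.
    public class ColumnOffsetsSketch {

        // blocks[i] is the already-compressed block for column i of this record.
        static void writeRecord(DataOutputStream out, byte[][] blocks) throws IOException {
            out.writeInt(blocks.length);
            long offset = 0;
            for (byte[] b : blocks) {          // the offset table is the metadata
                out.writeLong(offset);
                out.writeInt(b.length);
                offset += b.length;
            }
            for (byte[] b : blocks) {
                out.write(b);
            }
        }

        // Read only column `col` from the record starting at `recordStart`.
        static byte[] readColumn(RandomAccessFile in, long recordStart, int col)
                throws IOException {
            in.seek(recordStart);
            int numColumns = in.readInt();
            in.seek(recordStart + 4 + (long) col * 12);   // 12 = 8-byte offset + 4-byte length
            long offset = in.readLong();
            int length = in.readInt();
            long dataStart = recordStart + 4 + (long) numColumns * 12;
            in.seek(dataStart + offset);                  // jump past the columns we do not need
            byte[] block = new byte[length];
            in.readFully(block);
            return block;
        }
    }

A reader that wants 5 of 100 columns then does one small seek per wanted
column block and never touches the other 95.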

I will look into creating another file format, like the SequenceFile, that
supports this structure, and an input format to go along with it.  The first
order of business will be to see if the input format can support

-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
Sent: Tue 3/11/2008 11:29 AM
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Subject: RE: File Per Column in Hadoop
It would be interesting to integrate knowledge of the columnar structure
with compression.  I wouldn't approach it as an InputFormat problem
(because of the near impossibility of colocating all these files), but
perhaps extend the compression libraries in Hadoop so that the library
understands the structured nature of the underlying dataset.

One would still store all the columns together in a single row, but each
block of a compressed SequenceFile would actually be stored as a set of
compressed sub-blocks, each representing one column.  This would give most
of the benefits of columnar compression (not all, because one would only be
compressing a block at a time) while still being transparent to MapReduce.
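
A rough sketch of the regrouping step (this does not touch Hadoop's codec
interfaces at all; it only illustrates buffering a block of rows, pivoting
them into columns, and deflating each column run on its own):

    import java.io.*;
    import java.util.zip.DeflaterOutputStream;

    // Sketch only: pivot one block of rows into per-column runs and compress
    // each run separately, so similar data gets compressed together.
    public class ColumnarBlockSketch {

        // rows[r][c] is the serialized value of column c in row r of this block.
        static byte[][] compressByColumn(byte[][][] rows, int numColumns) throws IOException {
            byte[][] compressed = new byte[numColumns][];
            for (int c = 0; c < numColumns; c++) {
                ByteArrayOutputStream run = new ByteArrayOutputStream();
                for (byte[][] row : rows) {
                    run.write(row[c]);       // concatenate all values of column c
                }
                compressed[c] = deflate(run.toByteArray());
            }
            return compressed;
        }

        static byte[] deflate(byte[] data) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DeflaterOutputStream out = new DeflaterOutputStream(buf);
            out.write(data);
            out.close();
            return buf.toByteArray();
        }
    }

Reading would do the reverse, inflating only the column runs a job actually
asks for.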

So, doable I would think, and very sexy, but I don't know how complex
(the compression code seems hairy, but that's probably just ignorance).
We would also love to get to this stage (we already have the metadata
with each file), but I think it would take us many months before we got

-----Original Message-----
From: Richard K. Turner [mailto:rkt@petersontechnology.com]
Sent: Mon 3/10/2008 11:01 AM
To: core-user@hadoop.apache.org
Subject: File Per Column in Hadoop

I have found that storing each column in its own gzip file can really
speed up processing time on arbitrary subsets of columns.  For example,
suppose I have two CSV files called csv_file1.gz and csv_file2.gz.  I
can create a file for each column as follows:

I would like to use this approach when writing MapReduce jobs in
Hadoop.  In order to do this I think I would need to write an input
format, which I can look into.  However, I want to avoid the situation
where a map task reads column files from different nodes.  To avoid
this, all column files derived from the same CSV file must be
co-located on the same node (or nodes, if replication is enabled).  So for
my example I would like to ask HDFS to keep all files in the csv_file1
directory together on the same node(s), and the same for the csv_file2
directory.  Does anyone know how to do this in Hadoop?


