hadoop-common-user mailing list archives

From "Richard K. Turner" <...@petersontechnology.com>
Subject RE: File Per Column in Hadoop
Date Tue, 11 Mar 2008 17:59:54 GMT

One other thing to add to the list:

3. overhead of parsing each row to extract the needed fields

It is good to know that the filesystem may read all of the data even if I seek.  I did not
know that.  I knew HDFS had CRCs, but I had not thought through the implications.

-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
Sent: Tue 3/11/2008 1:15 PM
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Subject: RE: File Per Column in Hadoop
 
pretty cool - i think this would be a great contrib.

on columnar access over row-organized data, breaking the downsides down more clearly, there is:
1. overhead of reading more data from disk into memory
2. overhead of decompressing extra data

with a sub-block per column - #2 is avoided. #1 is hard/impossible to avoid (the underlying
file system would probably read sequentially anyway - and besides there's checksum verification
in dfs itself that would require the entire data to be read)

but so far, from what i have seen, there is usually an excess of serial read bandwidth in
a hadoop cluster. to the extent that extra data reads cause hidden cpu cost (if they cause
memory bandwidth to max out) - this could be a concern. but - given other inefficiencies in
memory usage, including the language itself and the endless copies of data made in the io
path - i would speculate that (right now) #1 is not that big a concern for Hadoop.


-----Original Message-----
From: Richard K. Turner [mailto:rkt@petersontechnology.com]
Sent: Tue 3/11/2008 9:33 AM
To: core-user@hadoop.apache.org
Subject: RE: File Per Column in Hadoop
 
This is somewhat similar to what I am doing.  I create a set of compressed columns per input
CSV file.  You are saying to take a fixed number of rows and create compressed column blocks.
As long as you do this with a large enough row subset, you will get a lot of the benefit of
compressing similar data.

In addition to compression, another benefit of a file per column is drastically reduced I/O.
If I have 100 columns and I want to analyze 5, then I do not even have to read, decompress,
and throw away the other 95 columns.  This can drastically reduce I/O and CPU utilization
and increase cache (CPU and disk) utilization.  To get this benefit you would need metadata
that indicates where each column block starts within a record.  This metadata would allow
seeking to the beginning of the columns of interest.
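
Roughly what I have in mind, as a minimal sketch (the header-of-offsets record layout below
is hypothetical, not an existing Hadoop format):

import java.io.*;

// Sketch only: each record starts with a header of int offsets, one per column,
// giving the byte position of that column's block within the record body.
public class ColumnOffsetSketch {

    // Write one record: the offset header followed by the column blocks.
    static void writeRecord(DataOutput out, byte[][] columnBlocks) throws IOException {
        int offset = 0;
        for (byte[] block : columnBlocks) {    // header: one offset per column
            out.writeInt(offset);
            offset += block.length;
        }
        for (byte[] block : columnBlocks) {    // body: column blocks back to back
            out.write(block);
        }
    }

    // Read a single column of interest, skipping over the others.
    static byte[] readColumn(DataInputStream in, int numColumns, int wanted,
                             int bodyLength) throws IOException {
        int[] offsets = new int[numColumns];
        for (int i = 0; i < numColumns; i++) {
            offsets[i] = in.readInt();
        }
        int end = (wanted + 1 < numColumns) ? offsets[wanted + 1] : bodyLength;
        in.skipBytes(offsets[wanted]);            // seek past the leading columns
        byte[] block = new byte[end - offsets[wanted]];
        in.readFully(block);
        in.skipBytes(bodyLength - end);           // skip the trailing columns
        return block;
    }
}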

I will look into creating another file format, similar to the SequenceFile, that supports this
structure, along with an input format to go with it.  The first order of business will be to
see whether the input format can support seeking.

Keith

-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma@facebook.com]
Sent: Tue 3/11/2008 11:29 AM
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Subject: RE: File Per Column in Hadoop
 
it would be interesting to integrate knowledge of columnar structure with compression. i wouldn't
approach it as an inputformat problem (because of the near impossibility of colocating all
these files) - but perhaps extend the compression libraries in Hadoop - so that the library
understood the structured nature of the underlying dataset.

One would store all the columns together in a single row. But each block (of a compressed
sequencefile) would actually be stored as a set of compressed sub-blocks (each sub-block
representing a column). This would give most of the benefits of columnar compression (not
all - because one would only be compressing a block at a time) - while still being transparent
to mapreduce.
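
very roughly, something like this per block (sketch only - the class below is made up, nothing
in hadoop's compression libraries does this today):

import java.io.*;
import java.util.zip.GZIPOutputStream;

// Sketch only: for one block of rows, compress each column's values as a
// separate gzip sub-block, so a reader can decompress just the columns it needs.
public class ColumnarBlockSketch {

    // rows[i][j] is the value of column j in row i of this block.
    static byte[][] compressBlockByColumn(String[][] rows, int numColumns) throws IOException {
        byte[][] subBlocks = new byte[numColumns][];
        for (int col = 0; col < numColumns; col++) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            Writer w = new OutputStreamWriter(new GZIPOutputStream(buf), "UTF-8");
            for (String[] row : rows) {
                w.write(row[col]);     // similar values compress well together
                w.write('\n');
            }
            w.close();                 // finishes the gzip sub-block
            subBlocks[col] = buf.toByteArray();
        }
        return subBlocks;              // the block on disk = these sub-blocks plus a small index
    }
}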

So - doable i would think, and very sexy - but i don't know how complex it would be (the
compression code seems hairy - but that's probably just my ignorance). We would also love to
get to this stage (we already have the metadata with each file) - but i think it would take
us many months before we got there.

Joydeep



-----Original Message-----
From: Richard K. Turner [mailto:rkt@petersontechnology.com]
Sent: Mon 3/10/2008 11:01 AM
To: core-user@hadoop.apache.org
Subject: File Per Column in Hadoop
 

I have found that storing each column in its own gzip file can really speed up processing
time on arbitrary subsets of columns.  For example, suppose I have two CSV files called
csv_file1.gz and csv_file2.gz.  I can create a file for each column as follows:

   csv_file1/col1.gz
   csv_file1/col2.gz
   csv_file1/col3.gz
     .
     .
     .
   csv_file1/colN.gz
   csv_file2/col1.gz
   csv_file2/col2.gz
   csv_file2/col3.gz
     .
     .
     .
   csv_file2/colN.gz
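
The splitting step itself is simple; a minimal standalone sketch (assuming plain
comma-separated lines with no quoted fields) looks like this:

import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: split a gzipped CSV into one gzip file per column,
// e.g. csv_file1.gz -> csv_file1/col1.gz ... csv_file1/colN.gz
public class SplitCsvByColumn {
    public static void main(String[] args) throws IOException {
        File input = new File(args[0]);      // e.g. csv_file1.gz
        File outDir = new File(args[1]);     // e.g. csv_file1
        outDir.mkdirs();

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(input)), "UTF-8"));

        String line = in.readLine();
        if (line == null) { in.close(); return; }
        int numCols = line.split(",", -1).length;

        // one gzip writer per column
        PrintWriter[] out = new PrintWriter[numCols];
        for (int i = 0; i < numCols; i++) {
            out[i] = new PrintWriter(new OutputStreamWriter(new GZIPOutputStream(
                    new FileOutputStream(new File(outDir, "col" + (i + 1) + ".gz"))), "UTF-8"));
        }

        while (line != null) {
            String[] fields = line.split(",", -1);
            for (int i = 0; i < numCols; i++) {
                out[i].println(i < fields.length ? fields[i] : "");
            }
            line = in.readLine();
        }

        in.close();
        for (PrintWriter w : out) w.close();   // finishes each gzip stream
    }
}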


I would like to use this approach when writing map reduce jobs in Hadoop.  In order to do this
I think I would need to write an input format, which I can look into.  However, I want to
avoid the situation where a map task reads column files from different nodes.  To avoid this
situation, all column files derived from the same CSV file must be co-located on the same
node (or nodes if replication is enabled).  So for my example I would like to ask HDFS to keep
all files in dir csv_file1 together on the same node(s).  I would also do the same for dir
csv_file2.  Does anyone know how to do this in Hadoop?

Thanks,

Keith




