hadoop-mapreduce-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: partition file by content based through HDFS
Date Sun, 11 May 2014 21:42:59 GMT
To second Mirko: HDFS isn’t concerned with content or formats.  Asking for that would be
analogous to asking for specific content to end up on specific disk sectors in a normal file.
If you want to partition data by content, use MapReduce, Pig, Hive, etc. to segregate the data
into files, perhaps naming the files to indicate the key split.
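
As a minimal local sketch of that idea (plain Python, not a MapReduce, Pig or Hive job), the
following splits a file holding one integer per line into two output files named after the
key split. The function name split_by_parity and the file names part-even.txt and
part-odd.txt are hypothetical, chosen only for illustration.

```python
import os

def split_by_parity(input_path, output_dir):
    """Write even integers to part-even.txt and odd ones to part-odd.txt.

    Hypothetical helper: assumes one integer per line; blank lines are skipped.
    Returns the two output paths as (even_path, odd_path).
    """
    os.makedirs(output_dir, exist_ok=True)
    even_path = os.path.join(output_dir, "part-even.txt")
    odd_path = os.path.join(output_dir, "part-odd.txt")
    with open(input_path) as src, \
         open(even_path, "w") as even, \
         open(odd_path, "w") as odd:
        for line in src:
            line = line.strip()
            if not line:
                continue
            # Route each record by its content, not by its position in the file
            target = even if int(line) % 2 == 0 else odd
            target.write(line + "\n")
    return even_path, odd_path
```

The two resulting files could then be uploaded to HDFS separately, for example with
hadoop fs -put, so that each file (rather than each block) carries one side of the split.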

But this raises the question: why?  MapReduce has built-in support for partitioning data on
the fly on the map side (map output is partitioned by key before it reaches the reducers), so
you don’t really need to do anything.  Is that too slow for your needs?
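
For reference, a rough Python sketch of how that built-in partitioning assigns a key to a
reducer. Hadoop's default HashPartitioner computes
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; the sketch assumes 32-bit integer keys
whose hash code is the value itself (as with IntWritable). The names hash_partition and
group_by_partition are illustrative and not part of any Hadoop API.

```python
from collections import defaultdict

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE

def hash_partition(key, num_reducers):
    """Partition index for a 32-bit integer key, assuming the key hashes to itself."""
    # Masking with INT_MAX clears the sign bit, so the result is never negative
    return (key & INT_MAX) % num_reducers

def group_by_partition(keys, num_reducers):
    """Group keys by the partition (reducer) index they would be sent to."""
    groups = defaultdict(list)
    for key in keys:
        groups[hash_partition(key, num_reducers)].append(key)
    return dict(groups)
```

Under these assumptions, with two reducers the even keys fall into partition 0 and the odd
keys into partition 1, so the even/odd split comes out per reducer output file rather than
per HDFS block.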


From: Mirko Kämpf [mailto:mirko.kaempf@gmail.com]
Sent: Sunday, May 11, 2014 2:54 PM
To: user@hadoop.apache.org
Subject: Re: partition file by content based through HDFS


HDFS blocks are not "content aware". A separation like the one you requested could be done via
Hive or Pig with a few lines of code. You would then have multiple files, which can be organized
into partitions as well, but such partitions exist at a different abstraction level: not in
blocks, but within Hive tables.

Best wishes,

2014-05-11 14:41 GMT+01:00 Karim Awara <karim.awara@kaust.edu.sa>:
When a user uploads a file from the local disk to HDFS, can I make it partition the file into
blocks based on its content?  For example, if I have a file with one integer column, can I say
that I want a given HDFS block to hold only the even numbers?

Best Regards,
Karim Ahmed Awara

