hadoop-common-user mailing list archives

From Jason Venner <ja...@attributor.com>
Subject Re: Query against different data types within HDFS using Map/Reduce
Date Mon, 05 May 2008 16:41:26 GMT
We do this all the time.
In one case we have the mapper work out the input type by examining the
input file name and the record data. We tend to do this for textual
key TAB value records.
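
A minimal sketch of that first approach, in plain Java so it runs without Hadoop on the classpath. The class name `InputTypeSniffer`, the `Kind` enum, the `.xml` suffix check and the "starts with `<`" test are all illustrative assumptions, not what our jobs actually use. In a real mapper the file name would come from the job configuration (in Hadoop of this era, typically the `map.input.file` property); that lookup is left out here.

```java
import java.util.Locale;

/** Hypothetical helper: guesses a record's type from file name and content. */
final class InputTypeSniffer {
    /** Illustrative record kinds; the real set depends on your data. */
    enum Kind { KEY_TAB_VALUE, XML, UNKNOWN }

    static Kind detect(String fileName, String record) {
        String name = fileName.toLowerCase(Locale.ROOT);
        // The file name is checked first because it is the cheapest signal.
        if (name.endsWith(".xml")) {
            return Kind.XML;
        }
        // Otherwise fall back to looking at the record data itself.
        if (record.trim().startsWith("<")) {
            return Kind.XML;
        }
        // A TAB after at least one character suggests a key TAB value line.
        if (record.indexOf('\t') > 0) {
            return Kind.KEY_TAB_VALUE;
        }
        return Kind.UNKNOWN;
    }

    /** Splits a key TAB value record at the first TAB only. */
    static String[] splitKeyValue(String record) {
        int tab = record.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no TAB in record");
        }
        return new String[] { record.substring(0, tab), record.substring(tab + 1) };
    }
}
```

The mapper would call `detect` once per record (or once per split, if the file name alone is enough) and branch on the result before parsing.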

In another case we have a container object that can hold any writable,
that we pass around. We do this for records containing binary data that is
too large to bother base64 encoding, or where a single reduce explicitly
has to handle multiple data types and we can't readily tell what the data
type is.
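
A rough sketch of the container idea, again in plain Java. `Payload` is a stand-in for Hadoop's `Writable` interface so the example compiles on its own; `TextPayload`, `BytesPayload`, `TaggedContainer` and the one-byte tag scheme are hypothetical names and choices for illustration, not our production classes. Hadoop's `GenericWritable` and `ObjectWritable` follow a similar tag-then-delegate pattern and may be a better starting point.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Stand-in for Hadoop's Writable (assumption: same two-method contract). */
interface Payload {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

final class TextPayload implements Payload {
    String text = "";
    public void write(DataOutput out) throws IOException { out.writeUTF(text); }
    public void readFields(DataInput in) throws IOException { text = in.readUTF(); }
}

final class BytesPayload implements Payload {
    byte[] bytes = new byte[0];
    public void write(DataOutput out) throws IOException {
        out.writeInt(bytes.length); // length prefix, then the raw bytes
        out.write(bytes);
    }
    public void readFields(DataInput in) throws IOException {
        bytes = new byte[in.readInt()];
        in.readFully(bytes);
    }
}

/** Writes a one-byte type tag, then delegates to the wrapped payload. */
final class TaggedContainer implements Payload {
    private static final byte TEXT = 0;
    private static final byte BYTES = 1;
    private Payload value;

    void set(Payload value) { this.value = value; }
    Payload get() { return value; }

    public void write(DataOutput out) throws IOException {
        if (value instanceof TextPayload) out.writeByte(TEXT);
        else if (value instanceof BytesPayload) out.writeByte(BYTES);
        else throw new IOException("unsupported payload type");
        value.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        byte tag = in.readByte();
        // The tag tells the reader which concrete type to instantiate.
        if (tag == TEXT) value = new TextPayload();
        else if (tag == BYTES) value = new BytesPayload();
        else throw new IOException("unknown tag " + tag);
        value.readFields(in);
    }

    /** Test helper: serializes a container to a byte array. */
    static byte[] toBytes(TaggedContainer c) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            c.write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: rebuilds a container from serialized bytes. */
    static TaggedContainer fromBytes(byte[] data) {
        try {
            TaggedContainer c = new TaggedContainer();
            c.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The reducer then reads the container, checks the type of `get()`, and dispatches accordingly, so binary payloads never have to be encoded as text.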

Ted Dunning wrote:
> You just have to write an adapted input format that reads multiple kinds of input.
> It can key off the contents of the file or the name.  Depending on names is bad, but
> has a long lineage so people tend to deal with it reasonably well.
> It isn't very hard to write.
> -----Original Message-----
> From: Kayla Jay [mailto:kaylais30@yahoo.com]
> Sent: Mon 5/5/2008 6:18 AM
> To: core-user@hadoop.apache.org
> Subject: Query against different data types within HDFS using Map/Reduce
> Has anyone come across this scenario and if not, does anyone have any suggestions?
> What if you store different types of data within HDFS.  You store XML, text, binary,
> sequence files, etc.  You now want to run a query against ALL of the data stored within HDFS
> via a map/reduce job.  How do you do this if the data input is different types?
> For example, (simplest), you want to find all the terms/words matching a pattern and
> count and return where they are within each data source.  Even the example of word count could
> be an example but given that not all data is textual line-by-line.  The terms/words could
> be contained within XML or against a sequence file or some other format that is stored in
> your HDFS.  What if you want to find those terms/words against ALL data sets that may not
> be same format stored within HDFS.
> I understand that your Map/Reduce jobs specify a specific input format upfront, however,
> if you have different data formats within HDFS and you want to run the exact query against
> all formats within 1 map/reduce job, how is this even possible?
> Can you even run a single query in a single map/reduce job against all the data across
> HDFS that is in different formats?
> If not, any suggestions on how to handle this?
> Thanks.
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
