incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <>
Subject [jira] [Commented] (BLUR-18) Rework the MapReduce Library to implement Input/OutputFormats
Date Sun, 04 Nov 2012 02:27:12 GMT


Aaron McCurry commented on BLUR-18:

I have created a remote branch, 0.2-dev-mr-formats.  I also think that we need to create
some new Writable types for the InputFormat.  I'm thinking of DocLocation (containing the shard
index and the document id) as the key, and a Document Writable object carrying the Thrift
Document data as the value.  From there we can work on the InputSplits.  I know I have gone
back and forth on this, but I think that we need to make one split per shard, not one per
server.  My reasoning is that if a shard server fails during a MapReduce job, it will be
easier to rerun each shard than to rerun each server, because the shards from the down
shard server will be spread evenly across the remaining shard servers in the cluster.
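
To make the key type a bit more concrete, here is a rough sketch of what DocLocation might look like. Nothing here is settled: the field names, the use of int for both fields, and the ordering (by shard, then by document id) are my assumptions for illustration. To keep the sketch runnable without Hadoop on the classpath it does not implement the Hadoop interface; in the real branch it would presumably implement org.apache.hadoop.io.WritableComparable, whose write/readFields signatures it mirrors.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical sketch of the proposed InputFormat key. In an actual Hadoop job
// this would implement org.apache.hadoop.io.WritableComparable<DocLocation>;
// write() and readFields() below follow that contract.
final class DocLocation implements Comparable<DocLocation> {
    private int shardIndex;  // assumed: index of the shard holding the document
    private int documentId;  // assumed: document id within that shard

    // No-arg constructor so an instance can be created before readFields().
    DocLocation() {
    }

    DocLocation(int shardIndex, int documentId) {
        this.shardIndex = shardIndex;
        this.documentId = documentId;
    }

    int getShardIndex() {
        return shardIndex;
    }

    int getDocumentId() {
        return documentId;
    }

    // Serialize both fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(shardIndex);
        out.writeInt(documentId);
    }

    // Deserialize in the same order used by write().
    public void readFields(DataInput in) throws IOException {
        shardIndex = in.readInt();
        documentId = in.readInt();
    }

    // Order by shard first, then by document id within the shard.
    @Override
    public int compareTo(DocLocation other) {
        int byShard = Integer.compare(shardIndex, other.shardIndex);
        return byShard != 0 ? byShard : Integer.compare(documentId, other.documentId);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DocLocation)) {
            return false;
        }
        DocLocation other = (DocLocation) o;
        return shardIndex == other.shardIndex && documentId == other.documentId;
    }

    @Override
    public int hashCode() {
        return 31 * shardIndex + documentId;
    }
}
```

Keeping the shard index in the key fits the one-split-per-shard idea, since every record produced by a split would share the same shard index.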

I should have some more time tomorrow to discuss and rework/implement/review.  Thanks for
the good start!
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>                 Key: BLUR-18
>                 URL:
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
>             Fix For: 0.2.0
>         Attachments: 0001-BLUR-ID-18-Created-New-Version-of-Files.patch
> Currently the only way to implement indexing is to use the BlurReducer.  A better way
> to implement this would be to support Hadoop Input/OutputFormats in both the new and old APIs.
> This would allow easier integration with other Hadoop projects such as Hive and Pig.
