incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <>
Subject [jira] [Commented] (BLUR-18) Rework the MapReduce Library to implement Input/OutputFormats
Date Fri, 12 Oct 2012 02:51:02 GMT


Aaron McCurry commented on BLUR-18:

So I would approach the InputFormat and the OutputFormat as separate issues; perhaps we should
create two sub-tasks, one for each.

How the InputFormat works is really up for debate.  The easiest implementation would be a
simple Lucene document to BlurRecord/Row converter that opens each shard of the table
in a separate mapper and just reads through the index in a brute-force scan.  This approach
has a few problems.  The first is that if the index is being updated by the shard servers, the
segment files need to be protected/held so that they are not deleted out from underneath
the mapper.  The second is that a brute-force scan doesn't really allow
Blur/Lucene queries to be executed against the index without opening the index in the mapper
for querying, and the problem with opening it in the mapper is that there typically isn't enough
extra memory in the mapper to have an effective block cache for any kind of performance.
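To make the brute-force idea concrete, here is a minimal sketch of the split math, with one split per shard directory.  All names here (ShardSplitSketch, the shard directory layout) are hypothetical; a real implementation would extend org.apache.hadoop.mapreduce.InputFormat and hand each mapper a read-only view of one Lucene shard index:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one input "split" per shard directory, each scanned
// end to end by its own mapper. A real implementation would extend
// org.apache.hadoop.mapreduce.InputFormat and return real split objects;
// this only models how shards map to splits.
public class ShardSplitSketch {

    static List<String> splitsForTable(String tableDir, int shardCount) {
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            // Each mapper would open this shard's Lucene index read-only
            // and iterate every document in a brute-force scan, converting
            // each Lucene document into a BlurRecord/Row.
            splits.add(tableDir + "/shard-" + String.format("%08d", i));
        }
        return splits;
    }

    public static void main(String[] args) {
        for (String split : splitsForTable("/blur/tables/test", 3)) {
            System.out.println(split);
        }
    }
}
```

Note that this sketch ignores the segment-protection problem described above: the real RecordReader would also need the shard server to hold the segment files it is reading.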

An alternate solution is to build the InputFormat against the shard server Thrift API, but
I don't think it can handle iterating over large amounts of Blur records.  My suggestion
is to put off the InputFormat until the new-api-prototype is in place, or we can start integrating
into that branch now.  My reasoning is that the Thrift API and the new server are
designed to iterate over the entire result set.  I'm getting pretty good performance with
it right now, but it's not set up to be distributed yet.  We can work on that together if you
would like.
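To illustrate why bulk iteration through a Thrift-style API is awkward for an InputFormat, here is a small sketch of client-side paging.  The names (fetchPage, scanAll) are invented and are not the real Blur Thrift API; the point is that a full-table scan decomposes into many paged round trips:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: paging through a remote result set. Every page is
// one round trip to the shard server, so scanning a large table through
// a Thrift-style API costs (records / pageSize) calls.
public class ThriftPagingSketch {

    // Stand-in for a remote shard server call that returns one page.
    static List<String> fetchPage(List<String> all, int start, int fetch) {
        int end = Math.min(start + fetch, all.size());
        return new ArrayList<>(all.subList(start, end));
    }

    // Returns the number of round trips needed to see every record.
    static int scanAll(List<String> all, int pageSize) {
        int start = 0;
        int calls = 0;
        while (true) {
            List<String> page = fetchPage(all, start, pageSize);
            if (page.isEmpty()) {
                break;
            }
            start += page.size();
            calls++; // each page is one remote call
        }
        return calls;
    }
}
```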

For the OutputFormat, porting the functionality in the BlurReducer to run in the OutputFormat
should be fairly straightforward.
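A rough sketch of what that port might look like: the BlurReducer indexes a row at a time, so a RecordWriter doing the same job would buffer the records for a row id and flush the whole row as one unit.  The class and method names below are hypothetical, not Blur's actual API; a real implementation would extend org.apache.hadoop.mapreduce.OutputFormat and write Lucene documents:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the BlurReducer-style write path moved into a
// RecordWriter: records sharing a row id are buffered, then each row is
// committed as a single unit, mirroring the reducer's row-at-a-time
// indexing. A real implementation would extend
// org.apache.hadoop.mapreduce.OutputFormat and build Lucene documents.
public class RowBufferingWriterSketch {

    private final Map<String, List<String>> pending = new LinkedHashMap<>();
    private final List<String> committedRows = new ArrayList<>();

    // Called once per record coming out of the job.
    void write(String rowId, String record) {
        pending.computeIfAbsent(rowId, k -> new ArrayList<>()).add(record);
    }

    // Called from close(): flush every buffered row as one unit.
    void close() {
        for (Map.Entry<String, List<String>> row : pending.entrySet()) {
            committedRows.add(row.getKey() + "=" + row.getValue());
        }
        pending.clear();
    }

    List<String> committedRows() {
        return committedRows;
    }
}
```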

Now that I think about it more, it might be prudent to go ahead and start working on both the
Input and Output Formats in the new-api-branch instead of trying to get them working against the
0.1.x API.  The API and data structures are so much simpler in the new-api-branch.

What do you think?
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>                 Key: BLUR-18
>                 URL:
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
> Currently the only way to implement indexing is to use the BlurReducer.  A better way
> to implement this would be to support Hadoop input/output formats in both the new and old APIs.
> This would allow easier integration with other Hadoop projects such as Hive and Pig.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
