incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-18) Rework the MapReduce Library to implement Input/OutputFormats
Date Tue, 06 Nov 2012 02:36:12 GMT

    [ https://issues.apache.org/jira/browse/BLUR-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491137#comment-13491137 ]

Aaron McCurry commented on BLUR-18:
-----------------------------------

I don't think that BlurDocLocation needs to be Comparable (or to implement a RawComparator).
It's really only there to let you know which document you are reading at that moment.  The
reasoning: when each InputSplit opens a session on a given shard server, the act of opening
the session creates a temporary snapshot of the indexes.  This snapshot guarantees that the
document ids will not change while the session is open.  So the document location is just the
shard index for the given table plus the internal Lucene document id.  After the session is
closed, or if another session is created before or after the InputSplit creates its session,
the internal Lucene document ids may have changed.  This is due to near-real-time updates
and/or merges that have taken effect.

In any event, the document location is only valid while the session is open.  I just had a
thought: should we take the two integers (shard index + Lucene internal document id) and
combine them into a single long value?  Then we could just use LongWritable, and I could
simplify the Thrift API to represent this as well.  What do you guys think?
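
To make the single-long idea concrete, here is a minimal sketch of one way the two
integers could be packed and unpacked.  The class and method names (DocLocationCodec,
pack, shardIndex, docId) are made up for illustration and are not part of Blur, and
putting the shard index in the high 32 bits is an assumption, not a decided layout.

```java
// Hypothetical helper, not part of Blur: combines a shard index and a Lucene
// internal document id into one long, and splits them apart again.
final class DocLocationCodec {
    private DocLocationCodec() {}

    // Assumed layout: shard index in the high 32 bits, document id in the low 32 bits.
    static long pack(int shardIndex, int docId) {
        // Mask the doc id so sign extension cannot overwrite the shard bits.
        return ((long) shardIndex << 32) | (docId & 0xFFFFFFFFL);
    }

    // Recover the shard index from the high 32 bits.
    static int shardIndex(long location) {
        return (int) (location >>> 32);
    }

    // Recover the document id from the low 32 bits.
    static int docId(long location) {
        return (int) location;
    }

    public static void main(String[] args) {
        long location = pack(3, 42);
        // The packed value could then be wrapped in a Hadoop LongWritable.
        System.out.println("shard=" + shardIndex(location) + " doc=" + docId(location));
    }
}
```

With this layout, and with both values non-negative, the natural ordering of the long
sorts by shard index first and document id second.  As with the two-integer form, the
value is only meaningful while the session that produced it is open.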
                
> Rework the MapReduce Library to implement Input/OutputFormats
> -------------------------------------------------------------
>
>                 Key: BLUR-18
>                 URL: https://issues.apache.org/jira/browse/BLUR-18
>             Project: Apache Blur
>          Issue Type: Improvement
>            Reporter: Aaron McCurry
>             Fix For: 0.2.0
>
>         Attachments: 0001-BLUR-ID-18-Created-New-Version-of-Files.patch, 0001-BLUR-ID-18-New-Writables.patch
>
>
> Currently the only way to implement indexing is to use the BlurReducer.  A better way
> to implement this would be to support Hadoop Input/OutputFormats in both the new and old
> APIs.  This would allow easier integration with other Hadoop projects such as Hive and Pig.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
