incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Sat, 12 Oct 2013 17:09:42 GMT

    [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793423#comment-13793423 ]

Aaron McCurry commented on BLUR-220:
------------------------------------

I have attached some prototypes for doing query time joins.  Basic results are as follows:

30,000 Documents
Small query 0.455 ms
Large query 8.119 ms

300,000 Documents
Small query 0.547 ms
Large query 92.168 ms

3,000,000 Documents
Small query 1.428 ms
Large query 3167.428 ms

30,000,000 Documents
Small query 1.698 ms
Large query 64137.19 ms
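
To make the scaling easier to see, the large-query timings above can be turned into growth factors and a per-document cost. This is only a small sketch over the four reported numbers; the class and method names are made up for illustration and are not part of the attached prototypes.

```java
// Growth of the "large query" latency relative to index size, using only the
// timings reported above. QueryScaling and its methods are illustrative names.
final class QueryScaling {
    static final long[] DOCS = {30_000L, 300_000L, 3_000_000L, 30_000_000L};
    static final double[] LARGE_MS = {8.119, 92.168, 3167.428, 64137.19};

    /** Latency ratio between step i and step i-1 (each step has 10x more documents). */
    static double growthFactor(int i) {
        return LARGE_MS[i] / LARGE_MS[i - 1];
    }

    /** Average large-query cost per document, in microseconds, at step i. */
    static double microsPerDoc(int i) {
        return LARGE_MS[i] * 1000.0 / DOCS[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < DOCS.length; i++) {
            String growth = (i == 0) ? "n/a" : String.format("%.1fx", growthFactor(i));
            System.out.printf("%,d docs: %.3f us/doc, growth %s%n",
                    DOCS[i], microsPerDoc(i), growth);
        }
    }
}
```

Each step is 10x more documents, but the large query slows by roughly 11x, 34x and 20x, so the cost per document rises from about 0.27 to about 2.1 microseconds.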

As I expected, the large query, which is basically a hit on every document, slows down dramatically
as the number of documents in the index increases.  So if we are to move forward with this approach,
we will need to search different segments in different ways.  If we search segments created from
NRT updates with this approach and search merged segments with the existing approach, then we
should get performance pretty close to what it is today, with the benefit of not having to reindex
the whole row for every record mutation.
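
A minimal sketch of the per-segment choice described above, in plain Java. The `Segment` type, the `rowsColocated` flag and the strategy names are hypothetical and only stand in for whatever Blur would actually track per segment; nothing here is a Lucene or Blur API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Picks a search strategy per segment: query-time join for segments written by
// NRT updates, the existing colocated-row search for merged segments.
final class SegmentStrategySketch {
    enum Strategy { QUERY_TIME_JOIN, COLOCATED_ROW }

    /** Hypothetical segment descriptor; not a Lucene or Blur type. */
    static final class Segment {
        final String name;
        final boolean rowsColocated; // assumed true once a merge has grouped each row's records

        Segment(String name, boolean rowsColocated) {
            this.name = name;
            this.rowsColocated = rowsColocated;
        }
    }

    static Strategy choose(Segment segment) {
        return segment.rowsColocated ? Strategy.COLOCATED_ROW : Strategy.QUERY_TIME_JOIN;
    }

    /** Returns one "segment -> strategy" line per segment, in input order. */
    static List<String> plan(List<Segment> segments) {
        List<String> lines = new ArrayList<>();
        for (Segment segment : segments) {
            lines.add(segment.name + " -> " + choose(segment));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment("merged_0", true),
                new Segment("nrt_1", false));
        for (String line : plan(segments)) {
            System.out.println(line);
        }
    }
}
```

The idea is that the expensive join path only ever runs over the small, recently written segments, while the bulk of the index keeps using the current search path.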

This approach has 2 main problems to be solved.

The first is the ability to do merges and colocate the records for a given row during the
merge.  This will likely require a custom SortingMergePolicy.
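
The effect such a merge would need to have can be sketched without Lucene: take the records from the segments being merged and order them so that all records of a row end up adjacent. The code below is a plain-Java illustration only. It does not use or imitate the SortingMergePolicy API, and the `Record` type with its `rowId` and `recordId` fields is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrates colocating a row's records when segments are merged.
final class ColocateOnMergeSketch {
    /** Hypothetical record: one document belonging to a row. */
    static final class Record {
        final String rowId;
        final String recordId;

        Record(String rowId, String recordId) {
            this.rowId = rowId;
            this.recordId = recordId;
        }
    }

    /** Concatenates the segments, then stable-sorts by rowId so each row's records are adjacent. */
    static List<Record> merge(List<List<Record>> segments) {
        List<Record> merged = new ArrayList<>();
        for (List<Record> segment : segments) {
            merged.addAll(segment);
        }
        merged.sort(Comparator.comparing((Record r) -> r.rowId));
        return merged;
    }

    /** True if every rowId occupies exactly one contiguous run of the list. */
    static boolean isColocated(List<Record> records) {
        Set<String> finishedOrCurrent = new HashSet<>();
        String previous = null;
        for (Record record : records) {
            if (!record.rowId.equals(previous)) {
                if (!finishedOrCurrent.add(record.rowId)) {
                    return false; // this row already appeared in an earlier run
                }
                previous = record.rowId;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Record> a = Arrays.asList(new Record("row1", "a"), new Record("row2", "b"));
        List<Record> b = Arrays.asList(new Record("row1", "c"));
        System.out.println(isColocated(merge(Arrays.asList(a, b))));
    }
}
```

Because the sort is stable, records of the same row keep their relative order from the input segments.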

The second is the ability to split the logical query into 2 different queries based on the
segment and still get the right answer based on a mixed approach.  This will require some
custom query logic that will be based on the existing SuperQuery object and the lucene-join
project.
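
One way to picture the "right answer from a mixed approach" requirement, under some loudly stated assumptions: a row-level query has several clauses, different records of the same row may satisfy different clauses, and a row's records may sit in both merged and NRT segments. Under those assumptions, each segment (whichever strategy searched it) can report which clauses it satisfied per row, and the final decision is made only after combining those partial results. The bitmask representation and all names below are hypothetical; this is not the SuperQuery or lucene-join API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Combines per-segment partial matches into a row-level answer.
final class MixedSegmentQuerySketch {
    /** Merges per-segment results: rowId -> bitmask of clauses satisfied in that segment. */
    static Map<String, Integer> combine(List<Map<String, Integer>> perSegment) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map<String, Integer> partial : perSegment) {
            for (Map.Entry<String, Integer> entry : partial.entrySet()) {
                combined.merge(entry.getKey(), entry.getValue(), (a, b) -> a | b);
            }
        }
        return combined;
    }

    /** Rows for which every one of the clauseCount clauses was satisfied in some segment. */
    static Set<String> matchingRows(List<Map<String, Integer>> perSegment, int clauseCount) {
        int allClauses = (1 << clauseCount) - 1;
        Set<String> rows = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : combine(perSegment).entrySet()) {
            if ((entry.getValue() & allClauses) == allClauses) {
                rows.add(entry.getKey());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> mergedSegment = new HashMap<>();
        mergedSegment.put("row1", 0b01); // clause 0 matched by a record in the merged segment
        Map<String, Integer> nrtSegment = new HashMap<>();
        nrtSegment.put("row1", 0b10);    // clause 1 matched by a newer record in an NRT segment
        nrtSegment.put("row2", 0b01);
        System.out.println(matchingRows(Arrays.asList(mergedSegment, nrtSegment), 2));
    }
}
```

In this sketch neither segment alone would report row1 as a match; it only qualifies once the partial results from both kinds of segment are combined.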

This will be fairly complex, but if it's solved this will resolve one of the biggest performance
issues in Blur to date.

Aaron

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: CreateIndex.java, test_results.txt, TestSearch.java
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically the number of Records.
> Currently, updates in Lucene are performed by deleting the document and re-adding it to the
> index.  Unfortunately, when any update is performed on a Row in Blur, the entire Row has to be
> re-read (if the RowMutationType is UPDATE_ROW), whatever modifications are needed are made,
> and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row.  It may
> vary based on the kind of hardware being used, but as the Row grows in size, the indexing
> (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
