incubator-blur-dev mailing list archives

From "Ravikumar (JIRA)" <>
Subject [jira] [Updated] (BLUR-220) Support for humongous Rows
Date Tue, 15 Oct 2013 12:42:42 GMT


Ravikumar updated BLUR-220:

    Attachment: Blur_Query_Perf_Chart1.pdf

I have modified the test case a little bit.

We have a table of results that I have attached along with the test-cases. 

I have assumed that IDs in this test-case correspond to RowIds of Blur.

"Unsorted" --> Records are scattered across segments
"Optimize" --> Every index is optimized into one single segment, so all data is present in
that segment
"Sort" --> Uses SortMergePolicy to locate IDs together in some of the segments
"SortEarlyTerm" --> Same as "Sort", but searches early-terminate on already sorted segments

What do the results show?

1. "SortEarlyTerm" is quite powerful when the number of rowIds is
small (<= 10K).

2. As the number of rowIds increases, the optimized single segment outperforms
everything else, which is understandable.

3. There is a slight difference in results between early-terminated queries and queries run
to completion on sorted segments. I guess this is open to interpretation.

There are 2 issues with early termination to keep in mind.

1. It can only be done by throwing an exception per segment. This is way too ugly and may
be a tad costly as well.
2. Not all docs of a row are examined. Hence per-row scoring is wrong.
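
To make the two issues concrete, here is a minimal sketch in Python. It is not the attached test case and uses no Lucene or Blur API. Everything in it is an assumption made for illustration: the SegmentTerminated exception, the collect function, the per-segment hit limit, and the idea that a row's score is the sum of its matching docs' scores.

```python
class SegmentTerminated(Exception):
    """Hypothetical signal that the rest of the current segment can be skipped."""


def collect(segments, limit=None):
    """Sum matching-doc scores per row id.

    segments: list of (is_sorted, hits), where hits is a list of
    (row_id, score) pairs for docs that match the query.
    limit: max hits to collect from each sorted segment; None disables
    early termination.
    Returns (row_scores, hits_examined).
    """
    row_scores = {}
    examined = 0
    for is_sorted, hits in segments:
        collected = 0
        try:
            for row_id, score in hits:
                if limit is not None and is_sorted and collected >= limit:
                    # Issue 1: the only way out of the segment is an exception.
                    raise SegmentTerminated()
                examined += 1
                collected += 1
                row_scores[row_id] = row_scores.get(row_id, 0.0) + score
        except SegmentTerminated:
            pass  # move on to the next segment
    return row_scores, examined


segments = [
    (True, [("row1", 1.0), ("row1", 2.0), ("row1", 4.0)]),  # sorted segment
    (False, [("row2", 1.0), ("row1", 0.5)]),                # unsorted segment
]
full_scores, full_examined = collect(segments)
early_scores, early_examined = collect(segments, limit=2)
```

With early termination the sorted segment stops after two hits, so fewer docs are examined, but the third row1 doc never contributes to the row's score (issue 2). In plain Python a break would do the same job; the exception is used only to mirror the constraint described in issue 1.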

> Support for humongous Rows
> --------------------------
>                 Key: BLUR-220
>                 URL:
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>         Attachments: Blur_Query_Perf_Chart1.pdf,,,,, test_results.txt,,
> One of the limitations of Blur is the size of the Rows stored, specifically the number of Records.
> Currently, updates in Lucene are performed by deleting the document and re-adding it to
> the index. Unfortunately, when any update is performed on a Row in Blur, the entire Row has
> to be re-read (if the RowMutationType is UPDATE_ROW), the needed modifications are
> made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row.
> It may vary based on the kind of hardware being used, but as the Row grows in size, indexing
> (mutations) against that Row will slow down.
> This issue is being created to discuss techniques for dealing with this problem.
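
The cost pattern described in the issue can be sketched as follows. This is a toy in-memory model, not Blur's or Lucene's actual code path: the dict-based index and the update_row function are hypothetical names introduced only to show that changing one Record rewrites the whole Row.

```python
def update_row(index, row_id, record_id, new_value):
    """Apply one record change by rewriting the whole row.

    index maps row_id -> {record_id: value}. Returns the number of
    records that had to be re-indexed for this single change.
    """
    row = dict(index.get(row_id, {}))  # re-read the entire row
    row[record_id] = new_value         # apply the modification
    index.pop(row_id, None)            # delete the old documents
    index[row_id] = row                # re-add every record
    return len(row)


index = {
    "small": {i: "v" for i in range(10)},
    "large": {i: "v" for i in range(100_000)},
}
small_cost = update_row(index, "small", 0, "changed")
large_cost = update_row(index, "large", 0, "changed")
```

In this model a single-record change costs as many re-indexed records as the Row contains, which is why mutations slow down as a Row grows.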

This message was sent by Atlassian JIRA
