incubator-blur-dev mailing list archives

From "Ravikumar (JIRA)" <>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Thu, 17 Oct 2013 07:41:42 GMT


Ravikumar commented on BLUR-220:

I have two basic questions.


                  Typically, in the NoSQL world a row-query is always by a rowId. But I gather
from this link [] that a row-query in Blur actually means a query across rowIds.

                  In our system, we never query anything without the rowId, as rowId=userId.
It may be possible to have multiple rowIds in the query in some rare cases, but there is
never a query without one. That is why, in the test cases I submitted, all queries have a
RowID ["id" field], whereas your test cases do not. Am I correct in this understanding?

                  For a system like ours, it should still be fine to scatter documents across
segments, as RowID filter-caches will be readily available and the rest is left to Lucene.
Online indexing is so heavy that re-indexing even once is a major exercise for us. The current
approach of continuous re-indexing is definitely unviable, at least for us.
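To make the above concrete, here is a toy sketch (pure Python, hypothetical names, not Blur's actual API or code path) of a rowId-constrained query over records scattered across segments: the rowId filter narrows each segment to the row's records first, then the actual match is applied within that subset.

```python
# Toy model (not Blur's API): a "segment" is just a list of record dicts.
# A rowId-constrained query narrows to the row's records first (the role
# a cached rowId filter would play), then matches within that subset.

segments = [
    [{"rowId": "user1", "f": "a"}, {"rowId": "user2", "f": "b"}],
    [{"rowId": "user1", "f": "c"}, {"rowId": "user3", "f": "a"}],
]

def row_query(segments, row_id, predicate):
    """Collect matching records for one row, even when its records
    are scattered across multiple segments."""
    hits = []
    for seg in segments:
        for rec in seg:
            if rec["rowId"] == row_id and predicate(rec):
                hits.append(rec)
    return hits

print(row_query(segments, "user1", lambda r: r["f"] == "a"))
# -> [{'rowId': 'user1', 'f': 'a'}]
```

The point of the sketch: as long as every query carries a rowId, the row's records never need to sit next to each other on disk.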

"Today when an add/update of a row happens, all the records are indexed against the IndexWriter
as a collection of documents so that they are guaranteed to be back-to-back. Currently this
is required for the Row Query"

-- Technically, can you point me to the code where I can see this back-to-back dependency
for row-queries, or is it related to performance alone?
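For intuition only (a toy model, not the Blur code in question): when a row's records are added as one contiguous block, the row can be reconstructed from just a start offset and a count, with no per-record rowId lookup. This mirrors how Lucene's `IndexWriter.addDocuments` guarantees consecutive doc IDs for a block, which block-join style queries rely on.

```python
# Toy model of block indexing: appending a row's records back-to-back
# lets us store only (start, count) per row and slice the index directly.

index = []          # flat list standing in for consecutive Lucene doc IDs
row_bounds = {}     # rowId -> (start, count)

def add_row(row_id, records):
    """Append all of a row's records contiguously, so the row is
    recoverable from its position alone."""
    start = len(index)
    index.extend(records)
    row_bounds[row_id] = (start, len(records))

add_row("user1", ["r1", "r2", "r3"])
add_row("user2", ["r4"])

start, count = row_bounds["user1"]
print(index[start:start + count])   # -> ['r1', 'r2', 'r3']
```

If records were scattered instead, recovering a row would require a rowId filter or a full scan, which is the trade-off being discussed above.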

Apologies for my persistent questions. I am a complete newbie and am just getting started.

> Support for humongous Rows
> --------------------------
>                 Key: BLUR-220
>                 URL:
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>         Attachments: Blur_Query_Perf_Chart1.pdf, test_results.txt
> One of the limitations of Blur is the size of Rows stored, specifically the number of Records.
Updates in Lucene are currently performed by deleting the document and re-adding it to
the index.  Unfortunately, when any update is performed on a Row in Blur, the entire Row has
to be re-read (if the RowMutationType is UPDATE_ROW), then whatever modifications are needed
are made, and it is re-indexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row.
It may vary based on the kind of hardware being used, but as a Row grows in size, indexing
(mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this problem.
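The re-indexing overhead described in the quoted issue can be sketched as follows (a toy model with hypothetical names, not Blur's implementation): under UPDATE_ROW semantics, changing a single record still re-reads and re-writes every record in the row, so mutation cost grows with row size.

```python
# Toy model of UPDATE_ROW: changing one record costs a rewrite of the
# entire row, so mutation cost is O(row size), not O(1).

def update_row(row, record_id, new_value):
    """Re-read the whole row, apply the one change, re-index everything.
    Returns the new row and the number of records re-indexed."""
    new_row = {}
    for rid, value in row.items():        # re-read every record
        new_row[rid] = new_value if rid == record_id else value
    return new_row, len(new_row)          # every record is re-indexed

row = {f"rec{i}": "old" for i in range(1000)}
row, reindexed = update_row(row, "rec42", "new")
print(reindexed)   # -> 1000 (one change, a thousand records re-indexed)
```

This is why the cost of a single mutation scales with the Row's record count, which is exactly the "humongous Rows" limit the issue describes.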

This message was sent by Atlassian JIRA
