incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Thu, 17 Oct 2013 00:11:42 GMT

    [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797465#comment-13797465 ]

Aaron McCurry commented on BLUR-220:
------------------------------------

Today, when a row is added or updated, all of its records are indexed against the IndexWriter
as a collection of documents so that they are guaranteed to be back-to-back.  Currently this
is required for the Row Query ( http://incubator.apache.org/blur/docs/0.2.0/data-model.html#row_query
) to work properly.  Because of this requirement, as the row increases in size the whole row
has to be re-indexed over and over again.  This means that writes take a huge performance hit
whenever you are doing anything other than replacing the row.
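
To make the write cost concrete, here is a minimal sketch in plain Java. It is a toy model, not Blur or Lucene code: CountingWriter and its replaceBlock method are hypothetical stand-ins that only count how many documents would be written if every mutation rewrote the row as one contiguous block.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Blur or Lucene code): every append rewrites the whole row as one block. */
public class RowReindexCost {

    /** Hypothetical stand-in for an index writer; it only counts documents written. */
    static final class CountingWriter {
        long documentsWritten = 0;

        // Models "delete the old block, then add the whole row again as one block".
        void replaceBlock(List<String> records) {
            documentsWritten += records.size();
        }
    }

    /** Appends records one at a time; each append re-indexes every record in the row. */
    public static long documentsWrittenForAppends(int recordCount) {
        CountingWriter writer = new CountingWriter();
        List<String> row = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            row.add("record-" + i);
            writer.replaceBlock(row);
        }
        return writer.documentsWritten;
    }

    /** Replaces the row in a single mutation; the block is written exactly once. */
    public static long documentsWrittenForReplace(int recordCount) {
        CountingWriter writer = new CountingWriter();
        List<String> row = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            row.add("record-" + i);
        }
        writer.replaceBlock(row);
        return writer.documentsWritten;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 1_000, 100_000}) {
            System.out.printf("records=%d appends=%d replace=%d%n",
                    n, documentsWrittenForAppends(n), documentsWrittenForReplace(n));
        }
    }
}
```

Under this model, appending n records one at a time writes n(n+1)/2 documents in total, while replacing the row writes n, which is why incremental mutations against a large row degrade so quickly.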

Now I think it's possible that we could come up with a mixed approach where we use the join query
for recently updated rows and then merge them fully (somehow) back into the segment as back-to-back
documents, without reindexing the entire row.

The reason the complexity exists today is that the cost of the query time join (Row Query), when
the documents (records) are indexed together, is negligible regardless of the size of the index.
Think of the worst-case scenario for the query time join: the same logical query with the Row
Query will take a few milliseconds instead of several seconds.
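
The difference between the two join styles can be sketched in plain Java. This is an illustrative toy, not the Blur or Lucene implementation: the Rec class, the sample data, and both method names are assumptions made up for the example. It finds rows that contain at least one record matching predicate a and at least one matching predicate b.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

/** Toy comparison of a block-style join and a query-time join (not Blur or Lucene code). */
public class RowJoinSketch {

    /** Hypothetical record: a row id plus a single value. */
    public static final class Rec {
        public final String rowId;
        public final String value;

        public Rec(String rowId, String value) {
            this.rowId = rowId;
            this.value = value;
        }
    }

    /** Block layout: a row's records are adjacent, so one pass with two flags is enough. */
    public static List<String> blockJoin(List<Rec> index, Predicate<Rec> a, Predicate<Rec> b) {
        List<String> rows = new ArrayList<>();
        String current = null;
        boolean sawA = false;
        boolean sawB = false;
        for (Rec r : index) {
            if (!r.rowId.equals(current)) {
                // Row boundary: emit the previous row if both predicates matched in it.
                if (current != null && sawA && sawB) {
                    rows.add(current);
                }
                current = r.rowId;
                sawA = false;
                sawB = false;
            }
            sawA |= a.test(r);
            sawB |= b.test(r);
        }
        if (current != null && sawA && sawB) {
            rows.add(current);
        }
        return rows;
    }

    /** Query-time join: records may be anywhere, so row ids are collected and intersected. */
    public static Set<String> queryTimeJoin(List<Rec> index, Predicate<Rec> a, Predicate<Rec> b) {
        Set<String> rowsA = new HashSet<>();
        Set<String> rowsB = new HashSet<>();
        for (Rec r : index) {
            if (a.test(r)) rowsA.add(r.rowId);
            if (b.test(r)) rowsB.add(r.rowId);
        }
        rowsA.retainAll(rowsB);
        return rowsA;
    }

    /** Sample index where each row's records are back-to-back. */
    public static List<Rec> contiguousSample() {
        return Arrays.asList(
                new Rec("row1", "apple"), new Rec("row1", "banana"),
                new Rec("row2", "apple"),
                new Rec("row3", "banana"), new Rec("row3", "cherry"), new Rec("row3", "apple"));
    }

    /** Sample index where records of the same row are scattered. */
    public static List<Rec> scatteredSample() {
        return Arrays.asList(
                new Rec("row3", "banana"), new Rec("row1", "apple"), new Rec("row2", "apple"),
                new Rec("row3", "cherry"), new Rec("row3", "apple"), new Rec("row1", "banana"));
    }

    public static void main(String[] args) {
        Predicate<Rec> apple = r -> r.value.equals("apple");
        Predicate<Rec> banana = r -> r.value.equals("banana");
        System.out.println(blockJoin(contiguousSample(), apple, banana));
        System.out.println(queryTimeJoin(scatteredSample(), apple, banana));
    }
}
```

In this toy, the block version keeps only the current row id and two flags, but it is only correct when a row's records are adjacent. The query-time version tolerates scattered records at the price of holding a set of row ids per predicate, and that extra state is what grows with the number of matching rows.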

Aaron

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
CreateSortedIndex.java, MyEarlyTerminatingCollector.java, test_results.txt, TestSearch.java,
TestSearch.java
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically the number of Records.
 Currently, updates in Lucene are performed by deleting the document and re-adding it to
the index.  Unfortunately, when any update is performed on a Row in Blur, the entire Row has
to be re-read (if the RowMutationType is UPDATE_ROW), the needed modifications are made,
and then it is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row. 
It may vary based on the kind of hardware that is being used, but as the Row grows in size the
indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques on how to deal with this problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
