incubator-blur-dev mailing list archives

From "Ravikumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Sat, 19 Oct 2013 12:25:43 GMT

    [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13799880#comment-13799880 ]

Ravikumar commented on BLUR-220:
--------------------------------

Thanks for the link. Now I understand how this is all related.

I was thinking of another idea that I wanted to get your opinion on.

Now that we have a SortingMergePolicy from Lucene, it is actually possible to co-locate all
records of a given row. This works during a segment merge, but newly added records of a
row will still be scattered and will take their own time to participate in a merge.
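
For reference, here is a rough sketch of the wiring I have in mind, assuming Lucene 4.5's
SortingMergePolicy from lucene-misc. The "rowid" field name is only a placeholder, not
Blur's actual schema:

// Rough sketch only: assumes org.apache.lucene.index.sorter.SortingMergePolicy
// from lucene-misc (Lucene 4.5). The "rowid" field name is a placeholder.
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.sorter.SortingMergePolicy;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class RowSortedWriter {
  public static IndexWriter open(File path) throws IOException {
    Directory dir = FSDirectory.open(path);
    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_45, new StandardAnalyzer(Version.LUCENE_45));
    // Merged segments come out sorted by row id, so all records of a
    // row end up adjacent on disk after each merge.
    Sort byRow = new Sort(new SortField("rowid", SortField.Type.STRING));
    conf.setMergePolicy(new SortingMergePolicy(conf.getMergePolicy(), byRow));
    return new IndexWriter(dir, conf);
  }
}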

Instead, for an online indexing case where records continuously trickle in for all rows,
would it be good to do something like Zoie's search system, where incoming operations are
buffered directly in RAM rather than on disk? Since we already have a transaction log,
recovery is built into Blur. The details are at https://code.google.com/p/zoie/wiki/ZoieSystem

The basic idea here is to divide the allocated RAM into RAM-A and RAM-B. All document
operations go into RAM-A. When RAM-A is full, RAM-A and RAM-B are swapped. A custom searcher
wraps both the RAMDir and the disk-based Dirs to return the final set of results. This is
very similar to the HBase Memstore, except that we have two slots of memory.
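
A bare-bones sketch of the two slots (hypothetical names, no error handling or draining
logic, not Zoie's or Blur's actual code):

// Hypothetical two-slot buffer; refresh, draining and error handling omitted.
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class DualRamBuffer {
  private Directory ramA = new RAMDirectory(); // active slot, takes writes
  private Directory ramB = new RAMDirectory(); // passive slot, being drained
  private final Directory disk;

  public DualRamBuffer(Directory disk) {
    this.disk = disk;
  }

  // When RAM-A fills up, the slots swap; the old RAM-A is drained to
  // disk in the background while the other slot takes new writes.
  public synchronized void swap() {
    Directory tmp = ramA;
    ramA = ramB;
    ramB = tmp;
  }

  // The custom searcher: one view over both RAM slots and the disk index.
  public synchronized IndexSearcher openSearcher() throws IOException {
    IndexReader combined = new MultiReader(
        DirectoryReader.open(ramA),
        DirectoryReader.open(ramB),
        DirectoryReader.open(disk));
    return new IndexSearcher(combined);
  }
}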

Instead of blindly flushing the full RAM to disk, we apply our SortingMergePolicy ordering
to this RAM buffer and then flush. With this approach, even fresh segments will have all
records of a row co-located.
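
For the flush itself the merge policy does not kick in, so I imagine using its lucene-misc
counterpart, SortingAtomicReader, with IndexWriter.addIndexes to get the same sorted layout.
Again just a sketch, with "rowid" still a placeholder:

// Sketch: drain a full RAM slot to disk with records sorted by row id.
// Assumes SortingAtomicReader from lucene-misc (Lucene 4.5).
import java.io.IOException;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.index.sorter.SortingAtomicReader;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;

public class SortedFlush {
  public static void drain(Directory ramSlot, IndexWriter diskWriter)
      throws IOException {
    Sort byRow = new Sort(new SortField("rowid", SortField.Type.STRING));
    // Collapse the RAM slot to a single reader, then view it in sorted order.
    AtomicReader flat = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(ramSlot));
    AtomicReader sorted = SortingAtomicReader.wrap(flat, byRow);
    // Writes one fresh segment with all records of a row co-located.
    diskWriter.addIndexes(sorted);
    diskWriter.commit();
  }
}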

All of Blur's existing functionality should then continue to work unchanged, right?

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
> CreateSortedIndex.java, FullRowReindexing.java, MyEarlyTerminatingCollector.java, test_results.txt,
> TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of stored Rows, specifically the number of Records.
> Updates in Lucene are currently performed by deleting the document and re-adding it to the
> index. Unfortunately, when any update is performed on a Row in Blur, the entire Row has to
> be re-read (if the RowMutationType is UPDATE_ROW), whatever modifications are needed are
> made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row. It may
> vary based on the kind of hardware being used, but as a Row grows in size, indexing
> (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this problem.



