incubator-blur-dev mailing list archives

From "Ravikumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Thu, 24 Oct 2013 16:52:22 GMT

    [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804394#comment-13804394 ]

Ravikumar commented on BLUR-220:
--------------------------------

Thanks for the pointers on the block-cache. I will look into integrating it. Maybe the on-heap
allocation can use this logic, while the off-heap path can continue with the existing code.

I will create a new issue for the NRT updates.

Let's walk through the sequence:

1. Tiny sorted segments make it to disk from RAM.
2. Future merges take place among already-sorted segments.
3. So, inside every segment, each Row will be co-located with all of its Records. Still, these
Rows will be scattered across segments.
4. The SortingMergePolicy implementation uses TimSort underneath, which is almost O(n) for
already-sorted data (a configuration sketch follows this list). Also, this is quite different
from a linear scan, as merges always bulk fetch-and-write data. For actual comparisons, please
see the details at
https://issues.apache.org/jira/browse/LUCENE-4752?focusedCommentId=13605896&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13605896
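
For concreteness, here is a minimal sketch of how such a merge policy might be wired up
against the Lucene 4.5-era API. The "rowid" sort field, the analyzer, and the base merge
policy are assumptions for illustration only; Blur's actual configuration may differ.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.Version;

    public class SortedSegmentConfig {
      public static IndexWriterConfig newConfig() {
        // Sort segments by row id so that all Records of a Row stay co-located
        // inside each segment ("rowid" is a hypothetical field name).
        Sort rowSort = new Sort(new SortField("rowid", SortField.Type.STRING));
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_45,
            new StandardAnalyzer(Version.LUCENE_45));
        // Wrap the base policy; every merge then re-sorts its input segments.
        // Because the inputs are already sorted, the underlying TimSort runs
        // in close to O(n).
        conf.setMergePolicy(new SortingMergePolicy(new TieredMergePolicy(), rowSort));
        return conf;
      }
    }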

As per the link, big-segment merges are actually quite fast and on par with normal merges,
provided the index uses no stored fields. Otherwise, merges will be 2-3x slower.

Let me know if you are convinced by this.

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
> CreateSortedIndex.java, FullRowReindexing.java, MyEarlyTerminatingCollector.java, SlabAllocator.java,
> SlabRAMDirectory.java, SlabRAMFile.java, SlabRAMInputStream.java, SlabRAMOutputStream.java,
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the number of Records.
> Updates in Lucene are currently performed by deleting the document and re-adding it to the
> index.  Unfortunately, when any update is performed on a Row in Blur, the entire Row has to
> be re-read (if the RowMutationType is UPDATE_ROW), whatever modifications are needed are
> made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row.  It may
> vary based on the kind of hardware being used, but as a Row grows in size, indexing
> (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this problem.
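
To make the cost described above concrete, here is a hedged sketch of the delete-and-re-add
update path using the Lucene 4.x API. The "rowid" field and the helper method are hypothetical
illustrations, not Blur's actual code.

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class RowUpdater {
      // Hypothetical UPDATE_ROW: the caller must supply *all* Records of the
      // Row (re-read and modified), because the whole Row is reindexed even
      // when only one Record changed.
      static void updateRow(IndexWriter writer, String rowId,
                            List<Document> allRecordsOfRow) throws IOException {
        for (Document record : allRecordsOfRow) {
          // Every Record carries the row id so the delete term matches all of them.
          record.add(new StringField("rowid", rowId, Store.YES));
        }
        // Atomically deletes every document matching the term and re-adds the
        // full record set -- this is the per-update overhead discussed above.
        writer.updateDocuments(new Term("rowid", rowId), allRecordsOfRow);
      }
    }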



--
This message was sent by Atlassian JIRA
(v6.1#6144)
