incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Thu, 24 Oct 2013 14:12:01 GMT

    [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804223#comment-13804223 ]

Aaron McCurry commented on BLUR-220:
------------------------------------

We should probably look to integrate the slab feature into the block cache subsystem in
Blur; there is already a lot of logic there for off-heap allocation and management.  It's integrated
into the CacheDirectory (v2) if you want to take a look.
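
For illustration, here is a minimal sketch of the slab idea on its own: carve fixed-size
blocks out of large off-heap buffers and recycle them.  The names here (SlabPool, allocate,
release) are hypothetical, not the actual BlockCache API.

import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: pre-allocate large direct (off-heap) slabs and
// hand out fixed-size blocks sliced from them, recycling on release.
public class SlabPool {
  private static final int SLAB_SIZE = 128 * 1024 * 1024;  // 128 MB per slab
  private static final int BLOCK_SIZE = 8 * 1024;          // 8 KB blocks
  private final ConcurrentLinkedQueue<ByteBuffer> free =
      new ConcurrentLinkedQueue<ByteBuffer>();

  public SlabPool(int slabCount) {
    for (int i = 0; i < slabCount; i++) {
      ByteBuffer slab = ByteBuffer.allocateDirect(SLAB_SIZE);  // off heap
      for (int off = 0; off + BLOCK_SIZE <= SLAB_SIZE; off += BLOCK_SIZE) {
        slab.limit(off + BLOCK_SIZE);
        slab.position(off);
        free.add(slab.slice());  // independent view of one block
      }
    }
  }

  public ByteBuffer allocate() {
    ByteBuffer block = free.poll();
    if (block == null) {
      throw new IllegalStateException("slab pool exhausted");
    }
    block.clear();
    return block;
  }

  public void release(ByteBuffer block) {
    free.add(block);  // recycle instead of leaving it to the GC
  }
}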

I have prototyped some logic that actually uses the RAMDirectory with a swap-out mechanism
like you have described above.  I got good results with the NRT updating: opening the index
took ~1ms on average, with an update rate of one update per millisecond.
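
For context, the prototype is roughly the following sketch (Lucene 4.x API; the class name
RamSwapIndexer and the policy for when to swap are made up for illustration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Sketch: buffer NRT updates in a RAMDirectory, then swap in a fresh
// one and fold the sealed RAM segment into the main on-disk index.
public class RamSwapIndexer {
  private final IndexWriter mainWriter;  // writer on the on-disk index
  private RAMDirectory ramDir = new RAMDirectory();
  private IndexWriter ramWriter;

  public RamSwapIndexer(IndexWriter mainWriter) throws Exception {
    this.mainWriter = mainWriter;
    this.ramWriter = newRamWriter(ramDir);
  }

  private static IndexWriter newRamWriter(Directory dir) throws Exception {
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));
    return new IndexWriter(dir, conf);
  }

  public IndexWriter ram() {
    return ramWriter;  // incoming updates go to the RAM buffer
  }

  // Called when the RAM segment is big enough to push to disk.
  public synchronized void swap() throws Exception {
    IndexWriter oldWriter = ramWriter;
    RAMDirectory oldDir = ramDir;
    ramDir = new RAMDirectory();      // fresh empty buffer for new writes
    ramWriter = newRamWriter(ramDir);
    oldWriter.close();                // seal the old RAM segment
    mainWriter.addIndexes(oldDir);    // fold it into the main index
    oldDir.close();
  }
}

An NRT reader over the RAM buffer stays small between swaps, which is what keeps the
reopen cost low in a scheme like this.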

However, I think that what we are talking about here relates more to NRT updates than huge
rows.  I do have a performance concern about your proposed FilteredReader.

Let's say that we go to update a row by adding a single record to it, and we have to merge
in the existing records from a row that lives in a large segment, say 5 million documents with
50 million terms.  The FilteredReader will have to walk the entire field -> term -> doc
-> position tree to locate the pieces of the index that are related to the row in question.
It's like a full table scan, right?
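
To make the concern concrete, a reader filtered down to one row has to do something like
the following (Lucene 4.x API, sketch only; rowDocs stands in for whatever marks the
documents of the row in question):

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;

// Sketch: keeping only the postings for one row still forces a visit
// to every field and every term in the segment, effectively a full scan.
public class RowScanCost {
  static void scanSegment(AtomicReader reader, Bits rowDocs) throws Exception {
    Fields fields = reader.fields();
    for (String field : fields) {                 // every field
      Terms terms = fields.terms(field);
      if (terms == null) {
        continue;
      }
      TermsEnum termsEnum = terms.iterator(null);
      while (termsEnum.next() != null) {          // every term
        DocsEnum docs = termsEnum.docs(reader.getLiveDocs(), null);
        int doc;
        while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
          if (rowDocs.get(doc)) {
            // this posting belongs to the row being rewritten; keep it
          }
        }
      }
    }
  }
}

With 50 million terms, those outer loops run 50 million times even though the row being
updated might only touch a handful of documents.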

I would like to continue the thread on the changes to the NRT updates (the RAMDirectory
swap), but we should create a new issue to continue that discussion.

Thanks!

Aaron


> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
CreateSortedIndex.java, FullRowReindexing.java, MyEarlyTerminatingCollector.java, SlabAllocator.java,
SlabRAMDirectory.java, SlabRAMFile.java, SlabRAMInputStream.java, SlabRAMOutputStream.java,
test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of Rows stored, specifically the number of Records.
> Updates in Lucene are currently performed by deleting the document and re-adding it to the
> index.  Unfortunately, when any update is performed on a Row in Blur, the entire Row has
> to be re-read (if the RowMutationType is UPDATE_ROW), whatever modifications are needed
> are made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row.  It
> may vary based on the kind of hardware being used, but as a Row grows in size, the indexing
> (mutations) against that Row will slow.
> This issue is being created to discuss techniques on how to deal with this problem.
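
The delete-and-re-add cycle described above boils down to something like this sketch
(the "rowid" field name is illustrative, and the re-read/merge step is elided):

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Sketch of the UPDATE_ROW path: delete every document (Record) in the
// Row, then re-add the merged Row in its entirety, however large it is.
public class RowUpdateSketch {
  static void updateRow(IndexWriter writer, String rowId,
                        List<Document> mergedRowDocs) throws Exception {
    // 1. The existing Row has already been re-read and merged with the
    //    mutation to produce mergedRowDocs (the expensive part).
    // 2. Delete all documents that share the row id.
    writer.deleteDocuments(new Term("rowid", rowId));
    // 3. Re-add the whole Row, one document per Record.
    for (Document doc : mergedRowDocs) {
      writer.addDocument(doc);
    }
  }
}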



--
This message was sent by Atlassian JIRA
(v6.1#6144)
