incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BLUR-220) Support for humongous Rows
Date Fri, 05 Dec 2014 17:07:12 GMT

     [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron McCurry updated BLUR-220:
-------------------------------
    Attachment: MergeDeltaIndex.java
                MaskedAtomicReader.java
                blur_partial_update_v2.csv

I have been working on this task for bulk ingest of partial Rows.  The attachments above contain
an implementation that performs a sorted merge of Row parts and adds the result to the main index.
This would work like the current index importer, which adds a new segment of Rows to the shard.
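
As a rough illustration of the sorted-merge idea (a minimal sketch only, not the attached
MergeDeltaIndex.java): assuming each input is already sorted by row id, a k-way merge can
stitch the parts of each Row together in a single pass.  The names RowPartMerge, RowPart and
Cursor are hypothetical, and Records are modeled as plain strings rather than Lucene documents.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: k-way sorted merge of Row parts keyed by row id. */
final class RowPartMerge {

  /** One fragment of a Row: a row id plus some of that Row's Records (plain strings here). */
  static final class RowPart {
    final String rowId;
    final List<String> records;

    RowPart(String rowId, List<String> records) {
      this.rowId = rowId;
      this.records = records;
    }
  }

  /** Cursor over one non-empty input that is already sorted by row id. */
  private static final class Cursor {
    final Iterator<RowPart> it;
    RowPart current;

    Cursor(Iterator<RowPart> it) {
      this.it = it;
      this.current = it.next();
    }

    /** Moves to the next part; returns false when this input is exhausted. */
    boolean advance() {
      if (it.hasNext()) {
        current = it.next();
        return true;
      }
      return false;
    }
  }

  /** Merges inputs sorted by row id into one sorted list with one entry per row id. */
  static List<RowPart> merge(List<List<RowPart>> sortedInputs) {
    // The heap always exposes the cursor whose current part has the smallest row id.
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>((a, b) -> a.current.rowId.compareTo(b.current.rowId));
    for (List<RowPart> input : sortedInputs) {
      if (!input.isEmpty()) {
        heap.add(new Cursor(input.iterator()));
      }
    }

    List<RowPart> merged = new ArrayList<>();
    RowPart open = null; // the Row currently being assembled
    while (!heap.isEmpty()) {
      Cursor c = heap.poll();
      RowPart part = c.current;
      if (open == null || !open.rowId.equals(part.rowId)) {
        // A new row id starts: begin a fresh output Row.
        open = new RowPart(part.rowId, new ArrayList<>(part.records));
        merged.add(open);
      } else {
        // Same row id as the open Row: append this part's Records to it.
        open.records.addAll(part.records);
      }
      if (c.advance()) {
        heap.add(c); // re-insert with its new current part
      }
    }
    return merged;
  }
}
```

In this sketch the order of Records contributed by different inputs to the same Row is not
defined; a real merge would need a rule for which part wins when Records collide.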

I have also attached some timing results comparing the merge against simply reading
and re-indexing the partial Rows.

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
CreateSortedIndex.java, FullRowReindexing.java, MaskedAtomicReader.java, MergeDeltaIndex.java,
MyEarlyTerminatingCollector.java, SlabAllocator.java, SlabRAMDirectory.java, SlabRAMFile.java,
SlabRAMInputStream.java, SlabRAMOutputStream.java, TestSearch.java, TestSearch.java, blur_partial_update_v2.csv,
test_results.txt
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically the number of Records.
 Updates in Lucene are currently performed by deleting the document and re-adding it to
the index.  Unfortunately, when any update is performed on a Row in Blur, the entire Row has
to be re-read (if the RowMutationType is UPDATE_ROW), the needed modifications are
made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row.
The limit may vary based on the kind of hardware being used, but as a Row grows in size, indexing
(mutations) against that Row will slow down.
> This issue is being created to discuss techniques for dealing with this problem.
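
To make the overhead described above concrete, here is a toy model (a sketch under stated
assumptions, not Blur or Lucene code) of an update that re-reads a Row, deletes it, and re-adds
every Record.  The class FullRowReindexModel and its counter are hypothetical and exist only to
count how many Records get written.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of delete-and-re-add Row updates; not Blur or Lucene code. */
final class FullRowReindexModel {
  private final Map<String, List<String>> rows = new HashMap<>();
  private long documentsWritten = 0;

  /** Adds one Record by re-reading the whole Row, deleting it, and re-adding every Record. */
  void appendRecord(String rowId, String record) {
    List<String> existing = rows.remove(rowId); // read and delete the current Row
    List<String> updated =
        existing == null ? new ArrayList<>() : new ArrayList<>(existing);
    updated.add(record);                        // apply the modification
    rows.put(rowId, updated);                   // re-add the Row in its entirety
    documentsWritten += updated.size();         // every Record is indexed again
  }

  /** Total number of Records written across all updates so far. */
  long documentsWritten() {
    return documentsWritten;
  }
}
```

In this model, appending Records one at a time to a single Row writes 1 + 2 + ... + n Records
in total, so the cost of each update grows with the size of the Row.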



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
