incubator-blur-dev mailing list archives

From "Ravikumar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BLUR-220) Support for humongous Rows
Date Thu, 17 Oct 2013 10:09:44 GMT

     [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravikumar updated BLUR-220:
---------------------------

    Attachment: FullRowReindexing.java

"Now I think it's possible that we could come up with a mixed approach where use the join
query for recently updated rows and then merge them fully (somehow) back into the segment
as back-to-back documents again without reindexing the entire row again."

Even though I do not understand why we need this, I have attached an experimental patch that
avoids re-indexing the entire row but still copies the relevant row data to newer segments.

We create a FilteredAtomicReader that exposes only the documents with the input rowId, copy
all of their data over to the newer segment, and then delete those documents from the older
segments. A rough sketch of the idea follows.
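
For context, here is a minimal, untested sketch of that idea against the Lucene 4.x-era APIs
Blur builds on. The attached FullRowReindexing.java is the actual patch; the class name, the
"rowid" field, and the variable names below are assumptions for illustration only.

{code:java}
import java.io.IOException;

import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

/**
 * Wraps a segment reader so that only the documents of a single row appear
 * "live". Passing this reader to IndexWriter.addIndexes() copies just that
 * row into a new segment; the caller then deletes the row from the old one.
 */
public class SingleRowAtomicReader extends FilterAtomicReader {

  private final FixedBitSet rowDocs; // docs belonging to the requested row
  private final int rowDocCount;

  public SingleRowAtomicReader(AtomicReader in, String rowIdField, String rowId) throws IOException {
    super(in);
    rowDocs = new FixedBitSet(in.maxDoc());
    int count = 0;
    Terms terms = in.terms(rowIdField);
    if (terms != null) {
      TermsEnum termsEnum = terms.iterator(null);
      if (termsEnum.seekExact(new BytesRef(rowId))) {
        // Intersect with the segment's own live docs so documents that are
        // already deleted do not get copied over.
        DocsEnum docs = termsEnum.docs(in.getLiveDocs(), null);
        for (int doc = docs.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = docs.nextDoc()) {
          rowDocs.set(doc);
          count++;
        }
      }
    }
    rowDocCount = count;
  }

  @Override
  public Bits getLiveDocs() {
    // Everything outside the row looks deleted to the merge performed by
    // addIndexes(), so only the row's documents are copied.
    return rowDocs;
  }

  @Override
  public int numDocs() {
    return rowDocCount;
  }

  @Override
  public boolean hasDeletions() {
    return true;
  }
}
{code}

The caller would then do something like writer.addIndexes(new SingleRowAtomicReader(segmentReader,
"rowid", rowId)) followed by writer.deleteDocuments(new Term("rowid", rowId)), which moves the
row's documents into the newest segment as a straight segment copy, without re-analyzing them.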

Do let me know if this comes anywhere near what you are looking for.

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
> CreateSortedIndex.java, FullRowReindexing.java, MyEarlyTerminatingCollector.java,
> test_results.txt, TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically the number of
> Records. Updates are currently performed in Lucene by deleting the documents and re-adding
> them to the index. Unfortunately, when any update is performed on a Row in Blur, the entire
> Row has to be re-read (if the RowMutationType is UPDATE_ROW), whatever modifications are
> needed are made, and then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given Row. It may
> vary based on the kind of hardware being used, but as a Row grows in size, indexing
> (mutations) against that Row will slow down.
> This issue is being created to discuss techniques for how to deal with this problem.
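
To make the overhead above concrete, below is a minimal sketch of that delete-and-re-add
update path using plain Lucene 4.x-era calls. The "rowid" field, the record cap, and the
omitted mutation step are assumptions for illustration; Blur's real code path goes through
its RowMutation handling rather than this simplified helper.

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

/**
 * Illustrates why row updates slow down as rows grow: to change even one
 * Record, every document of the Row is re-read, mutated in memory, and the
 * whole Row is deleted and re-added in a single updateDocuments() call.
 */
public class RowUpdateSketch {

  public static void rewriteRow(IndexSearcher searcher, IndexWriter writer,
      String rowId, int maxRecordsPerRow) throws IOException {
    Term rowTerm = new Term("rowid", rowId);

    // 1. Re-read every document (Record) of the Row. Here this is a simple
    //    stored-field lookup; the cost grows linearly with the Row size.
    TopDocs hits = searcher.search(new TermQuery(rowTerm), maxRecordsPerRow);
    List<Document> rowDocs = new ArrayList<Document>(hits.scoreDocs.length);
    for (ScoreDoc sd : hits.scoreDocs) {
      rowDocs.add(searcher.doc(sd.doc));
    }

    // 2. Apply the requested record changes to the in-memory copy (omitted).

    // 3. Atomically delete the old documents and re-index the Row in its
    //    entirety, even though only one Record may have changed.
    writer.updateDocuments(rowTerm, rowDocs);
  }
}
{code}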



--
This message was sent by Atlassian JIRA
(v6.1#6144)
