incubator-blur-dev mailing list archives

From "Ravikumar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-220) Support for humongous Rows
Date Mon, 21 Oct 2013 11:36:42 GMT

    [ https://issues.apache.org/jira/browse/BLUR-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800557#comment-13800557 ]

Ravikumar commented on BLUR-220:
--------------------------------

I am describing the 10,000-ft view of this approach.

1. Let's have 2 RAMDirectories per shard, in each shard-server: one for buffering
incoming documents [Active-RAM] and another for merge-sorting and flushing to
HDFS [Flushable-RAM]. The per-shard state is sketched just below.
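
Roughly, the per-shard state could look like this (a sketch only; ShardRamBuffers
is an illustrative name, not an existing Blur class):

    import org.apache.lucene.index.IndexWriter;

    // One instance per shard, per shard-server.
    class ShardRamBuffers {
        // Buffers incoming documents until the swap threshold is hit.
        volatile IndexWriter activeRamWriter;     // backed by a RAMDirectory
        // Drained to HDFS by the background merge-sort/flush thread.
        volatile IndexWriter flushableRamWriter;  // backed by a RAMDirectory
    }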

2. Based on the number of documents added or the absolute bytes consumed, the
Active-RAM gets swapped with the Flushable-RAM, per shard (swap sketch below).
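
Building on the ShardRamBuffers sketch above, the swap could be as simple as the
following (thresholds are made-up placeholders; real values would be configurable):

    import org.apache.lucene.index.IndexWriter;

    // Illustrative thresholds, not tuned numbers.
    private static final int  MAX_BUFFERED_DOCS  = 100000;
    private static final long MAX_BUFFERED_BYTES = 64L << 20;   // 64 MB

    // Called under the shard's write lock after each mutation.
    void maybeSwap(ShardRamBuffers b, int docsBuffered, long bytesBuffered) {
        if (docsBuffered >= MAX_BUFFERED_DOCS || bytesBuffered >= MAX_BUFFERED_BYTES) {
            IndexWriter emptied = b.flushableRamWriter;   // empty after the last flush
            b.flushableRamWriter = b.activeRamWriter;     // hand off to the flusher
            b.activeRamWriter = emptied;                  // resume buffering into it
        }
    }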

3. For each incoming mutation, add the mutation to the Active-RAM and delete that
mutation from the Flushable-RAM and HDFS indexes (sketched in code after this list):
      a. getActiveRAMIndexWriter().updateDocuments(List<Document>); [contains all
         record-specific mutations per row]
      b. getFlushableRAMIndexWriter().delete(Query... rowIdAndRecordIdQueries); [a
         set of queries containing rowId & recordId terms]
      c. getHDFSIndexWriter().delete(Query... rowIdAndRecordIdQueries);
      d. Record in the Blur transaction log.
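
In Lucene terms (4.x-era APIs), step 3 could look like the sketch below. The
getXXXIndexWriter() accessors and transactionLog are hypothetical; note also that
Lucene's updateDocuments() takes a single delete Term, so the per-record delete
queries are applied explicitly before re-adding:

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.Query;

    void applyMutation(List<Document> rowDocs, Query... rowIdAndRecordIdQueries)
            throws IOException {
        // a. replace any buffered versions of these records in Active-RAM
        IndexWriter active = getActiveRAMIndexWriter();
        active.deleteDocuments(rowIdAndRecordIdQueries);
        active.addDocuments(rowDocs);            // all record mutations of the row
        // b. + c. purge older versions from the other two indexes
        getFlushableRAMIndexWriter().deleteDocuments(rowIdAndRecordIdQueries);
        getHDFSIndexWriter().deleteDocuments(rowIdAndRecordIdQueries);
        // d. record in the Blur transaction log for recovery
        transactionLog.append(rowDocs, rowIdAndRecordIdQueries);
    }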

4. Step 3 continues until the threshold from step 2 is crossed; then the Active-RAM
and Flushable-RAM are swapped. A background thread starts merge-sorting and flushing
from the Flushable-RAM to HDFS, to co-locate all records of a row (simplified flush
sketch below).
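
A simplified flush sketch (again 4.x-era APIs). Note that addIndexes() alone copies
segments in their existing order; the real flusher would merge-sort by rowId (e.g.
via a sorting reader) so that all records of a row land together:

    import java.io.IOException;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;

    void flushToHdfs(IndexWriter flushableRamWriter, IndexWriter hdfsWriter)
            throws IOException {
        flushableRamWriter.commit();             // make buffered docs visible
        try (DirectoryReader ramReader =
                 DirectoryReader.open(flushableRamWriter.getDirectory())) {
            hdfsWriter.addIndexes(ramReader);    // copy into the HDFS index
            hdfsWriter.commit();
        }
        flushableRamWriter.deleteAll();          // empty the buffer for the next swap
        flushableRamWriter.commit();
    }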

5. It is highly likely that while a flush is in progress, deletes will arrive at the
Flushable-RAM index by way of step 3. These are accumulated in a DeleteQueue and
committed alongside step 4 (sketch below).
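
The DeleteQueue could be as simple as this sketch; mutations that race with an
in-progress flush park their delete queries here, and the flusher drains them into
the HDFS writer just before its commit:

    import java.io.IOException;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.Query;

    private final Queue<Query> deleteQueue = new ConcurrentLinkedQueue<Query>();

    // Drained by the flusher thread right before hdfsWriter.commit() in step 4.
    void drainDeletes(IndexWriter hdfsWriter) throws IOException {
        Query q;
        while ((q = deleteQueue.poll()) != null) {
            hdfsWriter.deleteDocuments(q);
        }
    }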

6. Incoming searches will involve 3 IndexSearchers, one each on the Active-RAM,
Flushable-RAM, and HDFS indexes. A given row-record will be found in only one index,
no matter how many updates it has gone through (combined-search sketch below).
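
One way to realize this with a single logical searcher is Lucene's MultiReader;
since a given row-record lives in exactly one of the three indexes, no hit
de-duplication is needed:

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;

    IndexSearcher openCombinedSearcher(DirectoryReader activeRam,
                                       DirectoryReader flushableRam,
                                       DirectoryReader hdfs) {
        // Readers would be reopened (or swapped) whenever a swap/flush completes.
        return new IndexSearcher(new MultiReader(activeRam, flushableRam, hdfs));
    }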

Please let me know your comments on this approach. I have a few questions as well,
but I am postponing them for now.

> Support for humongous Rows
> --------------------------
>
>                 Key: BLUR-220
>                 URL: https://issues.apache.org/jira/browse/BLUR-220
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>         Attachments: Blur_Query_Perf_Chart1.pdf, CreateIndex.java, CreateIndex.java,
>                      CreateSortedIndex.java, FullRowReindexing.java,
>                      MyEarlyTerminatingCollector.java, test_results.txt,
>                      TestSearch.java, TestSearch.java
>
>
> One of the limitations of Blur is the size of the Rows stored, specifically the
> number of Records.  Updates in Lucene are currently performed by deleting the
> document and re-adding it to the index.  Unfortunately, when any update is
> performed on a Row in Blur, the entire Row has to be re-read (if the
> RowMutationType is UPDATE_ROW), whatever modifications are needed are made, and
> then the Row is reindexed in its entirety.
> Due to all of this overhead, there is a realistic limit on the size of a given
> Row.  It may vary based on the kind of hardware being used, but as the Row grows
> in size, the indexing (mutations) against that Row will slow.
> This issue is being created to discuss techniques for dealing with this problem.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
