incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <>
Subject [jira] [Commented] (BLUR-445) Remove online mutates from the Blur thrift api
Date Wed, 11 Nov 2015 15:02:11 GMT


Aaron McCurry commented on BLUR-445:

First, I would want to add a timestamp to the Document/Row and SubDocument/Record objects to
provide most-recent-data-wins semantics when multiple mutates to the same data occur in rapid
succession.  Next, the index manager daemon would read properly formed data mutations from data
sources (think files, directories, queues, etc.).  The index manager would then run the necessary
MR job (or substitute another processing technology) to create index deltas.  Those index deltas
would be merged into the indexes (this is a change: currently the shard servers do this) and
committed by creating an HDFS snapshot.  Once the commit completed, the shard servers would move
to the newly committed snapshot of indexes for the given table.  After all the shard servers had
moved to serving the indexes in the new snapshot, old HDFS snapshots could be removed.
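The merge-commit-swap cycle above could be modeled roughly as follows. This is a minimal sketch in Python with purely illustrative names (IndexManager, commit, snap-N); Blur itself is Java and would use real HDFS snapshots rather than in-memory dicts:

```python
class IndexManager:
    """Toy model of the proposed read-only serving cycle: the index
    manager merges deltas and commits snapshots; shard servers only
    ever read from the latest committed snapshot."""

    def __init__(self):
        self._counter = 0
        self.current = "snap-0"                # snapshot the shard servers serve
        self.snapshots = {self.current: {}}    # snapshot name -> frozen index state

    def commit(self, delta):
        # Merge the delta into a copy of the latest committed index
        # (readers keep serving the old snapshot while this runs).
        merged = dict(self.snapshots[self.current])
        merged.update(delta)
        # Publish under a new name -- stands in for an HDFS snapshot.
        self._counter += 1
        name = f"snap-{self._counter}"
        self.snapshots[name] = merged
        # Shard servers switch to the new snapshot, then the old one
        # can be removed since nothing references it any longer.
        old, self.current = self.current, name
        del self.snapshots[old]
        return name

mgr = IndexManager()
mgr.commit({"row1": "v1"})
name = mgr.commit({"row2": "v2"})   # serves both rows; only one snapshot remains
```

The key property the sketch illustrates is that readers never observe a half-merged index: they move atomically from one committed snapshot to the next.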

Mutates could still be submitted in a similar way to how they are today, via Kafka (or something
similar), but the data would not be readable until the index manager brought it online.
This is why the timestamp is needed for updates.
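The last-write-wins resolution the timestamp enables might look like this. A minimal sketch with hypothetical field names ("ts", "value"), not the actual Record/Row schema:

```python
def apply_mutation(record, mutation):
    """Most-recent-data-wins: keep whichever version carries the newer
    timestamp, so rapid or out-of-order mutates converge on the same
    final value regardless of arrival order."""
    if record is None or mutation["ts"] >= record["ts"]:
        return mutation
    return record

older = {"ts": 100, "value": "a"}
newer = {"ts": 200, "value": "b"}
# Applying the two mutates in either order yields the newer value.
assert apply_mutation(older, newer)["value"] == "b"
assert apply_mutation(newer, older)["value"] == "b"
```

Without the timestamp, the bulk pipeline could replay queued mutates in an order that differs from submission order and silently resurrect stale data.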

Also, if the amount of data being ingested is very small, the index manager could update the
indexes directly, skipping the bulk pipeline.  This would allow for more timely updates.
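The routing decision above reduces to a size check. A sketch with an assumed cutoff (BULK_THRESHOLD is illustrative, not a Blur constant):

```python
BULK_THRESHOLD = 10_000  # illustrative cutoff, not a real Blur setting

def choose_path(num_mutations: int) -> str:
    """Route small batches straight to the index for timelier visibility;
    large batches go through the bulk MR-style delta pipeline."""
    return "direct" if num_mutations < BULK_THRESHOLD else "bulk"
```

In practice the threshold would be tuned against merge cost and snapshot overhead rather than fixed.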

> Remove online mutates from the Blur thrift api
> ----------------------------------------------
>                 Key: BLUR-445
>                 URL:
>             Project: Apache Blur
>          Issue Type: Improvement
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
> The primary use case for Blur is massive ingestion of information to be indexed and
searched.  Currently I believe the system has been made overly complex by the atomic operations
in the online index mutation system.  It forces the shard servers to keep writers open to
each of the indexes in the given table, which requires a lot of memory, CPU, and file resources
per shard.
> Currently the system only allows mutates to be atomic when mutating a single row.
Batch mutates are not atomic.
> I propose that we move all index mutations to the bulk indexing approach and utilize
HDFS snapshots for committing index information within a given table.  This will allow the
controller and shard servers to become read-only with respect to the indexes.
> Assuming we move forward with this approach, a new daemon will need to be created, an index
manager.  This daemon will coordinate indexing (MR, Spark, Tez, Flink, etc.) and merging globally
for the cluster.

This message was sent by Atlassian JIRA