incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (BLUR-132) Create Index Snapshots
Date Tue, 13 Aug 2013 10:55:47 GMT

    [ https://issues.apache.org/jira/browse/BLUR-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738082#comment-13738082 ]

Aaron McCurry commented on BLUR-132:
------------------------------------

Thanks for the writeup!

-We can provide the backup in multiple ways:
-     1. Back up all shards (all tables) on a shard server onto the local filesystem (cluster/table/shard/files).
-        While restoring from backup, every shard server reads from its local filesystem
-        and copies the shards onto HDFS.
-     2. Back up all shards on all shard servers onto a common HDFS location. While restoring,
-        we would partition the shards onto shard servers.

I will offer a third option.
3. Snapshot the index and don't make a copy of the data.  We could simply leave the files
in the shard directory that are referenced in the snapshot and not allow the IndexDeletionPolicy
to remove the files.
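
To make option 3 a bit more concrete, here is a rough sketch in plain Java. It is a toy
model only: ToyCommit and SnapshotAwarePolicy are hypothetical names, not Lucene or Blur
classes, and a real version would be built on an IndexDeletionPolicy. It assumes the policy
keeps the latest commit plus any commit pinned by a named snapshot, and treats every other
file as deletable.

```java
import java.util.*;

// Toy model only: hypothetical types, NOT Lucene's IndexCommit / IndexDeletionPolicy.
final class ToyCommit {
    final long generation;
    final Set<String> files;

    ToyCommit(long generation, String... files) {
        this.generation = generation;
        this.files = new TreeSet<>(Arrays.asList(files));
    }
}

final class SnapshotAwarePolicy {
    // Snapshot name -> generation of the commit it pins.
    private final Map<String, Long> snapshots = new HashMap<>();

    void snapshot(String name, ToyCommit commit) {
        snapshots.put(name, commit.generation);
    }

    void release(String name) {
        snapshots.remove(name);
    }

    // Files that must stay in the shard directory: everything referenced by
    // the latest commit, plus everything referenced by a pinned commit.
    Set<String> retainedFiles(List<ToyCommit> commits) {
        Set<String> keep = new TreeSet<>();
        if (commits.isEmpty()) {
            return keep;
        }
        keep.addAll(commits.get(commits.size() - 1).files);
        for (ToyCommit c : commits) {
            if (snapshots.containsValue(c.generation)) {
                keep.addAll(c.files);
            }
        }
        return keep;
    }

    // Files referenced by some commit but no longer needed by anyone.
    Set<String> deletableFiles(List<ToyCommit> commits) {
        Set<String> all = new TreeSet<>();
        for (ToyCommit c : commits) {
            all.addAll(c.files);
        }
        all.removeAll(retainedFiles(commits));
        return all;
    }
}
```

No data is copied here: creating a snapshot only records which commit is pinned, and
releasing it makes that commit's unshared files eligible for removal again.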

In general I didn't really consider snapshots a way of creating a backup, but rather a known
state of the index that is light and fast to create.  That said, I think you would need
snapshots for a backup to work correctly, so perhaps we should create another task to
actually do the copying of a snapshot for backup.

-We should also have some mechanism to restore from a backup. For us to restore the index
from a backup, we might also need a point-in-time copy of all the table descriptors.

Agreed, we should be able to restore to a previous snapshot.  A possible prerequisite for
this will be the ability to change a table to read-only while it's running, so that we can
close and reopen the IndexWriter on each shard.  This could also be broken out into another
task after snapshots are created.

-How are we planning to expose this snapshot functionality (Shell, API, BOTH)? 

Both; the shell just uses the API.

-Where are we even using LocalIndexServer?

We don't use it anywhere; it's legacy code that should probably be removed.

- I was able to take a backup by wrapping IndexDeletionPolicy with SnapshotDeletionPolicy,
then taking a snapshot and copying all the files to a local file system. This technically works
even if the index is being actively updated, but the way in which the code is structured (DistributeIndexServer.openShard),
we would only get a BlurIndexReader when the shard is being updated, whereas the sample code I
have below uses the writer to take the snapshot. Maybe there is a different way?

There were three reasons I wanted to create snapshots.
1. Primary - Create a static view of the index so that MapReduce jobs (or other external systems)
could open and use the indexes (from a snapshot) and they would not be changing while they
were being used.
2. Create the ability to snapshot commit points through time so that, if I needed to roll back
to a certain point and drop all the data after that point, I could.
3. Low priority - Run a shard off a certain commit point and allow the snapshot commit point
to be changed to any other snapshot as well as the head of the index.

- Also, what happens when multiple sources try to add documents to the same shard simultaneously
(using the same IndexWriter)?

If you are asking what happens to the index commit points, or how we would deal with
the multiple sources: we don't.  The snapshots will only operate on committed data, so before
we create a snapshot we will need to call commit on the index.
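
The commit-then-snapshot ordering can be sketched roughly as follows, again as a plain-Java
toy model. ToyShard and its methods are hypothetical, not the IndexWriter API, and for
simplicity the toy copies the committed document list where a real implementation would pin
the commit's files instead. The assumption is that concurrent writers share one buffer of
pending documents and that a snapshot only ever sees what has been committed.

```java
import java.util.*;

// Toy model only: a hypothetical shard, NOT Blur's or Lucene's writer.
final class ToyShard {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private final Map<String, List<String>> snapshots = new HashMap<>();

    // Many sources may call add concurrently; documents stay pending
    // (invisible to snapshots) until the next commit.
    synchronized void add(String doc) {
        pending.add(doc);
    }

    synchronized void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    // Commit first, then capture the committed state under a name.
    // The returned view never changes, whatever is added later.
    synchronized List<String> snapshot(String name) {
        commit();
        List<String> view = Collections.unmodifiableList(new ArrayList<>(committed));
        snapshots.put(name, view);
        return view;
    }

    // What an external job (for example MapReduce) would open.
    synchronized List<String> open(String name) {
        return snapshots.get(name);
    }
}
```

Documents added after the snapshot show up in later commits, but a reader that opened the
named snapshot keeps seeing the same static state.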

- Would really love to know your thoughts and appreciate it if someone can fill in gaps in
my understanding. Thank You.

If you want to pick a starting point I think that #1 is a good place to start.
    1. Primary - Create a static view of the index so that MapReduce jobs (or other external
systems) could open and use the indexes (from a snapshot) and they would not be changing while
they were being used.

I can help you create or modify an IndexDeletionPolicy to behave the way we need, or help
with any other questions.  This is really going to be a feature that will likely grow and change
over time, but I think we can start pretty basic for now.

Aaron




                
> Create Index Snapshots
> ----------------------
>
>                 Key: BLUR-132
>                 URL: https://issues.apache.org/jira/browse/BLUR-132
>             Project: Apache Blur
>          Issue Type: New Feature
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
>
>


