incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <>
Subject [jira] [Commented] (BLUR-234) Create a softlink like capability in the HDFSDirectory
Date Fri, 25 Oct 2013 18:12:32 GMT


Aaron McCurry commented on BLUR-234:

Ok, so here are the basics of what the IndexImporter is doing and where the problem lies.  The
IndexImporter basically calls addIndex (or addDirectory) on the IndexWriter that is the main index
for the shard.  The normal operation of Lucene is to copy all the files from the /XXXXX.commit
directory to the main index, which means reading every index file from the XXXXX.commit directory
and writing it into the main index.  This can be very slow and use a lot of resources, so
the current process intercepts the copy call and instead moves the HDFS file from
the XXXXX.commit index into the main index, renaming the file as needed.  The move is what makes
this a dangerous operation: if the files are moved and the shard process dies before the writer
commits, the files that were moved in are lost, because Lucene deletes unknown files on writer open.

So my solution to this problem is to create a SoftLinkDirectory so that, instead of
moving a file from the XXXXX.commit index, it creates a softlink to the file in the XXXXX.commit
index.  That way, in the event of a failure, no data is lost.  Let me know what you think about
this approach.
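
A minimal sketch of the softlink idea, for illustration only.  It uses local files through
java.nio rather than HDFS or Lucene's Directory API, and every name in it (SoftLinkSketch, the
.lnk suffix, createLink, read, removeLink, demo) is hypothetical rather than taken from Blur.
It only shows the property described above: deleting an uncommitted link leaves the original
file in place.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: local files stand in for HDFS, and no Blur or Lucene API is used.
final class SoftLinkSketch {
    // Assumed naming convention for link files; not something Blur defines.
    static final String LINK_SUFFIX = ".lnk";

    // Instead of moving `source` into `mainDir`, write a small link file
    // whose content is the absolute path of the original file.
    static Path createLink(Path mainDir, String name, Path source) {
        try {
            byte[] target = source.toAbsolutePath().toString().getBytes(StandardCharsets.UTF_8);
            return Files.write(mainDir.resolve(name + LINK_SUFFIX), target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a file in `mainDir` by following its link to the original data.
    static byte[] read(Path mainDir, String name) {
        try {
            byte[] raw = Files.readAllBytes(mainDir.resolve(name + LINK_SUFFIX));
            Path target = Paths.get(new String(raw, StandardCharsets.UTF_8));
            return Files.readAllBytes(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for the cleanup of uncommitted files: only the link is deleted,
    // the file it points to is never touched.
    static void removeLink(Path mainDir, String name) {
        try {
            Files.deleteIfExists(mainDir.resolve(name + LINK_SUFFIX));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Walks through link creation, a read through the link, and a simulated failed commit.
    // Returns "<data read via link>|<original still exists>|<link still exists>".
    static String demo() {
        try {
            Path commitDir = Files.createTempDirectory("commit");
            Path mainDir = Files.createTempDirectory("main");
            Path segment = Files.write(commitDir.resolve("_0.cfs"),
                    "segment-data".getBytes(StandardCharsets.UTF_8));

            createLink(mainDir, "_1.cfs", segment);
            String viaLink = new String(read(mainDir, "_1.cfs"), StandardCharsets.UTF_8);

            removeLink(mainDir, "_1.cfs"); // the commit never happened, so the link is dropped
            boolean originalSurvives = Files.exists(segment);
            boolean linkRemains = Files.exists(mainDir.resolve("_1.cfs" + LINK_SUFFIX));
            return viaLink + "|" + originalSurvives + "|" + linkRemains;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real implementation would also have to decide when a committed link is resolved into a
regular file and when the XXXXX.commit directory can safely be deleted.  The sketch leaves
both questions open.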



> Create a softlink like capability in the HDFSDirectory
> ------------------------------------------------------
>                 Key: BLUR-234
>                 URL:
>             Project: Apache Blur
>          Issue Type: Sub-task
>          Components: Blur
>    Affects Versions: 0.3.0
>            Reporter: Aaron McCurry
>             Fix For: 0.3.0
> The problem we are trying to solve here is minimizing file copying.  During a merge of
an external index produced by MapReduce (MR) into a shard index, the index files are normally
copied.  In a lot of cases the new external index(es) are very large.  This can cause serious
performance problems because all the new data is copied into the shard index.  Normally
this happens across the cluster at the same time, so it will likely turn into an IO storm.
> The current implementation in the IndexImporter deals with this problem by overriding
a method in the HDFSDirectory so that the files are moved in HDFS instead of being copied.
 This makes those merges very fast, but it's risky because if the shard index writer doesn't
commit the changes, the files are not moved back to their original location.  Instead they
are deleted, and the data is lost.
> So I'm proposing that we create a softlink system that allows links to be created
instead of files being moved.  That way, if the commit fails, the links are removed and the
original data files are still in their original location.

This message was sent by Atlassian JIRA
