incubator-blur-dev mailing list archives

From "Aaron McCurry (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (BLUR-439) HDFSDirectory fencing issue
Date Fri, 12 Jun 2015 19:06:00 GMT

     [ https://issues.apache.org/jira/browse/BLUR-439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron McCurry updated BLUR-439:
-------------------------------
    Description: 
We recently had an issue that created a corrupt index.

What happened?

Shard Server 1 (SS1) owned a shard of a table (SH1) and was performing an index import when
a layout change in the table occurred mid-import.  The segment version of this shard was at
"_0".  SS1 performed the work of the import by adding files from an external directory.  At
this point SH1 had not been committed, so the current committed version was still "_0",
although the files for the next segment version "_1" had been written but not yet committed.

Then the SH1 shard moved to another shard server (SS2), and once it was opened there the open
version was also "_0", which is correct.  During the move the directory lock passed to SS2,
which is also correct.  SS2 then started the import again for the external directory that had
not been committed, and it also wrote new files for the "_1" segment.

Meanwhile, back on SS1, the commit was underway.  The directory lock was checked and an
exception was thrown because that process no longer owned the lock.  During the rollback SS1
deleted what it thought were the "_1" segment files that it had written, but the files actually
came from the SS2 import process.  Once the files were deleted, the abort and rollback were
complete and the index had returned to its "_0" state.

However, on SS2 the commit was moving forward for the "_1" segment, whose files had by then
been deleted by SS1, and the index was corrupted.
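
The sequence above can be replayed with a small in-memory model.  This is only a sketch under
stated assumptions, not Blur or HDFSDirectory code: the class names, the two file extensions,
and the check_lock_on_rollback flag are all hypothetical and exist purely for illustration.
It shows that when both servers derive the same "_1" file names, a rollback on the server that
lost the lock removes the new owner's files, and that one possible guard (not necessarily the
fix adopted for this issue) is to skip deletion once the lock is no longer held.

```python
class LockLostError(Exception):
    """Raised when a server tries to commit without owning the directory lock."""


class CorruptIndexError(Exception):
    """Raised when a commit references segment files that no longer exist."""


class SharedDirectory:
    """In-memory stand-in for the shared index directory (hypothetical)."""

    def __init__(self):
        # file name -> name of the server that wrote it
        self.files = {"_0.fdt": "initial", "_0.tim": "initial"}
        self.lock_owner = None
        self.committed = "_0"


class ShardServer:
    EXTENSIONS = ("fdt", "tim")  # illustrative subset of segment files

    def __init__(self, name, directory, check_lock_on_rollback=False):
        self.name = name
        self.dir = directory
        self.check_lock_on_rollback = check_lock_on_rollback

    def open_shard(self):
        # Opening the shard takes over the directory lock.
        self.dir.lock_owner = self.name

    def segment_files(self, segment):
        # Both servers derive identical names from the segment version.
        return [f"{segment}.{ext}" for ext in self.EXTENSIONS]

    def import_segment(self, segment):
        for file_name in self.segment_files(segment):
            self.dir.files[file_name] = self.name

    def rollback(self, segment):
        if self.check_lock_on_rollback and self.dir.lock_owner != self.name:
            # Assumed guard: without the lock, these files may belong to the new owner.
            return
        for file_name in self.segment_files(segment):
            self.dir.files.pop(file_name, None)

    def commit(self, segment):
        if self.dir.lock_owner != self.name:
            self.rollback(segment)
            raise LockLostError(f"{self.name} no longer owns the lock")
        missing = [f for f in self.segment_files(segment) if f not in self.dir.files]
        if missing:
            raise CorruptIndexError(f"{self.name} commit is missing {missing}")
        self.dir.committed = segment


def run_scenario(check_lock_on_rollback):
    directory = SharedDirectory()
    ss1 = ShardServer("SS1", directory, check_lock_on_rollback)
    ss2 = ShardServer("SS2", directory, check_lock_on_rollback)

    ss1.open_shard()
    ss1.import_segment("_1")   # SS1 writes "_1" files, not yet committed
    ss2.open_shard()           # shard moves, SS2 now owns the lock
    ss2.import_segment("_1")   # SS2 rewrites the same "_1" file names
    try:
        ss1.commit("_1")       # SS1 notices the lost lock and rolls back
    except LockLostError:
        pass
    try:
        ss2.commit("_1")
    except CorruptIndexError:
        return "corrupt"
    return "committed"


if __name__ == "__main__":
    print("unguarded rollback:", run_scenario(False))
    print("guarded rollback:  ", run_scenario(True))
```

In this model the guarded variant simply leaves any uncommitted files in place, so cleaning
them up would fall to whichever server currently owns the lock.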


  was:
(identical to the description above, followed by this unfinished sentence)



The problem happened when the SS1 server 


> HDFSDirectory fencing issue
> ---------------------------
>
>                 Key: BLUR-439
>                 URL: https://issues.apache.org/jira/browse/BLUR-439
>             Project: Apache Blur
>          Issue Type: Bug
>          Components: Blur
>    Affects Versions: 0.2.4
>            Reporter: Aaron McCurry
>            Priority: Blocker
>             Fix For: 0.2.4
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
