lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Strange results when documents gets delete while iterating
Date Thu, 19 Nov 2015 15:03:58 GMT
On Thu, Nov 19, 2015 at 4:39 AM, Gerald Richter - ECOS Technology
<Gerald.Richter@ecos.de> wrote:
> Hi,
>
> It's a local IndexSearcher.
>
> I have done a lot of tests and it's really happening.
>
> Let me give you a little more details, maybe this helps:
>
> - I call a function that creates a new IndexSearcher and call
>   $hits = $searcher->hits.
> - I iterate over the first few entries and returns the entries and the $hits
> - The documents that were found are deleted from a database, which in turn
>   deletes the documents from the Lucy index.
> - Now I iterate over the next few entries and delete them and so on
>
> I have made small test where per iteration only two entries are fetch. The
> result looks like this:
>
>       id  => "8b8bce64e69b52ed244671009c11ee0e",
>       id  => "8b8bce64e69b52ed244671009c4857e7",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72584b68",
>       id  => "8b8bce64e69b52ed244671009c730dc9",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72584d19",
>       id  => "8b8bce64e69b52ed244671009c7e3974",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72585475",
>       id  => "8b8bce64e69b52ed244671009c7e4788",
>       id  => "4a3dcd6c2e9e3074d2d52b8e72585dc2",
>       id  => "8b8bce64e69b52ed244671009c7e2fa6",
>
> id is some value I store in the document. The result should only contain ids
> starting with 8.
>
> So you see the first two are correct, after deletion of this two (always in a
> different process), the next time, the first one I get is wrong the second
> one is correct...
>
> If I do not delete anything I only get the right entries (just commented out
> one line the rest is still the same).
>
> Any clue?

When documents in an old segment are marked as deleted, that information is
recorded in a bitmap deletions file, which is written to a new segment.  Old
readers are not supposed to know about new segments.  So for something to go
wrong, either 1) information in an old segment would have to be corrupted, 2)
a reader would have to somehow find out about information in a new segment, or
3) something else unrelated would have to be going on.

Indexers write index data (including new deletions data referencing documents
in old segments) to temp files in a new segment, which are then consolidated
into a single per-segment "compound file" named "cf.dat".  When a reader
opens, it mmaps cf.dat for each segment in the snapshot.  Once the reader
successfully opens all the files it needs, it never goes looking for new
files.
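
To make the snapshot behavior concrete, here is a toy model in Python.  It is
not Lucy code: the ToyIndex and ToyReader classes and their methods are
invented for illustration, and segments are plain tuples rather than mmapped
files.  It only sketches the rule described above, assuming that commits
append new segments and that a reader pins whatever segments exist when it
opens.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for an index directory: a list of segments."""

    # Each segment is (docs, deletions): the doc ids added by that segment,
    # and the doc ids (possibly from older segments) it marks as deleted.
    segments: list = field(default_factory=list)

    def commit(self, new_docs=(), deleted_ids=()):
        # A commit only appends a new segment; old segments are never edited.
        self.segments.append((tuple(new_docs), frozenset(deleted_ids)))


class ToyReader:
    """Hypothetical reader that pins the segments present when it opened."""

    def __init__(self, index):
        # Copy the segment list once and never consult index.segments again.
        self._snapshot = tuple(index.segments)

    def live_ids(self):
        # Collect deletions from the pinned segments only.
        deleted = set()
        for _, dels in self._snapshot:
            deleted |= dels
        return [d for docs, _ in self._snapshot for d in docs if d not in deleted]


index = ToyIndex()
index.commit(new_docs=["a", "b", "c", "d"])

old_reader = ToyReader(index)          # opened before the deletion
index.commit(deleted_ids=["a", "b"])   # deletions land in a new segment
new_reader = ToyReader(index)          # opened after the deletion

print(old_reader.live_ids())  # ['a', 'b', 'c', 'd']
print(new_reader.live_ids())  # ['c', 'd']
```

In this model the old reader keeps returning all four documents after the
deletion, and only a reader opened afterwards sees the deletions.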

It's hard to imagine a mechanism that would either cause an existing "cf.dat"
file to be modified, or persuade a reader to go look at a new "cf.dat"
file.  So unless my reasoning is wrong, the cause is #3 -- something else
unrelated.  I really have no idea what that could be, though.  Since you've
previously asked some questions about Coro/AnyEvent and other concurrency
stuff, the most likely prospect would seem to be something unique to your
setup.

The next step is probably to take the behavior you've been able to reproduce
and isolate it in a test case that others can run and analyze.

Marvin Humphrey
