lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: updateDocument (sometimes) no longer deleting documents after Update to 4.6
Date Mon, 24 Feb 2014 19:42:06 GMT
Hi,

it looks like your filters are implemented incorrectly:

- First, in Lucene 3 and 4, filters are applied per segment: they have to calculate
the DocIdSet of matching documents for each index segment separately. On an update, the
document is "deleted" (hidden) in the old segment and re-added to a new index segment.
This is why you see it twice in the filter.
- Second, in Lucene 4, filters now get a (Bits acceptDocs) parameter in their getDocIdSet
method. This is new: previously, deleted documents were filtered out *after* the filters
were applied; now this happens together with the filters. If acceptDocs is non-null, it
marks the "hidden" deleted documents. If your filter does not apply those accept docs
correctly to the returned DocIdSet, deleted documents suddenly reappear. In Lucene 4,
deletion is just an additional filter applied while searching: a filter that marks the
still-accessible documents and hides all deleted ones. If your own filter does not chain
in this additional filter, the deletions are ignored. A quick fix is to use "return
BitsFilteredDocIdSet.wrap(yourFilterBitSet, acceptDocs)" instead of "return
yourFilterBitSet".
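
To illustrate the idea, here is a minimal sketch using plain java.util.BitSet in place
of Lucene's Bits/DocIdSet (the class and method names are hypothetical, not Lucene API):
the documents matched by a filter must additionally pass acceptDocs, otherwise deleted
documents reappear in the results. This is effectively what BitsFilteredDocIdSet.wrap
does for you.

```java
import java.util.BitSet;

public class AcceptDocsSketch {
    // acceptDocs == null means "no deletions in this segment": keep all matches.
    static BitSet applyAcceptDocs(BitSet filterMatches, BitSet acceptDocs) {
        BitSet result = (BitSet) filterMatches.clone();
        if (acceptDocs != null) {
            result.and(acceptDocs); // hide documents deleted in this segment
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet filterMatches = new BitSet();
        filterMatches.set(0); // old (deleted) copy of the document
        filterMatches.set(3); // current copy, matched by the same filter

        BitSet acceptDocs = new BitSet();
        acceptDocs.set(1, 4); // doc 0 is deleted, docs 1-3 are live

        BitSet visible = applyAcceptDocs(filterMatches, acceptDocs);
        System.out.println(visible); // {3} -- the deleted duplicate is hidden
    }
}
```

A filter that returns filterMatches without the intersection would report both doc 0
and doc 3, which is exactly the "document found twice" symptom described below.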

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: nospam@kaigrabfelder.de [mailto:nospam@kaigrabfelder.de]
> Sent: Monday, February 24, 2014 7:14 PM
> To: java-user@lucene.apache.org
> Subject: Re: updateDocument (sometimes) no longer deleting documents
> after Update to 4.6
> 
> Hm, it looks like this is somehow caused by the filters we are using for
> searching.
> 
> I took one of the MY_UNIQUE_BUSINESS_ID values used in our application's
> search functionality and debugged the Lucene search a little more. If I specify
> null for the filters, I only get one result (which is correct).
> If I add the two filters that we usually use in our application, I notice that the
> filters are triggered twice - for two different segments - and the result is
> contained in both segments. It looks like the first segment contains all
> documents in the index, while the second segment contains only one - the
> document that should have been deleted beforehand.
> 
> This can be reproduced even after restarting the application, and even after
> indexWriter.commit() is triggered.
> 
> Could this be a bug? Or is this the desired behaviour?
> 
> Best Regards
> 
> Kai
> 
> 
> Am 2014-02-24 13:54, schrieb nospam@kaigrabfelder.de:
> > I'll see if I can dig a little deeper into the 3.6 behavior; for
> > now I'm trying to get it running on 4.6 (as the index file is also a
> > lot smaller - on 3.6 it was about 2 GB for about 9000 documents, with
> > 4.6 it's only about 200 MB).
> >
> > And yes, the business ID is indexed - otherwise I wouldn't be able to
> > find it at all. The problem is not that I can't find it, but that I find
> > it twice. And to make matters worse, not consistently all the time but
> > only sometimes. Somehow it looks like the delete (before the update)
> > sometimes works and sometimes doesn't. Do you have any idea why
> > this could happen? Maybe something related to the MergePolicy (which we
> > don't set, i.e. we are using the default)?
> >
> > Best Regards
> >
> > Kai
> >
> >
> > Am 2014-02-24 12:10, schrieb Michael McCandless:
> >> The 30 second turnaround time in 3.6.x is absurd; if you turn on
> >> IndexWriter's infoStream maybe it'd give a clue.  Or, capture a few
> >> stack traces and post them.
> >>
> >> How are you creating the luceneDocumentToIndex?  You must ensure that
> >> the business ID is in fact indexed as a field in the document,
> >> otherwise the update won't find it.
> >>
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Mon, Feb 24, 2014 at 5:33 AM,  <nospam@kaigrabfelder.de> wrote:
> >>> Hi there,
> >>>
> >>> we recently updated our application from Lucene 3.0 to 3.6, with the
> >>> effect that (albeit using the SearcherManager functionality as
> >>> described on
> >>> http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html)
> >>> calls to searcherManager.maybeRefresh() were incredibly slow, e.g.
> >>> taking about 30 seconds after adding one document to an index of
> >>> about 9000 documents. I assumed that we did something wrong with the
> >>> configuration, as 30 seconds cannot be what is meant by NRT ;-)
> >>>
> >>> Thus we migrated to the latest 4.6 version, and indexing speed was
> >>> indeed very good (with the searcherManager.maybeRefreshBlocking()
> >>> call only taking milliseconds to complete). But after some more
> >>> testing we discovered that somehow the
> >>> indexWriter.updateDocument( term, documentToIndex ) functionality
> >>> wasn't working as expected anymore - at least sometimes. It looks
> >>> like the updateDocument method no longer reliably deletes the old
> >>> document before adding a new one, with the result that older
> >>> documents are being returned by searches, breaking our application.
> >>>
> >>> Unfortunately I'm not able to reproduce the issue in a simple unit
> >>> test, but maybe one of the Lucene experts knows what we are doing
> >>> wrong here. Not sure if it is of any relevance, but we are running
> >>> on Windows with a 64-bit JDK 7, thus MMapDirectory is being used.
> >>>
> >>> Our Index Writer is configured like this:
> >>>
> >>>         IndexWriterConfig conf = new IndexWriterConfig( Version.LUCENE_46,
> >>>                 new LimitTokenCountAnalyzer( new DefaultAnalyzer(), Integer.MAX_VALUE ) );
> >>>
> >>>         conf.setOpenMode( OpenMode.APPEND );
> >>>
> >>>         IndexWriter indexWriter = new IndexWriter(
> >>>                 FSDirectory.open( new File( directoryPath ) ), conf );
> >>>
> >>> SearcherManager is configured like this:
> >>>
> >>>         searcherManager = new SearcherManager(indexWriter, true,
> >>> null);
> >>>
> >>> // The analyzer that we are using looks like this:
> >>>
> >>>         public class DefaultAnalyzer extends Analyzer
> >>>         {
> >>>             @Override
> >>>             protected TokenStreamComponents createComponents(final String fieldName,
> >>>                     final Reader reader) {
> >>>                 return new TokenStreamComponents(new WhitespaceTokenizer(
> >>>                         LuceneSearchService.LUCENE_VERSION, reader));
> >>>             }
> >>>         }
> >>>
> >>> The update of the index looks like this:
> >>>
> >>>         // instead of 42 the unique business identifier is used
> >>>         Long myUniqueBusinessId = 42L;
> >>>         BytesRef ref = new BytesRef(NumericUtils.BUF_SIZE_LONG);
> >>>         NumericUtils.longToPrefixCoded( myUniqueBusinessId.longValue(), 0, ref );
> >>>         Term term = new Term( "MY_UNIQUE_BUSINESS_ID", ref );
> >>>
> >>>         // this method may be called multiple times with the same
> >>>         // term and luceneDocumentToIndex parameters
> >>>         indexWriter.updateDocument( term, luceneDocumentToIndex );
> >>>
> >>>         // After performing a couple of updates we execute
> >>>         searcherManager.maybeRefreshBlocking();
> >>>
> >>>
> >>> // For searching we are using the following code
> >>>         searcher = searcherManager.acquire();
> >>>         // luceneQuery is the query, filter is some filtering that we apply,
> >>>         // luceneSort is some sort order
> >>>         TopDocs topDocs = searcher.search( luceneQuery, filter, 1000, luceneSort );
> >>>
> >>> // If we perform a query for MY_UNIQUE_BUSINESS_ID it will return
> >>> // multiple results instead of just one - this was the case with
> >>> // neither Lucene 3.0 nor 3.6
> >>>
> >>>
> >>> In order to fix the issue I tried a couple of things, but to no
> >>> avail. It still happens (not all the time, though) that Lucene
> >>> returns two documents instead of just one when querying for
> >>> MY_UNIQUE_BUSINESS_ID:
> >>> - setting setMaxBufferedDeleteTerms to 1 in the config:
> >>>         conf.setMaxBufferedDeleteTerms( 1 );
> >>> - explicitly deleting instead of just updating:
> >>>         indexWriter.deleteDocuments( term );
> >>> - ensuring that the field MY_UNIQUE_BUSINESS_ID is stored in the
> >>>   index and not just analyzed
> >>> - trying to delete the document via indexWriter.tryDeleteDocument()
> >>> - calling indexWriter.maybeMerge() after the update
> >>> - calling indexWriter.commit() after the update
> >>>
> >>>
> >>> Sorry for the lengthy post, but I wanted to include as much
> >>> information as possible. Let me know if something is missing...
> >>>
> >>> Thanks for helping in advance ;-)
> >>>
> >>> Kai
> >>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>
> >>
> >>
> >
> >
> >
> >
> 



