lucene-solr-user mailing list archives

From Clemens Wyss DEV <clemens...@mysign.ch>
Subject AW: AW: AW: SolrClient#updateByQuery?
Date Mon, 29 Jan 2018 07:08:11 GMT
Yet again: thanks a lot!

Spellchecking in Solr:
what are the best (up-to-date) sources/links for spellchecking and suggestions?

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Sunday, 28 January 2018 18:23
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: AW: AW: SolrClient#updateByQuery?

bq: I am still getting "suggestions" (from spellcheck.q)

OK, this is actually expected behavior. The spellcheck is done from the _indexed_ terms. Documents
deleted from the index are only marked as deleted; the associated terms are not purged from the
index until the segment is merged. When just checking the terms for spellcheck, there's no
good way to figure out that a term belongs to a deleted doc.

Your expungeDeletes "fix" really wouldn't have actually fixed your problem in any kind of
production environment. ExpungeDeletes merges segments with > n% deleted docs. It only
fixed your test case because, I suspect, you have very few documents (perhaps only one) in
your segment, so it was merged away. In a situation where you had, say,
10,000 docs in a segments and you deleted the one (and only) document with some term, expungeDeletes
would skip merging the segment and spellcheck would still have returned the suggestion.
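
For reference, expungeDeletes is passed as a parameter on a commit rather than as a standalone operation. A minimal SolrJ sketch (the SolrClient instance and the collection name "mycollection" are placeholders, not anything from this thread):

    // Sketch only: commit and ask Solr to merge away segments whose deleted-doc
    // ratio exceeds the configured threshold. This does NOT guarantee that every
    // deleted document's terms are purged.
    // UpdateRequest is org.apache.solr.client.solrj.request.UpdateRequest.
    UpdateRequest commitRequest = new UpdateRequest();
    commitRequest.setParam("commit", "true");
    commitRequest.setParam("expungeDeletes", "true");
    commitRequest.process(solrClient, "mycollection"); // solrClient: an existing SolrClient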

Optimize, on the other hand, unconditionally rewrites all segments into a single segment, so
that is what removed the indexed term. As discussed, optimize is a _very_ expensive operation
and, unless you're able to optimize after every indexing session, it will not scale. The situations
where I've seen this be acceptable are ones in which the index changes rarely, for example when
you update your index once a day. If you continually update your index, optimizing will actually
make this problem worse between optimizations; see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

At a higher level, you're expending a lot of effort to handle the case where a document is
deleted and it's the last "live" doc in your entire corpus that contains a term. For a decent-sized
corpus this will be quite rare, so people often simply don't worry about it. The scenario
in your test case is somewhat artificial and makes the case seem more likely than it will probably
be "in the real world".

Consider setting spellcheck's thresholdTokenFrequency to some value.
That parameter's primary purpose is to handle situations where words are misspelled in the
documents, so those misspellings are not suggested, but I think it would cover this situation
too. Unfortunately, it will not work very well in a simple test setup either. Let's say you
set it to 2%. You index 100 documents and 3 of them contain the term, so
it now shows up in your spellcheck test. Now you delete two of them (without merging any segments).
Because the deleted occurrences haven't been purged, the term frequency is _still_ 3%, so the term
may still be found after you delete and commit.

I suppose you could structure your test this way (a rough sketch follows below):
index 100 docs, 3 of them containing a specific term.
Set your threshold to 2%.
Check that the term is suggested.
Index 100 more docs.
Check that the term is _not_ suggested.
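
A rough SolrJ sketch of that outline (not a drop-in test): it assumes a running collection named "test", the _my_suggest_phrase field and the suggest_phrase_fuzzy spellchecker from the configuration quoted further down, a request handler with the spellcheck component attached, and a threshold already configured to represent 2%. All literal values are placeholders.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;
    import java.util.UUID;

    void indexDocs(SolrClient client, int count, String text) throws Exception {
        for (int i = 0; i < count; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", UUID.randomUUID().toString());
            doc.addField("_my_suggest_phrase", text);
            client.add("test", doc);
        }
        client.commit("test"); // buildOnCommit=true should rebuild the suggester here
    }

    boolean isSuggested(SolrClient client, String misspelled) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.set("spellcheck", "true");
        q.set("spellcheck.q", misspelled);
        q.set("spellcheck.dictionary", "suggest_phrase_fuzzy");
        QueryResponse rsp = client.query("test", q);
        return rsp.getSpellCheckResponse() != null
                && !rsp.getSpellCheckResponse().getSuggestions().isEmpty();
    }

    // Test outline:
    // indexDocs(client, 97, "filler text");  // 97 docs without the term
    // indexDocs(client, 3, "zebra");         // 3 of 100 docs contain the term -> 3% >= 2%
    // assert isSuggested(client, "zebre");   // the misspelling should yield a suggestion
    // indexDocs(client, 100, "filler text"); // now 3 of 200 docs -> 1.5% < 2%
    // assert !isSuggested(client, "zebre");  // the term should no longer be suggested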

Best,
Erick

On Sun, Jan 28, 2018 at 7:24 AM, Clemens Wyss DEV <clemensdev@mysign.ch> wrote:
> I must clarify a few things:
> The unittest I mentioned does not check/perform a DBQ but a "simple" deleteById.
> The deleted document is no longer found (as expected), BUT I am still getting "suggestions"
> (from spellcheck.q). So my problem is not that I find deleted documents, but that I get suggestions
> resulting from the deleted document.
>
> The suggestions configuration is as follows:
> <searchComponent name="suggest_phrase" class="solr.SpellCheckComponent">
>     <lst name="spellchecker">
>         <str name="name">suggest_phrase_fuzzy</str>
>         <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
>         <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory</str>
>         <str name="allTermsRequired">true</str>
>         <str name="maxEdits">2</str>
>         <str name="ignoreCase">true</str>
>         <str name="field">_my_suggest_phrase</str>
>         <str name="suggestAnalyzerFieldType">string</str> <!-- suggest_phrase -->
>         <!-- <str name="storeDir">suggest_phrase_fuzzy</str> -->
>         <str name="buildOnOptimize">false</str>
>         <str name="buildOnStartup">false</str> <!-- ?? -->
>         <str name="buildOnCommit">true</str>
>     </lst>
> </searchComponent>
>
> Most importantly: "buildOnCommit"->true.
>
> Hence the question is:
> What (which kind of commit?) do I need to issue after
>     solrClient.deleteById( toBeDeletedDocumentIDs );
>
> for the suggestions to be up to date too (without a heavy commit/optimize)?
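
For what it's worth, an explicit commit after the delete would look roughly like the sketch below ("mycollection" is a placeholder; toBeDeletedDocumentIDs is the list from above). Note that, as explained in the newer reply above, the suggester is rebuilt from the _indexed_ terms, so terms from the deleted document can still be suggested until their segments are merged away.

    // Sketch: delete by id, then commit so a new searcher is opened and
    // buildOnCommit=true triggers a rebuild of the suggester.
    solrClient.deleteById("mycollection", toBeDeletedDocumentIDs);
    solrClient.commit("mycollection"); // waitFlush=true, waitSearcher=true by default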
>
> thx and sorry for the misunderstandings
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Saturday, 27 January 2018 18:20
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: AW: AW: SolrClient#updateByQuery?
>
> Clemens:
>
> Let's not raise a JIRA quite yet. I am 99% sure your test is not doing what you think,
> or you have some invalid expectations. This is such a fundamental feature that it'd surprise
> me a _lot_ if it were a bug.
> Also, there are a bunch of DeleteByQuery tests among the junit tests that run all the
> time.
>
> Wait, are you issuing an explicit commit or not? I saw this phrase
> "...brutely by forcing a commit (with "expunge deletes")..." and saw
> the word "commit" and assumed you were issuing a commit, but
> re-reading, that's not clear at all. The code should look something like:
>
> update via delete-by-query
> solrClient.commit();
> query to see if the doc is gone
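
Concretely, that sequence might look like the following sketch (the collection name and query are placeholders; SolrQuery and QueryResponse are from org.apache.solr.client.solrj):

    // Sketch of the sequence above: delete by query, hard commit (opens a new
    // searcher by default), then verify the matching docs are gone.
    solrClient.deleteByQuery("mycollection", "category:obsolete");   // hypothetical query
    solrClient.commit("mycollection");                               // waitFlush/waitSearcher default to true
    QueryResponse rsp = solrClient.query("mycollection", new SolrQuery("category:obsolete"));
    assert rsp.getResults().getNumFound() == 0;                      // deleted docs should no longer match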
>
> So here's what I'd try next:
>
> 1> Issue an explicit commit command (SolrClient.commit()) after the
> DBQ. The defaults there are openSearcher=true and waitSearcher=true. When that returns,
> _then_ issue your query.
> 2> If that doesn't work, try (just for information gathering) waiting
> several seconds after the commit before trying your request. This should _not_ be necessary,
> but it'll give us a clue about what's going on.
> 3> Show us the code if you can.
>
> Best,
> Erick
>
>
> On Sat, Jan 27, 2018 at 6:55 AM, Clemens Wyss DEV <clemensdev@mysign.ch> wrote:
>> Erick said/wrote:
>>> If you commit after docs are deleted and _still_ see them in search 
>>> results, that's a JIRA
>> should I JIRA it?
>>
>> -----Original Message-----
>> From: Shawn Heisey [mailto:apache@elyograg.org]
>> Sent: Saturday, 27 January 2018 12:05
>> To: solr-user@lucene.apache.org
>> Subject: Re: AW: AW: SolrClient#updateByQuery?
>>
>> On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
>>> Thanks for all these (main contributor's 😉) valuable inputs!
>>>
>>> The first thing I did was getting rid of "expungeDeletes". My
>>> "single-deletion" unittest failed until I added the optimize param
>>>> updateRequest.setParam( "optimize", "true" );
>>> Does this make sense, or should I JIRA it?
>>> How expensive is this "optimization"?
>>
>> An optimize operation is a complete rewrite of the entire index down to one segment.
>> It will typically double the size of the index. The rewritten index will not contain any of the
>> documents that were deleted. It's slow and extremely expensive. If the index is one gigabyte,
>> expect an optimize to take at least half an hour, possibly longer, to complete.
>> The CPU and disk I/O are going to take a beating while the optimize is occurring.
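
For completeness, a forced merge can also be issued directly from SolrJ rather than via the optimize update parameter used above; a minimal sketch (the SolrClient instance and collection name are placeholders), with the same cost caveats:

    // Sketch: rewrite the entire index down to a single segment.
    solrClient.optimize("mycollection"); // waitFlush=true, waitSearcher=true by default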
>>
>> Thanks,
>> Shawn