lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Enabling/disabling docValues
Date Mon, 10 Jun 2019 17:54:56 GMT
bq. Does lucene look at %docs in each state, or the first doc or something else?

Frankly I don’t care, since no matter what, the results of faceting on mixed definitions are
not useful.

tl;dr;

‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means just what
I choose it to mean — neither more nor less.’

So “undefined” in this case means “I don’t see any value at all in chasing that info
down” ;).

Changing from regular text to SortableText means that the results will be inaccurate no matter
what. For example, I have a doc with the value “my dog has fleas”. When NOT using SortableText,
there are multiple tokens so facet counts would be:

my (1)
dog (1)
has (1)
fleas (1)

But for SortableText they will be:

my dog has fleas (1)
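
To make the difference concrete, here is a toy Python sketch. It is only an illustration, not how Lucene or Solr compute facets: the function names are made up, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

def facet_tokenized(values):
    """Toy model of faceting on a tokenized text field:
    each distinct token counts once per document."""
    counts = Counter()
    for value in values:
        counts.update(set(value.split()))
    return counts

def facet_whole_value(values):
    """Toy model of faceting on a SortableText-style field:
    the whole field value is the single term for the document."""
    return Counter(values)

docs = ["my dog has fleas"]
tokenized = facet_tokenized(docs)
whole = facet_whole_value(docs)
print(dict(tokenized))
print(dict(whole))
```

The first result has four buckets of one document each, the second has a single bucket holding the whole string.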

Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”. doc1 was
indexed before switching to SortableText and doc2 after. Presumably the output you want is:

my dog has fleas (1)
my cat has fleas (1)

But you can’t get that output. There are three possible cases:

1> Lucene treats all documents as SortableText, faceting only on the docValues. doc1 has no
docValues, so it contributes no facets:

my cat has fleas (1)

2> Lucene treats all documents as tokenized, faceting on each individual token. Faceting
is performed on the tokenized content of both docs, and the docValues in doc2 are ignored:

my (2)
dog (1)
has (2)
fleas (2)
cat (1)


3> Lucene does the best it can, faceting on the tokens for docs indexed without SortableText and
on the docValues for docs indexed with SortableText. doc1 is faceted on its tokens, doc2 on its docValues:

my (1)
dog (1)
has (1)
fleas (1)
my cat has fleas (1)
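
The three cases can be reproduced with a small Python simulation. This is a sketch under stated assumptions: the `index` list, the `has_dv` flag and the three functions are hypothetical stand-ins for the possible behaviours, and say nothing about which one Lucene actually implements.

```python
from collections import Counter

# Each entry: (field value, was the doc indexed with docValues?)
index = [
    ("my dog has fleas", False),  # doc1: indexed as plain tokenized text
    ("my cat has fleas", True),   # doc2: indexed as SortableText
]

def tokens(value):
    # Assumption: whitespace tokenization, one count per doc per token.
    return set(value.split())

def case1_docvalues_only(index):
    """Facet only on docValues; docs without them contribute nothing."""
    return Counter(value for value, has_dv in index if has_dv)

def case2_tokens_only(index):
    """Facet on tokens for every doc; docValues are ignored."""
    counts = Counter()
    for value, _has_dv in index:
        counts.update(tokens(value))
    return counts

def case3_mixed(index):
    """Facet on docValues where present, on tokens otherwise."""
    counts = Counter()
    for value, has_dv in index:
        if has_dv:
            counts[value] += 1
        else:
            counts.update(tokens(value))
    return counts

# The output one would presumably want after the schema change.
wanted = Counter({"my dog has fleas": 1, "my cat has fleas": 1})

results = {
    "case 1": case1_docvalues_only(index),
    "case 2": case2_tokens_only(index),
    "case 3": case3_mixed(index),
}
for name, counts in results.items():
    print(name, dict(counts), "matches wanted:", counts == wanted)
```

None of the three strategies yields the wanted counts, because the whole-value term for doc1 was never stored anywhere.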

Since none of those cases is what I want, there’s no point I can see in chasing down what
actually happens….

Best,
Erick

P.S. I _think_ Lucene tries to use the definition from the first segment, but the code that
picks the list of segments to be merged doesn’t look at the field definitions at all. So whether
the first segment in the list has SortableText or not will not be predictable in a general
way, even within a single run.
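
For reference, the re-index-then-CREATEALIAS route recommended in the quoted reply below comes down to one Collections API request once the new collection is fully indexed. A minimal sketch of building that request URL in Python, assuming a hypothetical Solr base URL, alias name and collection name (none of these come from the thread), and without actually sending anything:

```python
from urllib.parse import urlencode

def create_alias_url(solr_base, alias, collection):
    """Build a Collections API CREATEALIAS URL that points `alias`
    at `collection`. Only builds the string; it sends no request."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": collection,
    }
    return f"{solr_base}/admin/collections?{urlencode(params)}"

# Hypothetical values: clients query "products", which is switched
# to the freshly re-indexed "products_v2" collection.
url = create_alias_url("http://localhost:8983/solr", "products", "products_v2")
print(url)
```

Clients that query the alias rather than a concrete collection name do not need to change when the alias is repointed.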


> On Jun 9, 2019, at 6:53 PM, John Davis <johndavis925254@gmail.com> wrote:
> 
> Understood, however code is rarely random/undefined. Does lucene look at %
> docs in each state, or the first doc or something else?
> 
> On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson <erickerickson@gmail.com>
> wrote:
> 
>> It’s basically undefined. When segments are merged that have dissimilar
>> definitions like this what can Lucene do? Consider:
>> 
>> Faceting on a text (not sortable) means that each individual token in the
>> index is uninverted on the Java heap and the facets are computed for each
>> individual term.
>> 
>> Faceting on a SortableText field just has a single term per document, and
>> that in the docValues structures as opposed to the inverted index.
>> 
>> Now you change the value and start indexing. At some point a segment
>> containing no docValues is merged with a segment containing docValues for
>> the field. The resulting mixed segment is in this state. If you facet on
>> the field, should the docs without docValues have each individual term
>> counted? Or just the SortableText values in the docValues structure?
>> Neither one is right.
>> 
>> Also remember that Lucene has no notion of schema. That’s entirely imposed
>> on Lucene by Solr carefully constructing low-level analysis chains.
>> 
>> So I’d _strongly_ recommend you re-index your corpus to a new collection
>> with the current definition, then perhaps use CREATEALIAS to seamlessly
>> switch.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 9, 2019, at 12:50 PM, John Davis <johndavis925254@gmail.com> wrote:
>>> 
>>> Hi there,
>>> We recently changed a field from TextField + no docValues to
>>> SortableTextField which has docValues enabled by default. Once I did this I
>>> do not see any facet values for the field. I know that once all the docs
>>> are re-indexed facets should work again, however can someone clarify the
>>> current logic of lucene/solr how facets will be computed when schema is
>>> changed from no docValues to docValues and vice-versa?
>>> 
>>> 1. Until ALL the docs are re-indexed, no facets will be returned?
>>> 2. Once certain fraction of docs are re-indexed, those facets will be
>>> returned?
>>> 3. Something else?
>>> 
>>> 
>>> Varun
>> 
>> 

