lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Empty rows from /export?
Date Fri, 31 May 2019 22:08:38 GMT
docValues are indeed realized in Lucene. It’s just that Lucene has no notion of a “schema”.
So when you define the schema, Solr carefully constructs the appropriate low-level Lucene
calls at index time to honor all of the options you’ve specified in the schema, things like
stored, indexed, docValues etc.

Now we get to optimize. All Solr does is tell Lucene to mash together all the segments and
Lucene does its tricks. Lucene assumes it “knows” everything it needs to know by what’s
already in the segments it’s merging without reference to Solr’s schema. Therein lies
the rub. If one segment has docValues for a field and another segment doesn’t, the result
is “interesting”. In general, Lucene can’t reconstruct the original data.
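
A rough way to gauge how much of a core is in that mixed state is to count the empty rows in
an /export response. This is only a minimal sketch: count_empty_docs is a made-up helper, the
sample payload is invented, and the "response" wrapper is an assumption about the JSON shape,
so the code also accepts the bare form quoted further down in this thread.

```python
import json


def count_empty_docs(payload):
    """Return (empty, total) for the docs in an /export-style JSON string."""
    data = json.loads(payload)
    # Assumption: docs sit either under "response" or at the top level.
    body = data.get("response", data)
    docs = body.get("docs", [])
    # An empty dict means no requested field could be read for that doc.
    empty = sum(1 for doc in docs if not doc)
    return empty, len(docs)


# Invented sample, shaped like the output quoted in this thread.
sample = '{"response":{"numFound":4,"docs":[{},{},{},{"id":"a1"}]}}'
empty, total = count_empty_docs(sample)
print(f"{empty} of {total} docs are empty")
```

A high share of empty rows would be consistent with segments that were written before
docValues was turned on for the field, but it does not prove it.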

From Robert Muir:
“I think the key issue here is Lucene is an index not a database. Because it is a lossy
index and does not retain all of the user's data, its not possible to safely migrate some
things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index")
and compute stats to get back the value, so it can be re-encoded. The function is y = f(x)
and if x is not available its not possible, so lucene can't do it.”

DocValues is a special case because all the data necessary to build docValues is already in
the index, i.e. the indexed data (assuming you originally put it in with indexed=true). But
rebuilding them requires extra effort, thus the UninvertDocValuesMergePolicyFactory.
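
To make “the data is already in the index” concrete, here is a toy model of uninverting. It
is not Lucene’s code and not what UninvertDocValuesMergePolicyFactory literally does; the
uninvert function and the sample postings are made up, and it assumes a single-valued,
untokenized field such as an id.

```python
def uninvert(postings, max_doc):
    """Turn term -> [doc ids] into a per-document column (doc id -> term)."""
    # Docs that never appear in any posting list keep None (no value).
    column = [None] * max_doc
    for term, doc_ids in postings.items():
        for doc_id in doc_ids:
            column[doc_id] = term
    return column


# Invented postings for an indexed id field: each term points at one doc.
postings = {"uuid-a": [2], "uuid-b": [0], "uuid-c": [1]}
print(uninvert(postings, 3))  # ['uuid-b', 'uuid-c', 'uuid-a']
```

The inverted form maps terms to docs and the column form maps docs to values, so for a field
like this one can be derived from the other. That is the difference from the norms case above,
where the original x is simply gone.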

>> I was curious if it
>> was safe to change the id field to docValues without reindexing

I’d be very reluctant. It’s not something that’s explicitly tested or supported, so
there are likely edge cases.

Best,
Erick

> On May 31, 2019, at 2:02 PM, David Hastings <hastings.recursive@gmail.com> wrote:
> 
>> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
> 
> i was under the impression docValues are in lucene, and he is just saying
> that an optimize is not a re-index, its just taking the actual files that
> already exist in your index and arranging them and removing deletions, an
> optimize doesnt re-read the schema and re-index content
> 
> On Fri, May 31, 2019 at 1:59 PM Walter Underwood <wunder@wunderwood.org>
> wrote:
> 
>> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
>> 
>> That actually answers a question I had not asked yet. I was curious if it
>> was safe to change the id field to docValues without reindexing if we never
>> sorted on it. It looks like fetching the value won’t work until everything
>> is reindexed.
>> 
>> It seems like this would be a useful thing to have supported, migrating a
>> field to docValues.
>> 
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On May 31, 2019, at 5:00 AM, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>>> 
>>> bq. but I optimized all the cores, which should rewrite every segment as
>> docValues.
>>> 
>>> Not true. Optimize is a Lucene level force merge. Dealing with segments,
>> i.e. merging and the like, is a low-level Lucene operation and Lucene has
>> no notion of a schema. So a change you made to the schema is irrelevant to
>> merging.
>>> 
>>> You have to have something at the Solr level that does some magic for
>> this to work. Take a look at UninvertDocValuesMergePolicyFactory if you
>> have Solr 7.0 or later. WARNING: I haven’t used that personally, and I do
>> not know what the behavior would be on an index that is “mixed”, i.e. one
>> that already has segments with some docs having DV entries and some not.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On May 31, 2019, at 12:35 AM, Walter Underwood <wunder@wunderwood.org>
>> wrote:
>>>> 
>>>> That field was changed to docValues, but I optimized all the cores,
>> which should rewrite every segment as docValues.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On May 30, 2019, at 7:37 PM, Erick Erickson <erickerickson@gmail.com>
>> wrote:
>>>>> 
>>>>> This is odd. The only reason I know of that would happen is if there
>> were no docValues for that field in those documents. By any chance were
>> docValues added to an existing index without totally reindexing into a new
>> collection?
>>>>> 
>>>>> What happens if you just query the collection rather than the
>> individual core? I’m thinking using a streaming expression as a check…..
>>>>> 
>>>>>> On May 30, 2019, at 6:41 PM, Walter Underwood <wunder@wunderwood.org>
>> wrote:
>>>>>> 
>>>>>> 3/4 of the documents I’m getting back from /export are empty. This
>> collection has four shards, so I’m querying the leader core on each shard
>> with /export. The results start like this:
>>>>>> 
>>>>>> 
>> {"numFound":912370,"docs":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},
>>>>>> 
>>>>>> The final 1/4 of the results have UUIDs (the ID type). The id field
>> is stored as docValues. This is the URL.
>>>>>> 
>>>>>> 
>> http://hostname:8983/solr/decks_shard1_replica1/export?q=id:*&distrib=false&shards=shard1&fl=id&sort=id+asc
>>>>>> 
>>>>>> Running 6.6.2, Solr Cloud. The total number of non-null ids from all
>> four shards is a bit less than 1/4 of the document count.
>>>>>> 
>>>>>> Any ideas about what is going on?
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wunder@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 

