lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Confusing DocValues documentation
Date Fri, 22 Dec 2017 19:20:20 GMT
Your questions are already more or less answered:
> 1) If the docValues are that good, can we git rid of the stored values
> altogether?
You can if you want - just configure your field with stored=“false” and docValues=“true”.
Note that you can do that only if:
* field is not analyzed (you cannot enable docValues for analyzed field)
* you do not care about order of your values

> 2) And why the docValues are not enabled by default for multi-valued fields?
Because it is overhead when it comes to indexing and it is not used in all cases - only if
field is used for faceting, sorting or in functions.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Dec 2017, at 19:51, Tech Id <tech.login.id2@gmail.com> wrote:
> 
> Very interesting discussion SG and Erick.
> I wish these details were part of the official Solr documentation as well.
> And yes, "columnar format" did not give any useful information to me either.
> 
> 
> "A good explanation increases contributions to the project as more people
> become empowered to improvise."
>   - Self, LOL
> 
> 
> I was expecting the sorting, faceting, pivoting to a bit more optimized for
> docValues, something like a pre-calculated bit of information.
> However, now it seems that the major benefit of docValues is to optimize
> the lookup time of stored fields.
> Here is the sorting function I wrote as pseudo-code from the discussion:
> 
> 
> int docIDs[] = filterDocsOnQuery (query);
> T docValues[] = loadDocValues (sortField);
> TreeMap<T, int> sortFieldValues[] = new TreeMap<>();
> for (int docId : docIDs) {
>    T val = docValues[docId];
>    sortFieldValues.put(val, docId);
> }
> // return docIDs sorted by value
> return sortFieldValues.values;
> 
> 
> It is indeed difficult to pre-compute the sorts and facets because we do
> not know what docIDs will be returned by the filtering.
> 
> Two last questions I have are:
> 1) If the docValues are that good, can we git rid of the stored values
> altogether?
> 2) And why the docValues are not enabled by default for multi-valued fields?
> 
> 
> -T
> 
> 
> 
> 
> On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
> 
>> OK, last bit of the tutorial.
>> 
>> bq: But that does not seem to be helping with sorting or faceting of any
>> kind.
>> This seems to be like a good way to speed up a stored field's retrieval.
>> 
>> These are the same thing. I have two docs. I have to know how they
>> sort. Therefore I need the value in the sort field for each. This the
>> same thing as getting the stored value, no?
>> 
>> As for facets it's the same problem. To count facet buckets I have to
>> find the values for the  field for each document in the results list
>> and tally them. This is also getting the stored value, right? You're
>> asking "for the docs in my result set, how many of them have val1, how
>> many have val2, how many have val54 etc.
>> 
>> And as an aside the docValues can also be used to return the stored value.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Dec 21, 2017 at 8:23 PM, S G <sg.online.email@gmail.com> wrote:
>>> Thank you Eric.
>>> 
>>> I guess the biggest piece I was missing was the sort on a field other
>> than
>>> the search field.
>>> Once you have filtered a list of documents and then you want to sort, the
>>> inverted index cannot be used for lookup.
>>> You just have doc-IDs which are values in inverted index, not the keys.
>>> Hence they cannot be "looked" up - only option is to loop through all the
>>> entries of that key's inverted index.
>>> 
>>> DocValues come to rescue by reducing that looping operation to a lookup
>>> again.
>>> Because in docValues, the key (i.e. array-index) is the document-index
>> and
>>> gives an O(1) lookup for any doc-ID.
>>> 
>>> 
>>> But that does not seem to be helping with sorting or faceting of any
>> kind.
>>> This seems to be like a good way to speed up a stored field's retrieval.
>>> 
>>> DocValues in the current example are:
>>> FieldA
>>> doc1 = 1
>>> doc2 = 2
>>> doc3 =
>>> 
>>> FieldB
>>> doc1 = 2
>>> doc2 = 4
>>> doc3 = 5
>>> 
>>> FieldC
>>> doc1 = 5
>>> doc2 =
>>> doc3 = 5
>>> 
>>> So if I have to run a query:
>>>    fieldA=*&sort=fieldB asc
>>> I will get all the documents due to filter and then I will lookup the
>>> values of field-B from the docValues lookup.
>>> That will give me 2,4,5
>>> This is sorted in this case, but assume that this was not sorted.
>>> (The docValues array is indexed by Lucene's doc-ID not the field-value
>>> after all, right?)
>>> 
>>> Then does Lucene/Solr still sort them like regular array of values?
>>> That does not seem very efficient.
>>> And it does not seem to helping with faceting, pivoting too.
>>> What did I miss?
>>> 
>>> Thanks
>>> SG
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerickson@gmail.com
>>> 
>>> wrote:
>>> 
>>>> Here's where you're going off the rails: "I can just look at the
>>>> map-for-field-A"
>>>> 
>>>> As I said before, you're totally right, all the information you need
>>>> is there. But
>>>> you're thinking of this as though speed weren't a premium when you say.
>>>> "I can just look". Consider that there are single replicas out there
>> with
>>>> 300M
>>>> (or more) docs in them. "Just looking" in a list 300M items long 300M
>> times
>>>> (q=*:*&sort=whatever) is simply not going to be performant compared to
>>>> 300M indexing operations which is what DV does.
>>>> 
>>>> Faceting is much worse.
>>>> 
>>>> Plus space is also at a premium. Java takes 40+ bytes to store the first
>>>> character. So any Java structure you use is going to be enormous. 300M
>> ints
>>>> is bad enough. And if you spoof this by using ordinals as Lucene does,
>>>> you're
>>>> well on your way to reinventing docValues.
>>>> 
>>>> Maybe this will help. Imagine you have a phone book in your hands. It
>>>> consists of documents like this:
>>>> 
>>>> id: something
>>>> phone: phone number
>>>> name: person's name
>>>> 
>>>> For simplicity, they're both string types 'cause they sort.
>>>> 
>>>> Let's search by phone number but sort by name, i.e.
>>>> 
>>>> q=phone:1234*&sort=name asc
>>>> 
>>>> I'm searching and find two docs that match. How do I know how they
>>>> sort wrt each other?
>>>> 
>>>> I'm searching in the phone field but I need the value for each doc
>>>> associated with the name field. In your example I'm searching in
>>>> map-for-fieldA but sorting in map-for-field-B
>>>> 
>>>> To get the name value for these two docs I have to enumerate
>>>> map-for-field-B until I find each doc and then I can get the proper
>>>> value and know how they sort. Sure, I could do some ordering and do a
>>>> binary search but that's still vastly slower than having a structure
>>>> that's a simple index operation to get the value in its field.
>>>> 
>>>> The DV structure is actually more like what's below. These structures
>>>> are simply an array indexed by the _internal_ Lucene document id,
>>>> which is a simple zero-based integer that contains the value
>>>> associated with that doc for that field (I'm simplifying a bit, but
>>>> that's conceptually the deal).
>>>> FieldA
>>>> doc1 = 1
>>>> doc2 = 2
>>>> doc3 =
>>>> 
>>>> FieldB
>>>> doc1 = 2
>>>> doc2 = 4
>>>> doc3 = 5
>>>> 
>>>> FieldC
>>>> doc1 = 5
>>>> doc2 =
>>>> doc3 = 5
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.email@gmail.com> wrote:
>>>>> Thanks a lot Erick and Emir.
>>>>> 
>>>>> I am still a bit confused and an example will help me a lot.
>>>>> Here is a little bit modified version of the same to illustrate my
>> point
>>>>> more clearly.
>>>>> 
>>>>> Let us consider 3 documents - doc1, doc2 and doc3
>>>>> Each contains upto 3 fields - A, B and C.
>>>>> And the values for these fields are random.
>>>>> For example:
>>>>>    doc1 = {A:1, B:2, C:5}
>>>>>    doc2 = {A:2, B:4}
>>>>>    doc3 = {B:5, C:5}
>>>>> 
>>>>> 
>>>>> Inverted Index for the same should be a map of:
>>>>> Key: <value-for-each-field>
>>>>> Value: <document-containing-that-value>
>>>>> i.e.
>>>>> {
>>>>>   map-for-field-A: {1: doc1, 2: doc2}
>>>>>   map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
>>>>>   map-for-field-C: {5: [doc1, doc3]}
>>>>> }
>>>>> 
>>>>> For sorting on field A, I can just look at the map-for-field-A and
>> sort
>>>> the
>>>>> keys (and
>>>>> perhaps keep it sorted too for saving the sort each time). For facets
>> on
>>>>> field A, I can
>>>>> again, just look at the map-for-field-A and get counts for each value.
>>>> So I
>>>>> will
>>>>> get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
>>>>> Similarly facets(Field-C) = {5:2}
>>>>> 
>>>>> Why is this not performant? All it did was to bring one data-structure
>>>> into
>>>>> memory and if
>>>>> the current implementation was changed to use OS-cache for the same,
>> the
>>>>> pressure on
>>>>> the JVM would be reduced as well.
>>>>> 
>>>>> So the point I am trying to make here is that how does the
>>>> data-structure of
>>>>> docValues differ from the inverted index I showed above? And how does
>>>> that
>>>>> structure helps it become more performant? I do not want to factor in
>> the
>>>>> OS-cache perspective here for the time being because that could have
>> been
>>>>> fixed in the regular inverted index also. I just want to focus on the
>>>>> data-structure
>>>>> for now that how it is different from the inverted index. Please do
>> not
>>>> say
>>>>> "columnar format" as
>>>>> those 2 words really convey nothing to me.
>>>>> 
>>>>> If you can draw me the exact "columnar format" for the above example,
>>>> then
>>>>> it would be much appreciated.
>>>>> 
>>>>> Thanks
>>>>> SG
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <
>>>> erickerickson@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> bq: I do not see why sorting or faceting on any field A, B or C would
>>>>>> be a problem. All the values for a field are there in one
>>>>>> data-structure and it should be easy to sort or group-by on that.
>>>>>> 
>>>>>> This is totally true just totally incomplete: ;)
>>>>>> 
>>>>>> for a given field:
>>>>>> 
>>>>>> Inverted structure (leaving out position information and the like):
>>>>>> 
>>>>>> term1: doc1,   doc37, doc 95
>>>>>> term2: doc10, doc37, doc 950
>>>>>> 
>>>>>> docValues structure (assuming multiValued):
>>>>>> 
>>>>>> doc1: term1
>>>>>> doc10: term2
>>>>>> doc37: term1 term2
>>>>>> doc95: term1
>>>>>> doc950: term2
>>>>>> 
>>>>>> They are used to answer two different questions.
>>>>>> 
>>>>>> The inverted structure efficiently answers "for term1, what docs
does
>>>>>> it appear in?"
>>>>>> 
>>>>>> The docValues structure efficiently answers "for doc1, what terms
are
>>>>>> in the field?"
>>>>>> 
>>>>>> So imagine you have a search on term1. It's a simple iteration of
the
>>>>>> inverted structure to get my result set, namely docs 1, 37, and 95.
>>>>>> 
>>>>>> But now I want to facet. I have to get the _values_ for my field
from
>>>>>> the entire result set in order to fill my count buckets. Using the
>>>>>> uninverted structure, I'd have to scan the entire table term-by-term
>>>>>> and look to see if the term appeared in any of docs 1, 37, 95 and
add
>>>>>> to my total for the term. Think "table scan".
>>>>>> 
>>>>>> Instead I use the docValues structure which is much faster, I already
>>>>>> know all I'm interested in is these three docs, so I just read the
>>>>>> terms in the field for each doc and add to my counts. Again, to
>> answer
>>>>>> this question from the wrong (in this case inverted structure) I'd
>>>>>> have to do a table scan. Also, this would be _extremely_ expensive
to
>>>>>> do from stored fields.
>>>>>> 
>>>>>> And it's the inverse for searching the docValues structure. In order
>>>>>> to find which doc has term1, I'd have to examine all the terms for
>> the
>>>>>> field for each document in my index. Horribly painful.
>>>>>> 
>>>>>> So yes, the information is all there in one structure or the other
>> and
>>>>>> you _could_ get all of it from either one. You'd also have a system
>>>>>> that was able to serve 0.00001 QPS on a largish index.
>>>>>> 
>>>>>> And remember that this is very simplified. If you have a complex
>> query
>>>>>> you need to get a result set before even considering the
>>>>>> facet/sort/whatever question so gathering the term information as
I
>>>>>> searched wouldn't particularly work.
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.email@gmail.com>
>> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> It seems that docValues are not really explained well anywhere.
>>>>>>> Here are 2 links that try to explain it:
>>>>>>> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-
>> solr-4-2/
>>>>>>> 2)
>>>>>>> https://www.elastic.co/guide/en/elasticsearch/guide/
>>>>>> current/docvalues.html
>>>>>>> 
>>>>>>> And official Solr documentation that does not explain the internal
>>>>>> details
>>>>>>> at all:
>>>>>>> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>>>>>>> 
>>>>>>> The first links says that:
>>>>>>>  The row-oriented (stored fields) are
>>>>>>>  {
>>>>>>>    'doc1': {'A':1, 'B':2, 'C':3},
>>>>>>>    'doc2': {'A':2, 'B':3, 'C':4},
>>>>>>>    'doc3': {'A':4, 'B':3, 'C':2}
>>>>>>>  }
>>>>>>> 
>>>>>>>  while column-oriented (docValues) are:
>>>>>>>  {
>>>>>>>    'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>>    'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>>    'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>>  }
>>>>>>> 
>>>>>>> And the second link gives an example as:
>>>>>>> Doc values maps documents to the terms contained by the document:
>>>>>>> 
>>>>>>>  Doc      Terms
>>>>>>>  ------------------------------------------------------------
>> -----
>>>>>>>  Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>>>>>>>  Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>>>>>>>  Doc_3 | dog, dogs, fox, jumped, over, quick, the
>>>>>>>  ------------------------------------------------------------
>> -----
>>>>>>> 
>>>>>>> 
>>>>>>> To me, this example is same as the row-oriented (stored fields)
>>>> format in
>>>>>>> the first link.
>>>>>>> Which one is right?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Also, the column-oriented (docValues) mentioned above are:
>>>>>>> {
>>>>>>>  'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>>  'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>>  'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>> }
>>>>>>> Isn't this what the inverted index also looks like?
>>>>>>> Inverted index is an index of the term (A,B,C) to the document
and
>> the
>>>>>>> position it is found in the document.
>>>>>>> 
>>>>>>> 
>>>>>>> Or is it better to say that the inverted index is of the form:
>>>>>>> {
>>>>>>>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>>>>>>>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>>>>>>>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
>>>>>>> }
>>>>>>> But even if that is true, I do not see why sorting or faceting
on
>> any
>>>>>> field
>>>>>>> A, B or C would be a problem.
>>>>>>> All the values for a field are there in one data-structure and
it
>>>> should
>>>>>> be
>>>>>>> easy to sort or group-by on that.
>>>>>>> 
>>>>>>> Can someone explain the above a bit more clearly please? A
>> build-upon
>>>> the
>>>>>>> same example as above would be really good.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks
>>>>>>> SG
>>>>>> 
>>>> 
>> 


Mime
View raw message