lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Confusing DocValues documentation
Date Thu, 21 Dec 2017 20:41:11 GMT
Hi SG,
Both explanations are correct; they just use different notation. Please see my inline comments.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Dec 2017, at 18:56, S G <sg.online.email@gmail.com> wrote:
> 
> Hi,
> 
> It seems that docValues are not really explained well anywhere.
> Here are 2 links that try to explain it:
> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> 2)
> https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
> 
> And official Solr documentation that does not explain the internal details
> at all:
> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> 
> The first links says that:
>  The row-oriented (stored fields) are
>  {
>    'doc1': {'A':1, 'B':2, 'C':3},
>    'doc2': {'A':2, 'B':3, 'C':4},
>    'doc3': {'A':4, 'B':3, 'C':2}
>  }
[EA] These are the input documents. For completeness, it would help if one of the example
fields were multivalued.

> 
>  while column-oriented (docValues) are:
>  {
>    'A': {'doc1':1, 'doc2':2, 'doc3':4},
>    'B': {'doc1':2, 'doc2':3, 'doc3':3},
>    'C': {'doc1':3, 'doc2':4, 'doc3':2}
>  }
[EA] You can focus on a single field here.

> 
> And the second link gives an example as:
> Doc values maps documents to the terms contained by the document:
> 
>  Doc      Terms
>  -----------------------------------------------------------------
>  Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>  Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>  Doc_3 | dog, dogs, fox, jumped, over, quick, the
>  -----------------------------------------------------------------
[EA] This is the same column-oriented layout, written one document per line, for a single
field with multiple values. Note that the terms for each document are deduplicated and sorted.

> 
> 
> To me, this example is same as the row-oriented (stored fields) format in
> the first link.
> Which one is right?
[EA] As explained above, this is the column-oriented structure for a single field. In the
first link's notation, the row-oriented form would be:
{
  'Doc_1': {'text_field': 'The quick brown fox jumped over lazy dog', 'some_other_field': …},
  'Doc_2': …
}
and the column-oriented form would be:
{
  'text_field': {'Doc_1': ['brown', 'dog', 'fox', …], 'Doc_2': ['brown', 'dogs', …]}
}
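
To make the two layouts concrete, here is a minimal Python sketch that derives the
column-oriented form from the row-oriented one. The sentences, the lowercase/whitespace
"analysis" and the helper name to_doc_values are assumptions made for illustration; this
is not how Lucene encodes doc values on disk.

```python
# Row-oriented (stored fields): one entry per document, holding the original text.
# The sentences are toy data chosen to match the term table quoted above.
stored = {
    "Doc_1": {"text_field": "The quick brown fox jumped over the lazy dog"},
    "Doc_2": {"text_field": "Quick brown foxes leap over lazy dogs in summer"},
}


def to_doc_values(stored_docs, field):
    """Column-oriented view of one field: doc id -> sorted, deduplicated terms.

    Lowercasing and splitting on whitespace stands in for a real analyzer.
    """
    return {
        doc_id: sorted(set(fields[field].lower().split()))
        for doc_id, fields in stored_docs.items()
    }


# Column-oriented (doc values): one entry per field, holding every document's terms.
doc_values = {"text_field": to_doc_values(stored, "text_field")}

for doc_id, terms in doc_values["text_field"].items():
    print(doc_id, terms)
```

Both structures are keyed by document in the end; the difference is that the doc values
keep all values of one field together, already deduplicated and sorted.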

> 
> 
> 
> Also, the column-oriented (docValues) mentioned above are:
> {
>  'A': {'doc1':1, 'doc2':2, 'doc3':4},
>  'B': {'doc1':2, 'doc2':3, 'doc3':3},
>  'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
> Isn’t this what the inverted index also looks like?
[EA] No. The inverted index is, well, inverted :) Its keys are the field values (terms), and
its values are the lists of doc IDs that contain them.

> Inverted index is an index of the term (A,B,C) to the document and the
> position it is found in the document.
> 
> 
> Or is it better to say that the inverted index is of the form:
> {
>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> }
[EA] Yes, this is what the inverted index looks like.
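
As a rough sketch, the inverted form can be built from the three example documents like
this. The function name invert and the use of plain dicts and lists are assumptions for
illustration; a real index also stores positions and uses compressed postings.

```python
from collections import defaultdict

# The three example documents from the first link.
docs = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}


def invert(docs):
    """Build field -> value -> list of doc ids (a toy inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            index[field][value].append(doc_id)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {field: dict(postings) for field, postings in index.items()}


inverted = invert(docs)
print(inverted["B"])
```

Looking up "which docs have B = 3" is a single key lookup here, while "what is B for
doc2" would require scanning every value of B.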

> But even if that is true, I do not see why sorting or faceting on any field
> A, B or C would be a problem.
[EA] It is more obvious with multivalued fields. Imagine you want to facet on text_field from
the previous example and the query has matched Doc_1, Doc_2, ..., Doc_n. How would you do it
with only the inverted structure? You would have to go through every term in the field and
count how many documents from the result set appear in its postings. Stored fields do not
help either: they are not deduplicated and are not optimized for quick access.
On the other hand, you can use doc values in place of stored fields if you can accept that
the values come back sorted.
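
The difference can be sketched in Python using the three documents from the second link.
Both function names and the in-memory dict/set structures are assumptions for
illustration; Solr's actual faceting code is considerably more involved.

```python
from collections import Counter

# Doc values for text_field: doc id -> sorted, deduplicated terms.
doc_values = {
    "Doc_1": ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    "Doc_2": ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick", "summer"],
    "Doc_3": ["dog", "dogs", "fox", "jumped", "over", "quick", "the"],
}

# Inverted index for the same field: term -> set of doc ids.
inverted = {}
for doc_id, terms in doc_values.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc_id)


def facet_with_doc_values(matched):
    """Visit only the matched docs and count the terms each one holds."""
    counts = Counter()
    for doc_id in matched:
        counts.update(doc_values[doc_id])
    return counts


def facet_with_inverted_index(matched):
    """Visit every term in the field and intersect its postings with the result set."""
    counts = Counter()
    for term, postings in inverted.items():
        hits = len(postings & matched)
        if hits:
            counts[term] = hits
    return counts


matched = {"Doc_1", "Doc_3"}
by_doc_values = facet_with_doc_values(matched)
by_inverted = facet_with_inverted_index(matched)
print(by_doc_values == by_inverted)
```

Both functions return the same counts, but the first does work proportional to the size
of the result set, while the second touches every distinct term in the field even when
only a few documents matched.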

> All the values for a field are there in one data-structure and it should be
> easy to sort or group-by on that.
> 
> Can someone explain the above a bit more clearly please? A build-upon the
> same example as above would be really good.
> 
> 
> Thanks
> SG

