lucene-java-user mailing list archives

From Vitaly Funstein <vfunst...@gmail.com>
Subject Re: Stored fields and OS file caching
Date Sat, 05 Apr 2014 02:12:33 GMT
Thanks for the explanation, Adrien. I do have a couple of follow-up
questions. Isn't this block size used for file caching OS-dependent? And if
4K happens to be the most commonly used size, wouldn't it make more sense
for the default stored fields format to have a chunk size equal to or
smaller than that number? It's a bit of a guess on my part, but I did get
better write and search performance with size <= 2K, as opposed to the
default 16K.
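
As a rough illustration of the page-size question above, the sketch below counts how many cache pages a single chunk read would touch for a few chunk sizes. It is a simplified model, not Lucene code: the class and method names are hypothetical, the 4096-byte page size is an assumption (the real value depends on the OS and its configuration), and compression and OS readahead are ignored.

```java
public class PageFootprint {

    /**
     * Number of cache pages touched when reading `length` bytes
     * starting at byte `offset` of a file, for a given page size.
     */
    static long pagesTouched(long offset, long length, int pageSize) {
        if (length <= 0) {
            return 0;
        }
        long firstPage = offset / pageSize;
        long lastPage = (offset + length - 1) / pageSize;
        return lastPage - firstPage + 1;
    }

    public static void main(String[] args) {
        int pageSize = 4096;                   // assumed page size
        int[] chunkSizes = {2048, 4096, 16384}; // sizes mentioned in this thread

        for (int chunk : chunkSizes) {
            // Best case: the chunk starts exactly on a page boundary.
            long best = pagesTouched(0, chunk, pageSize);
            // Worst case: the chunk starts on the last byte of a page.
            long worst = pagesTouched(pageSize - 1, chunk, pageSize);
            System.out.println("chunk=" + chunk + " bytes: "
                    + best + " to " + worst + " pages");
        }
    }
}
```

Under this model a 16K chunk spans four or five pages, while a chunk of 4K or less spans one or two, because chunks are not guaranteed to start on a page boundary.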


On Fri, Apr 4, 2014 at 3:50 PM, Adrien Grand <jpountz@gmail.com> wrote:

> Hi Vitaly,
>
> Doc values are indeed well-suited for grouping and sorting. However
> stored fields remain better at returning field values to users since
> they guarantee a worst-case of one disk seek per document.
>
> The filesystem cache typically caches data by blocks of 4KB. This
> plays more nicely with doc values: given that they are stored in a
> column-stride fashion, you load only those field values into the
> filesystem cache. On the other hand with stored fields, data is stored
> sequentially in a very large file, so whenever you read a single field
> value, the operating system loads a 4KB block into the filesystem
> cache that likely contains other fields' values that you are not
> interested in.
>
>
>
> On Sat, Apr 5, 2014 at 12:23 AM, Vitaly Funstein <vfunstein@gmail.com>
> wrote:
> > I use stored fields to load values for the following use cases:
> > - to return per-document values as is, requested by the user - similar to
> > listing DB columns you are interested in, in a "select ..." clause.
> > - to perform aggregate function calculations while forming the result set
> > (if requested).
> > - for group-by type queries (would like to switch to the native grouping
> > API, but don't think it supports grouping on multiple fields, or
> > aggregate functions).
> > - and finally, as I mentioned - to sort search results, also when
> > requested.
> >
> > Evidently, even for simple queries that don't require any of the
> > post-processing above but ask for a set of values from each document,
> > there's still a non-trivial amount of disk activity... hence, I started
> > second-guessing the implementation.
> >
> >
> > On Fri, Apr 4, 2014 at 3:00 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> >> Hi,
> >>
> >> What are you doing with the stored fields? They are not deprecated and
> >> also not really slow, unless you scan over millions of documents in
> >> random access order. To display search results, DocValues are of no use.
> >>
> >> Uwe
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: uwe@thetaphi.de
> >>
> >>
> >> > -----Original Message-----
> >> > From: Vitaly Funstein [mailto:vfunstein@gmail.com]
> >> > Sent: Friday, April 04, 2014 9:44 PM
> >> > To: java-user@lucene.apache.org
> >> > Subject: Stored fields and OS file caching
> >> >
> >> > I have heard here that stored fields don't work well with OS file
> >> > caching. Could someone elaborate on why that is? I am using Lucene 4.6
> >> > and we do use stored fields but not doc values; it appears most of the
> >> > benefit from the latter comes as improvement in sorting performance,
> >> > and I don't actually use Lucene for sorting at all; rather, it's done
> >> > on a post-processing basis, based on stored field values (in a
> >> > nutshell, the reason for this is Lucene's inability to tell apart
> >> > terms that are empty strings vs. a missing value, resulting in
> >> > unstable sort order on such fields).
> >> >
> >> > I am not sure if switching to using doc values fields from stored
> >> > fields entirely would help leverage OS file cache better... what
> >> > worries me is that when processing queries requesting multiple values
> >> > from the document, doc value fields could cause multiple disk seeks to
> >> > fetch values for each field, as opposed to just one with stored fields.
> >> >
> >> > Am I way off in my understanding of how this works? Any guidelines, as
> >> > general as they may be, are appreciated.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>
>
>
> --
> Adrien
>
>
>
