Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CADuAvzdTY0YObYkV1zDCki86a99HOmLp7_NKMqDBFcXuEC1=0g@mail.gmail.com>
References: 
 <CADuAvzfo8v8gPzwyFET+pZrvssU-V8q-ZpiKg-Mf6-Wo_-CQDg@mail.gmail.com>
 <CAPsWd+MUsyDOMtNjo+5dDQOw_SZ=mx2hiQwRqv+dW61Tax4n5A@mail.gmail.com>
 <CADuAvzdTY0YObYkV1zDCki86a99HOmLp7_NKMqDBFcXuEC1=0g@mail.gmail.com>
From: =?UTF-8?B?QW5kcsOhcyBQw6l0ZXJp?= <apeteri@b2international.com>
Date: Sun, 9 Aug 2015 17:51:54 +0200
Message-ID: 
 <CAO=0Loax+M-xHTn01nH8BQytMEdkvigPFMwgYbEVqj51zPqYtw@mail.gmail.com>
Subject: Re: Mapping doc values back to doc ID (in decent time)
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary=f46d0435c0344cb5d3051ce2da3f

--f46d0435c0344cb5d3051ce2da3f
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

If I understand it correctly, the Zoie library [1][2] implements the
"sledgehammer" approach by collecting docValues for all documents when a
segment reader is opened. If you have some RAM to throw at the problem,
this could indeed bring you an acceptable level of performance.

[1] http://senseidb.github.io/zoie/
[2] https://github.com/senseidb/zoie/blob/master/zoie-core/src/main/java/proj/zoie/api/impl/DocIDMapperImpl.java
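
To make the idea concrete, here is a minimal sketch of such an eager
mapping. It is not Zoie's actual implementation and it does not use the
Lucene API: a segment is modeled as a plain long[] of ID values plus a
docBase, and the names SegmentIds and EagerDocIdMapper are made up for
illustration. It also assumes each ID occurs in at most one document.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for one segment: ids[i] is the ID value of segment-local doc i. */
final class SegmentIds {
    final int docBase;   // global doc ID of the segment's first document
    final long[] ids;

    SegmentIds(int docBase, long[] ids) {
        this.docBase = docBase;
        this.ids = ids;
    }
}

/** Reads every ID value once, up front, and keeps an ID -> global doc ID map in RAM. */
final class EagerDocIdMapper {
    private final Map<Long, Integer> idToDoc = new HashMap<>();

    EagerDocIdMapper(List<SegmentIds> segments) {
        for (SegmentIds seg : segments) {
            for (int local = 0; local < seg.ids.length; local++) {
                // Assumption: IDs are unique, so no entry is overwritten.
                idToDoc.put(seg.ids[local], seg.docBase + local);
            }
        }
    }

    /** Returns the global doc ID for the given ID, or -1 if it is unknown. */
    int docId(long id) {
        Integer doc = idToDoc.get(id);
        return doc == null ? -1 : doc;
    }
}
```

The cost is one pass over all values whenever a reader is (re)opened,
plus the memory for the map; after that, each lookup is a single hash
probe instead of a seek.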

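For comparison, the cheaper alternative discussed in the quoted text
below (seek each segment yourself, largest first, and stop at the first
hit) could look roughly like this. Again a sketch under assumptions, not
Lucene code: each segment is modeled as a sorted long[] of IDs with a
parallel array of segment-local doc IDs, a binary search stands in for
the per-segment seek, and SegmentTerms / LargestFirstLookup are
hypothetical names. Stopping early is only valid if an ID lives in at
most one segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy stand-in for one segment's sorted ID dictionary. */
final class SegmentTerms {
    final int docBase;       // global doc ID of the segment's first document
    final long[] sortedIds;  // IDs in ascending order
    final int[] localDocs;   // localDocs[i] is the segment-local doc holding sortedIds[i]

    SegmentTerms(int docBase, long[] sortedIds, int[] localDocs) {
        this.docBase = docBase;
        this.sortedIds = sortedIds;
        this.localDocs = localDocs;
    }
}

/** Probes segments from largest to smallest and returns on the first match. */
final class LargestFirstLookup {
    private final List<SegmentTerms> bySize;

    LargestFirstLookup(List<SegmentTerms> segments) {
        bySize = new ArrayList<>(segments);
        // Largest segment first: it is the most likely to contain the ID.
        bySize.sort(Comparator.comparingInt((SegmentTerms s) -> s.sortedIds.length).reversed());
    }

    /** Returns the global doc ID for the given ID, or -1 if no segment has it. */
    int docId(long id) {
        for (SegmentTerms seg : bySize) {
            int pos = Arrays.binarySearch(seg.sortedIds, id);
            if (pos >= 0) {
                return seg.docBase + seg.localDocs[pos]; // stop after the first hit
            }
        }
        return -1;
    }
}
```

This needs no extra RAM beyond the sorted segment list, but a miss still
has to probe every segment.
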
On Sun, Aug 9, 2015 at 9:41 AM, Trejkaz <trejkaz@trypticon.org> wrote:

> On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand <jpountz@gmail.com> wrote:
> > Does your application actually iterate in order over dense ids, or is
> > it just for benchmarking purposes? Because if it does, you probably
> > don't actually need seeking, you could just see what the current ID in
> > the terms enum is.
>
> Both dense ID fetches and individual ID fetches exist in the
> application. I put them in a benchmark deliberately doing it as
> individual fetches to get an idea of average timing for a single
> operation.
>
> There are so many use cases of doing the individual fetches that it's
> tough to enumerate. The first one I found was "fetch the term vector
> for ID + field" but I'm sure there will be tons of them.
>
> For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
> would probably use something like DocValuesTermsQuery for that to get
> them all in one shot. I also wondered whether writing our filters as
> queries would help, but I think it would turn out to be about as fast
> as DocValuesTermsQuery even if I did that.
>
> I'm sure the only way to really improve the speed of these filters is
> to start storing these things in the text index and use query-time
> joins, but I can't do that until I solve the issue of relying on
> stable doc IDs and it seems like trying to solve two large problems in
> a single commit would be biting off more than I can chew.
>
> > If you actually need seeking, then you should try
> > to avoid MultiFields, it will call seekExact on each segment, while
> > given what I see you could just stop after you found one segment with
> > the value.
>
> Ah, I did wonder whether MultiFields had any behaviour like that, so
> that definitely means that I will avoid using it. Then I can try other
> tricks, like trying the seeks in order of segment size (the largest
> segment is most likely to contain the hit.)
>
> TX
>
-- 
András

--f46d0435c0344cb5d3051ce2da3f--