lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Statically store sub-collections for search (faceted search?)
Date Sat, 13 Apr 2013 17:53:29 GMT
Hi Carsten,

You're right that Lucene document numbers are ephemeral, but they are
consistent for a certain IndexReader instance. So perhaps you can use
SearcherLifetimeManager to obtain a 'version' of the reader that returned
the original results and store a bitset together with that version. Then
when the user further searches this subset of documents, you pull the
relevant reader from SLM given the 'version' information.
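To make the bookkeeping concrete, here is a minimal, library-free sketch of "store a bitset together with that version". It does not call Lucene: the class name VersionedSubsets, the subset names, and the plain long used as the version are all hypothetical stand-ins. In real code the long would be the token returned by SearcherLifetimeManager.record(IndexSearcher), and the later search would run on the searcher returned by acquire(token), which is null once that searcher has been pruned.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model: doc-number bitsets keyed by name, each tied to one reader version. */
final class VersionedSubsets {

    /** A saved sub-collection; its doc numbers are only meaningful for readerVersion. */
    private static final class Subset {
        final long readerVersion; // stand-in for the SearcherLifetimeManager token
        final BitSet docs;

        Subset(long readerVersion, BitSet docs) {
            this.readerVersion = readerVersion;
            this.docs = docs;
        }
    }

    private final Map<String, Subset> byName = new HashMap<>();

    /** Remember the doc numbers of a result set together with the version that produced them. */
    void save(String name, long readerVersion, int[] hitDocs) {
        BitSet bits = new BitSet();
        for (int doc : hitDocs) {
            bits.set(doc);
        }
        byName.put(name, new Subset(readerVersion, bits));
    }

    /** Keep only those later hits that belong to the saved subset. */
    int[] restrict(String name, long readerVersion, int[] hits) {
        Subset subset = byName.get(name);
        if (subset == null) {
            throw new IllegalArgumentException("unknown subset: " + name);
        }
        // Doc numbers from a different reader version would point at different documents.
        if (subset.readerVersion != readerVersion) {
            throw new IllegalStateException("subset was recorded for version "
                    + subset.readerVersion + ", not " + readerVersion);
        }
        return Arrays.stream(hits).filter(doc -> subset.docs.get(doc)).toArray();
    }
}
```

The version check is the important part: a bitset of document numbers is only safe to reuse against the exact reader it was recorded from.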

I think you can write your own Pruner that prunes IndexReader
instances/versions once their corresponding document-subset tables are no
longer needed...
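The pruning decision could be sketched roughly as follows. This is again plain Java, not Lucene code: ReaderVersionPruner and its methods are hypothetical names, and the set of open versions stands in for whatever SearcherLifetimeManager is holding. In Lucene itself the decision would live in an implementation of SearcherLifetimeManager.Pruner, whose doPrune(double ageSec, IndexSearcher searcher) returns true for searchers that may be released.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical model: release a reader version once no stored subset depends on it. */
final class ReaderVersionPruner {

    // version token -> number of stored subsets that still depend on that version
    private final Map<Long, Integer> subsetCounts = new HashMap<>();

    /** Call when a subset bitset recorded against this version is persisted. */
    void subsetStored(long version) {
        subsetCounts.merge(version, 1, Integer::sum);
    }

    /** Call when such a subset is deleted; the entry disappears when the count reaches zero. */
    void subsetDeleted(long version) {
        subsetCounts.computeIfPresent(version, (v, n) -> n > 1 ? n - 1 : null);
    }

    /** True means no subset needs this version any more, so it may be released. */
    boolean shouldPrune(long version) {
        return !subsetCounts.containsKey(version);
    }

    /** Drops every releasable version from the open set and returns how many were dropped. */
    int prune(Set<Long> openVersions) {
        int before = openVersions.size();
        openVersions.removeIf(this::shouldPrune);
        return before - openVersions.size();
    }
}
```

Reference counting is only one possible policy; combining it with an age limit would keep old readers from staying open indefinitely because of a forgotten subset.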

Shai


On Fri, Apr 12, 2013 at 9:08 PM, SUJIT PAL <sujit.pal@comcast.net> wrote:

> Hi Carsten,
>
> Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
> Since you are not doing any scoring (only filtering), the max boolean
> clauses limit should not apply to a filter.
>
> -sujit
>
> On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote:
>
> > Dear list,
> > I would like to create a sub-set of the documents in an index that is to
> > be used for further searches. However, the criteria that lead to the
> > creation of that sub-set are not predefined so I think that faceted
> > search cannot be applied to this use case.
> >
> > For instance:
> > A user searches for documents that contain token 'A' in a field 'text'.
> > These results form a set of documents that is persistently stored (in a
> > database). Each document in the index has a field 'id' that identifies
> > it, so these "external" IDs are stored in the database.
> >
> > Later on, a user loads the document IDs from the database and wants to
> > execute another search on this set of documents only. However,
> > performing a search on the full index and subsequently filtering the
> > results against that list of documents takes very long if there are many
> > matches. This is obvious as I have to retrieve the external id from each
> > matching document and check whether it is part of the desired sub-set.
> > Constructing a BooleanQuery in the style "id:Doc1 OR id:Doc2 ..." is not
> > suitable either because there could be thousands of documents exceeding
> > any limit for Boolean clauses.
> >
> > Any suggestions how to solve this? I would have gone for the Lucene
> > document numbers and store them as a bit set that I could use as a
> > filter during later searches, but I read that the document numbers are
> > ephemeral.
> >
> > One possible way out seems to be to create another index from the
> > documents that have matched the initial search, but this seems quite an
> > overkill, especially if there are plenty of them...
> >
> > Thanks for any hint!
> > Carsten
> >
> > --
> > Institut für Deutsche Sprache | http://www.ids-mannheim.de
> > Projekt KorAP                 | http://korap.ids-mannheim.de
> > Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> > Korpusanalyseplattform der nächsten Generation
> > Next Generation Corpus Analysis Platform
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
