lucene-java-user mailing list archives

From Danil ŢORIN <torin...@gmail.com>
Subject Re: Small Vocabulary
Date Tue, 07 Aug 2012 08:20:53 GMT
If you are doing an intersection (not a join), maybe it makes sense to
put everything into one index?

Just transform your input like "brown fox" into "ADJ:brown|<your
payload> NOUN:fox|<other payload>"

Write a custom tokenizer, some filters and that's it.
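
To make the suggested transformation concrete, here is a minimal sketch in
plain Java (no Lucene dependency) of the "POS:word|payload" encoding and its
inverse. The class and method names (TaggedTokenEncoder, TaggedToken, encode,
decode) and the payload strings "p0"/"p1" are hypothetical placeholders, not
part of any library. The sketch assumes that tags contain no ':', payloads
contain no '|', and nothing contains whitespace. On the Lucene side, the
existing DelimitedPayloadTokenFilter splits a token at a delimiter such as
'|' and stores the remainder as a payload; whether it fits this project is
an open question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds and parses "POS:word|payload" strings.
class TaggedTokenEncoder {

    /** One annotated token: POS tag, surface form, and an opaque payload string. */
    static final class TaggedToken {
        final String pos;
        final String word;
        final String payload;

        TaggedToken(String pos, String word, String payload) {
            this.pos = pos;
            this.word = word;
            this.payload = payload;
        }
    }

    /** Encodes tokens as "POS:word|payload", separated by single spaces. */
    static String encode(List<TaggedToken> tokens) {
        StringBuilder sb = new StringBuilder();
        for (TaggedToken t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.pos).append(':').append(t.word).append('|').append(t.payload);
        }
        return sb.toString();
    }

    /** Reverses encode(): splits on whitespace, then on the first ':' and the last '|'. */
    static List<TaggedToken> decode(String encoded) {
        List<TaggedToken> out = new ArrayList<>();
        for (String chunk : encoded.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // empty input yields no tokens
            }
            int colon = chunk.indexOf(':');
            int bar = chunk.lastIndexOf('|');
            if (colon < 0 || bar < colon) {
                throw new IllegalArgumentException("Malformed token: " + chunk);
            }
            out.add(new TaggedToken(
                    chunk.substring(0, colon),
                    chunk.substring(colon + 1, bar),
                    chunk.substring(bar + 1)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedToken> tokens = new ArrayList<>();
        tokens.add(new TaggedToken("ADJ", "brown", "p0"));
        tokens.add(new TaggedToken("NOUN", "fox", "p1"));
        System.out.println(encode(tokens)); // prints "ADJ:brown|p0 NOUN:fox|p1"
    }
}
```

With everything in one field, a query for "an adjective followed by fox"
only has to consult a single index, so no cross-index intersection is
needed.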

Of course I'm not aware of all the details, so my solution might not
be applicable to your project.
Maybe you could share more details, so this doesn't turn into an "XY problem".

Keep in mind: always optimize your index for the query use case,
instead of blindly processing the input data.


On Tue, Aug 7, 2012 at 10:29 AM, Carsten Schnober
<schnober@ids-mannheim.de> wrote:
> On 06.08.2012 20:29, Mike Sokolov wrote:
>
> Hi Mike,
>
>> There was some interesting work done on optimizing queries including
>> very common words (stop words) that I think overlaps with your problem.
>> See this blog post
>> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
>> from the Hathi Trust.
>>
>> The upshot in a nutshell was that queries including terms with very
>> large postings lists (i.e. high occurrences) were slow, and the approach
>> they took to dealing with this was to index n-grams (i.e. pairs and
>> triplets of adjacent tokens). However, I'm not sure this would help much
>> if your queries will typically include only a single token.
>
> This is very interesting for our use case indeed. However, you are right
> that indexing n-grams is not (per se) a solution for my given problem
> because I'm working on an application using multiple indexes. A query
> for one isolated frequent term will presumably be rare, or at
> least rare enough to tolerate slow response times, but the results will
> typically be intersected with results from other indexes.
>
> To illustrate this more practically: the index I described, which has
> relatively few distinct tokens, some of them extremely frequent, indexes
> part-of-speech (POS) tags with positional information stored in the
> payload. A parallel index indexes actual text; a typical query may look
> for a certain POS tag in one index and a word X at the same position
> with a matching payload in the other index. So both indexes need to be
> queried completely before the intersection can be performed.
>
> Best,
> Carsten
>
>
>
> --
> Institut für Deutsche Sprache | http://www.ids-mannheim.de
> Projekt KorAP                 | http://korap.ids-mannheim.de
> Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
> Korpusanalyseplattform der nächsten Generation
> Next Generation Corpus Analysis Platform
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


