lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Practical usages of arbitrary Shingles when using a query parser?
Date Mon, 30 Jul 2018 22:46:37 GMT

Although I've been aware of Shingles and some of their useful applications 
for a long time, today is the first time I really sat down and tried to do 
something non-trivial with them myself.

My objective seems relatively straightforward: given a corpus of text and 
some analyzer (for sake of discussion let's assume simple whitespace 
tokenization w/lowercasing) I want to be able to say "I am happy to trade 
index time/size for faster queries of shorter phrases"

So instead of just indexing "the quick brown fox jumped over the lazy dog" 
as a field with 9 terms, I might want to add ShingleFilterFactory to the 
end of my analyzer using [[minShingleSize="2" maxShingleSize="2" 
outputUnigrams="true"]] and now I have a field w/17 terms, but if I get a 
query for a "phrase" of 2 words/terms, I should in theory be able to just 
use a TermQuery under the covers -- making it just as "fast" as a query for 
a single word/term.  But meanwhile longer phrases should still "just work" 
as if I didn't have any shingles.
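
To make the 9 vs. 17 count concrete, here is a rough sketch in plain Python. 
It only simulates the token stream described above; it is not Lucene code, 
and the helper name shingle_tokens is made up for illustration. It assumes 
each shingle is emitted at the position of its first word and joined with a 
single space.

```python
def shingle_tokens(words, min_size=2, max_size=2, output_unigrams=True):
    """Simulated token stream as (position, text) pairs.

    Hypothetical helper, not a Lucene API: each shingle is placed at the
    position of its first word and joined with a single space.
    """
    out = []
    for pos in range(len(words)):
        if output_unigrams:
            out.append((pos, words[pos]))
        for size in range(min_size, max_size + 1):
            if pos + size <= len(words):
                out.append((pos, " ".join(words[pos:pos + size])))
    return out


# Whitespace tokenization plus lowercasing, as assumed in the discussion.
words = "the quick brown fox jumped over the lazy dog".lower().split()
tokens = shingle_tokens(words)
print(len(words), len(tokens))  # prints: 9 17
```

The 17 tokens are the 9 unigrams plus the 8 two-word shingles, so a two-word 
"phrase" such as "quick brown" exists in the field as a single term.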

So far so good...

If I actually index a corpus as described above, and then at query time I 
use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2" 
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the 
expected TermQuery for either a single word input or two-word input ... 
for input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one 
composed of bi-shingles instead of individual unigrams, but AFAICT the 
position info is set correctly so that it will only match the documents 
that would have been matched w/o any shingles (and IIUC the term stats 
for the shingles seem like they should probably result in subjectively 
"better" scores? not certain on this bit, but also not overly concerned 
about it)
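
The query-time behaviour described above can be sketched the same way. This 
is again plain Python rather than Lucene: query_shingles and query_shape are 
hypothetical names, and the "TermQuery" / "PhraseQuery" strings only label 
the shape of query reported above, they do not build real queries.

```python
def query_shingles(words, size=2):
    """Fixed-size shingles only, falling back to the unigrams when the
    input is too short to form one (mimics outputUnigrams="false" with
    outputUnigramsIfNoShingles="true"). Hypothetical helper."""
    shingles = [" ".join(words[i:i + size])
                for i in range(len(words) - size + 1)]
    return shingles if shingles else list(words)


def query_shape(words):
    """Label the query shape: one term -> TermQuery, several consecutive
    terms -> PhraseQuery. Expects at least one word."""
    terms = query_shingles(words)
    if len(terms) == 1:
        return ("TermQuery", terms[0])
    return ("PhraseQuery", terms)


print(query_shape(["fox"]))
print(query_shape(["quick", "brown"]))
print(query_shape(["quick", "brown", "fox"]))
```

For the three-word input the sketch yields the two bi-shingles "quick brown" 
and "brown fox" at consecutive positions, which is the PhraseQuery case.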

The problem is that (unless I'm missing something) this doesn't really 
work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.

If I change my index time ShingleFilterFactory to use [[minShingleSize="2" 
maxShingleSize="N" outputUnigrams="true"]] the equivalent change to the 
query time analyzer would be [[minShingleSize="2" maxShingleSize="N" 
outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while 
that does seem to cause "phrase" input of all sizes to be converted by the 
analyzer+QueryParser into a query that (AFAICT) will match the correct 
documents (compared to using no shingles) it's only "optimized" as a 
TermQuery for one & two word phrases.  For input phrases longer than 2 
terms it generates a SpanOrQuery wrapping multiple SpanNearQueries, 
I believe because of the overlapping positions of the bi/tri/quad-etc.. 
shingles.
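
A small sketch of the overlap, in plain Python and under the same simplifying 
assumption that every shingle sits at the position of its first word. The 
helper shingles_by_position is hypothetical and does not reproduce what the 
QueryParser does with the stream; it only shows that several shingles end up 
sharing a start position once min and max differ.

```python
from collections import defaultdict


def shingles_by_position(words, min_size=2, max_size=3):
    """Group simulated query-time shingles by start position.

    Hypothetical helper (not a Lucene API); unigrams are omitted, as with
    outputUnigrams="false".
    """
    by_pos = defaultdict(list)
    for pos in range(len(words)):
        for size in range(min_size, max_size + 1):
            if pos + size <= len(words):
                by_pos[pos].append(" ".join(words[pos:pos + size]))
    return dict(by_pos)


for pos, terms in shingles_by_position(["quick", "brown", "fox"]).items():
    print(pos, terms)
```

Position 0 carries both "quick brown" and "quick brown fox" while position 1 
carries "brown fox", so there is no longer a single linear sequence of terms 
to hand to a PhraseQuery.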

There just doesn't seem to be any good/generic way to leverage a field 
built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]] 
(where X != Y) at query time using a QueryParser configured with out of 
the box analyzer components.

It seems like what's missing is a ShingleFilter(Factory) configuration 
that means "output the maximum possible shingle size between MIN and 
MAX based on the size of the input stream" ... but that doesn't seem to 
exist.
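
To pin down what I mean, here is the behaviour sketched in plain Python. To 
be clear, this is a hypothetical mode: largest_shingles is an invented name, 
nothing like it is an existing ShingleFilter option as far as I can tell, and 
the fallback to unigrams for inputs shorter than MIN is my own assumption.

```python
def largest_shingles(words, min_size, max_size):
    """Emit only shingles of the largest size the input allows.

    Hypothetical behaviour, not an existing Lucene option:
      * fewer than min_size words -> the unigrams, unchanged (assumption)
      * otherwise -> shingles of size min(len(words), max_size)
    """
    n = len(words)
    if n < min_size:
        return list(words)
    size = min(n, max_size)
    return [" ".join(words[i:i + size]) for i in range(n - size + 1)]


print(largest_shingles(["fox"], 2, 4))
print(largest_shingles(["quick", "brown", "fox"], 2, 4))
print(largest_shingles("the quick brown fox jumped".split(), 2, 4))
```

Any input of up to MAX words collapses to a single term, and longer input 
becomes a run of MAX-sized shingles at consecutive positions with no 
overlapping alternatives at the same position.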

Does anyone have any advice/suggestions on how to approach this type of 
problem based on their own experiences?  Does anyone have first hand 
experience using maxShingleSize > 2 with a QueryParser (and w/o any 
preconceived assumptions about the length of the input) ?


-Hoss
http://www.lucidworks.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

