lucene-java-user mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: Practical usages of arbitrary Shingles when using a query parser?
Date Tue, 31 Jul 2018 07:43:11 GMT
Hi Hoss,

The query parser is confused by these overlapping positions indeed, which
it interprets as synonyms. I was going to write that you should set the
same min and max shingle sizes at query time, but while writing that I
realized that you probably wanted to keep outputting shorter shingles so
that a phrase query on 2 terms with a max shingle size of 3 would still use
shingles? Maybe 'outputUnigramsIfNoShingles' should really be something
like 'outputShinglesOfTheMaximumSizeOnly'?
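
To make the overlap concrete, here is a minimal plain-Java sketch. It does
not use Lucene: the `ShingleOverlap` class and its `shingle` helper are
hypothetical names for illustration only, and it assumes whitespace
tokenization with no unigram output. It groups every shingle of size
min..max by the position of its first word, which shows how two shingle
sizes end up stacked on the same position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShingleOverlap {
    // Groups every shingle of size min..max by the position of its first word.
    // Illustration only: this mimics the position layout, not Lucene's API.
    public static Map<Integer, List<String>> shingle(String text, int min, int max) {
        String[] words = text.trim().split("\\s+");
        Map<Integer, List<String>> byPosition = new TreeMap<>();
        for (int start = 0; start < words.length; start++) {
            for (int size = min; size <= max && start + size <= words.length; size++) {
                String term = String.join(" ", Arrays.copyOfRange(words, start, start + size));
                byPosition.computeIfAbsent(start, k -> new ArrayList<>()).add(term);
            }
        }
        return byPosition;
    }

    public static void main(String[] args) {
        // Prints {0=[quick brown, quick brown fox], 1=[brown fox]}
        System.out.println(shingle("quick brown fox", 2, 3));
    }
}
```

Position 0 carries two terms here, which is the same shape a synonym
filter produces, so a query parser has no way to tell the two cases apart.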

For the record, in addition to the problems that you mentioned,
ShingleFilter proved very hard to fix so that it works correctly on
top of synonyms when X != Y[1], which encouraged Alan to work on a new
FixedShingleFilter[2] that deals with index-time synonyms (i.e. ignores
position length) just fine but only allows X == Y. Also, instead of feeding
an analyzer with shingles to the query parser, we found it more
user-friendly to add an option to text fields in order to index 2-shingles
into a separate field and redirect phrase queries to it.[3] We did
something similar for edge-ngrams[4] to optimize prefix queries, because of
the same problem: you need more than appending an EdgeNGramTokenFilter
to your analysis chain to make prefix queries efficient. In the end we might
remove the ability to set shingle or ngram filters in analyzers and just
make them implementation details of the aforementioned options.

[1] https://issues.apache.org/jira/browse/LUCENE-3475
[2] https://issues.apache.org/jira/browse/LUCENE-8202
[3] https://github.com/elastic/elasticsearch/pull/30450
[4] https://github.com/elastic/elasticsearch/pull/28290
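
The separate-field approach can be pictured with a small plain-Java
sketch. It is not the implementation from [3]: the `BigramPhraseRewrite`
class, its helpers, and the `body.shingles` field name are all made up for
illustration, and the returned strings merely describe the kind of query
that would be built. With a fixed shingle size of 2, each position holds
exactly one term, so a two-word phrase becomes a single term lookup and a
longer phrase becomes an ordinary phrase over consecutive 2-shingles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BigramPhraseRewrite {
    // Turns the words of a phrase into consecutive 2-shingles, one per position.
    public static List<String> toBigrams(List<String> words) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            out.add(words.get(i) + " " + words.get(i + 1));
        }
        return out;
    }

    // Describes (as text only) where a phrase would be redirected.
    // 'field' and 'shingleField' are hypothetical field names.
    public static String describe(String field, String shingleField, String phrase) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty phrase");
        }
        List<String> words = Arrays.asList(trimmed.split("\\s+"));
        if (words.size() == 1) {
            // Nothing to shingle: stay on the main field.
            return "TermQuery(" + field + ":" + words.get(0) + ")";
        }
        List<String> bigrams = toBigrams(words);
        if (bigrams.size() == 1) {
            // A two-word phrase is exactly one indexed 2-shingle.
            return "TermQuery(" + shingleField + ":" + bigrams.get(0) + ")";
        }
        return "PhraseQuery(" + shingleField + ":" + bigrams + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("body", "body.shingles", "quick brown"));
        System.out.println(describe("body", "body.shingles", "quick brown fox"));
    }
}
```

Because only one shingle size is ever indexed in the extra field, the
overlapping-position problem discussed above does not arise.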

On Tue, Jul 31, 2018 at 00:46, Chris Hostetter <hossman_lucene@fucit.org>
wrote:

>
> Although I've been aware of Shingles and some of the useful applications for
> a long time, today is the first time I really sat down and tried to do
> something non-trivial with them myself.
>
> My objective seems relatively straightforward: given a corpus of text and
> some analyzer (for sake of discussion let's assume simple whitespace
> tokenization w/lowercasing) I want to be able to say "I am happy to trade
> index time/size for faster queries of shorter phrases"
>
> So instead of just indexing "the quick brown fox jumped over the lazy dog"
> as a field with 9 terms, I might want to add ShingleFilterFactory to the
> end of my analyzer using [[minShingleSize="2" maxShingleSize="2"
> outputUnigrams="true"]] and now I have a field w/17 terms, but if I get a
> query for a "phrase" of 2 words/terms, I should in theory be able to just
> use a TermQuery under the covers -- making it just as "fast" as a query for a
> single word/term.  But meanwhile longer phrases should still "just work"
> as if I didn't have any shingles.
>
> So far so good...
>
> If I actually index a corpus as described above, and then at query time I
> use ShingleFilterFactory w/ [[minShingleSize="2" maxShingleSize="2"
> outputUnigramsIfNoShingles="true" outputUnigrams="false"]] I get the
> expected TermQuery for either a single word input or two-word input ...
> for input "phrases" longer than 2 terms I get a PhraseQuery -- albeit one
> composed of bi-shingles instead of individual unigrams, but AFAICT the
> position info is set correctly so that it will only match the documents
> that would have been matched w/o any shingles (and IIUC the term stats
> for the shingles seem like they should probably result in subjectively
> "better" scores? not certain on this bit, but also not overly concerned
> about it)
>
> The problem is that (unless I'm missing something) this doesn't really
> work if I want to use an arbitrary 'maxShingleSize="N"' where N>2.
>
> If I change my index time ShingleFilterFactory to use [[minShingleSize="2"
> maxShingleSize="N" outputUnigrams="true"]] the equivalent change to the
> query time analyzer would be [[minShingleSize="2" maxShingleSize="N"
> outputUnigramsIfNoShingles="true" outputUnigrams="false"]] -- and while
> that does seem to cause "phrase" input of all sizes to be converted by the
> analyzer+QueryParser into a query that (AFAICT) will match the correct
> documents (compared to using no shingles) it's only "optimized" as a
> TermQuery for one & two word phrases.  For input phrases longer than 2
> terms it generates a SpanOrQuery wrapping multiple SpanNearQueries,
> I believe because of the overlapping positions of the bi/tri/quad-etc..
> shingles.
>
> There just doesn't seem to be any good/generic way to leverage a field
> built with an analyzer that uses [[minShingleSize="X" maxShingleSize="Y"]]
> (where X != Y) at query time using a QueryParser configured with out of
> the box analyzer components.
>
> It seems like what's missing is a ShingleFilter(Factory) configuration
> that means "output the maximum possible shingle size between MIN and
> MAX based on the size of the input stream" ... but that doesn't seem to
> exist.
>
> Does anyone have any advice/suggestions on how to approach this type of
> problem based on their own experiences?  Does anyone have first hand
> experience using maxShingleSize > 2 with a QueryParser (and w/o any
> preconceived assumptions about the length of the input) ?
>
> -Hoss
> http://www.lucidworks.com/
>
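
The configuration wished for in the quoted message ("output the maximum
possible shingle size between MIN and MAX based on the size of the input
stream") could behave roughly like the following plain-Java sketch. No
such ShingleFilter option is confirmed anywhere in this thread; the
`MaxSizeShingles` class and its `maxSizeOnly` helper are hypothetical, and
whitespace tokenization is assumed. The sketch picks a single shingle size,
the largest one allowed by both MAX and the input length, and falls back
to unigrams when the input is shorter than MIN.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MaxSizeShingles {
    // Emits shingles of one size only: min(max, number of words).
    // Inputs shorter than 'min' are returned as plain unigrams.
    public static List<String> maxSizeOnly(String text, int min, int max) {
        String[] words = text.trim().split("\\s+");
        int size = Math.min(max, words.length);
        if (size < min) {
            size = 1; // too short to shingle: fall back to unigrams
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start + size <= words.length; start++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [the quick brown, quick brown fox]
        System.out.println(maxSizeOnly("the quick brown fox", 2, 3));
    }
}
```

Each position then carries exactly one term, so a phrase of any length maps
to either a single term or a plain sequence of same-sized shingles.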
