lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jim Ferenczi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder
Date Tue, 17 Jan 2017 14:00:29 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826085#comment-15826085
] 

Jim Ferenczi commented on LUCENE-7638:
--------------------------------------

Thanks for taking a look @mattweber and [~mikemccand]

{quote}
Can this handle "side paths on side paths" graph structure (I think you called this "nested
multi-term synonyms")? While no analysis chain can naturally produce this today (that I know
of!), the TokenStream attributes can easily express it. And you could imagine it happening
in the future, e.g. if you use Kuromoji tokenizer or WordDelimiterGraphFilter followed by
a SynonymGraphFilter (except we'd need to e.g. use the synonym graph filter from LUCENE-5012,
which can correctly consume a graph). If this is expected to work maybe we should add a test
case showing that?
{quote}

I'll add a test case because it's expected to work ;) . This is also the reason why this patch
does not produce {code}PhraseQuery{code} for synonyms.   For simple "side paths" this is easy
to do but we would need to switch to Span queries for "side paths on side paths" so I though
that it could be done in another issue. 

{quote}
It seems like you don't need to be using Term here, except at the end to pass to the newXXXQuery,
since everything is in a single field here, and we are hoping to move away from Term entirely
(LUCENE-7632)?
{quote}

Thanks, I'll simplify the patch.

{quote}
Holes are challenging for graph token streams ... can you add a test case that encounters
holes, e.g. simulated StopFilter? There are at least two "fun" cases: a hole that cuts the
graph entirely into two partitions, and a synonym spanning over a hole ... CannedTokenStream
is useful for feeding such "interesting" cases.
{quote}

I though that holes would not be a problem for boolean queries but now I am not sure. I'll
test that.

{quote}
The Path.id seems to be used only for tie-breaking on compare, not for lookup in the TreeSet
as the comment suggests?
{quote}

The comment is misleading. It is needed because I use {code}TreeSet#remove{code} which uses
compare to check object equality. So the {code}Path.id{code} is the unique identifier for
the path.
 




> Optimize graph query produced by QueryBuilder
> ---------------------------------------------
>
>                 Key: LUCENE-7638
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7638
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream contains token
with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where synonyms can
span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph TokenStream.
For instance if you have a synonym like "ny, new york" and you search for "new york city",
the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient to build
a boolean query that merges all the intersections in the graph. So instead of "new york city",
"ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each intersection in the
graph. This is not done in this patch but we could also create a specialized query that gives
equivalent scores to multi terms synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could also use
the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple muli-terms
synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message