lucene-dev mailing list archives

From "Adriano Crestani (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1486) Wildcards, ORs etc inside Phrase queries
Date Wed, 22 Jul 2009 18:39:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734241#action_12734241 ]

Adriano Crestani commented on LUCENE-1486:
------------------------------------------

Hi Mark H.,

Thanks for the response, some comments inline:

{quote}
Correct, the "inner phrase" example was a term not a phrase. This is perhaps a better example:

checkBadQuery("\"jo* \"percival smith\" \""); //phrases inside phrases is bad
{quote}

I think you did not get what I meant. Even in your new example there is no inner phrase.
It parses as a phrase <"jo* ">, followed by a term <percival>, followed by another term
<smith>, and then an empty phrase <" ">. So, with your change, the junit test passes, but
for the wrong reason: it gets an exception complaining about the empty phrase, not about
an inner phrase (I still don't see how you can type an inner phrase with the current
syntax). I don't think it's a big deal; I'm just trying to understand and to flag a test
that is probably wrong. I hope you understand what I mean; let me know if I did not make it clear.
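
To make the point about quote pairing concrete, here is a minimal sketch, assuming only that
double quotes pair up left to right and cannot nest. It is not the real QueryParser grammar;
the class and method names (QuoteSplitSketch, segments) are hypothetical and exist only for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: NOT Lucene's QueryParser.
// Assumption: every double quote toggles between "inside a phrase" and "outside".
public class QuoteSplitSketch {

    /** Splits the input into PHRASE<...> and TERM<...> segments by toggling on each quote. */
    public static List<String> segments(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inPhrase = false;
        for (char c : input.toCharArray()) {
            if (c == '"') {
                if (inPhrase) {
                    // Closing quote: everything buffered is one phrase, even if blank.
                    out.add("PHRASE<" + buf + ">");
                } else {
                    // Opening quote: text buffered so far was outside any phrase.
                    addTerms(buf.toString(), out);
                }
                buf.setLength(0);
                inPhrase = !inPhrase;
            } else {
                buf.append(c);
            }
        }
        if (inPhrase) {
            out.add("UNTERMINATED<" + buf + ">");
        } else {
            addTerms(buf.toString(), out);
        }
        return out;
    }

    // Text outside quotes is split on whitespace into individual terms.
    private static void addTerms(String text, List<String> out) {
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add("TERM<" + t + ">");
            }
        }
    }

    public static void main(String[] args) {
        // The query string from the junit test, after Java string unescaping.
        System.out.println(segments("\"jo* \"percival smith\" \""));
    }
}
```

Under that assumption the test query splits into a phrase, two terms, and a blank phrase, so
a nested phrase never appears in the input at all.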

{quote}
The Junit is currently the main form of documentation
{quote}

But that is not ideal, because the source code (including the junit code) is not shipped in
the binary release. So the ideal place for this documentation is the javadocs.

{quote}

    * Wildcard/fuzzy/range clauses can be used to define a phrase element (as opposed to simply
single terms)
    * Brackets are used to group/define the acceptable variations for a given phrase element
e.g. "(john OR jonathon) smith"
    * "AND" is irrelevant - there is effectively an implied "AND_NEXT_TO" binding all phrase
elements

{quote}

Thanks, now it's clearer to me what is supported and what is not. I have some questions:

I understand the implicit AND_NEXT_TO operator between the queries inside the phrase. However,
what happens if the user does not type any explicit boolean operator between two terms inside
parentheses: "(query parser) lucene"? Is the operator between 'query' and 'parser' the implicit
AND_NEXT_TO or the default boolean operator (usually OR)?

What happens if I type "(query AND parser) lucene"? From my point of view it is "(query AND
parser) AND_NEXT_TO lucene", which to me means: find any document that contains both the term
'query' and the term 'parser' at position x, and the term 'lucene' at position x+1.
Is this the expected behaviour?
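
To spell out the two readings I am asking about, here is a toy sketch. It is not Lucene code
and says nothing about what ComplexPhraseQueryParser actually does; the document model (a
list of token sets, one set per position) and the names PhraseElementSketch, matches and pos
are assumptions made up for this illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: a toy positional matcher, NOT Lucene's implementation.
public class PhraseElementSketch {

    /**
     * Returns true if some position x satisfies the first phrase element and
     * position x+1 holds the 'next' term.
     * requireAll = true  models "(a AND b) AND_NEXT_TO next"
     * requireAll = false models "(a OR b) AND_NEXT_TO next"
     */
    public static boolean matches(List<Set<String>> doc, List<String> first,
                                  boolean requireAll, String next) {
        for (int x = 0; x + 1 < doc.size(); x++) {
            Set<String> here = doc.get(x);
            boolean firstOk = requireAll
                    ? here.containsAll(first)
                    : first.stream().anyMatch(here::contains);
            if (firstOk && doc.get(x + 1).contains(next)) {
                return true;
            }
        }
        return false;
    }

    /** Helper: the set of tokens indexed at one position. */
    public static Set<String> pos(String... tokens) {
        return new HashSet<>(Arrays.asList(tokens));
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("query", "parser");

        // One token per position: "query parser lucene".
        List<Set<String>> plain = Arrays.asList(pos("query"), pos("parser"), pos("lucene"));
        // Two tokens stacked at position 0, e.g. by an analyzer that injects synonyms.
        List<Set<String>> stacked = Arrays.asList(pos("query", "parser"), pos("lucene"));

        System.out.println(matches(plain, first, true, "lucene"));   // AND reading
        System.out.println(matches(plain, first, false, "lucene"));  // OR reading
        System.out.println(matches(stacked, first, true, "lucene")); // AND reading, stacked tokens
    }
}
```

In this toy model the AND reading can only match when both terms are indexed at the same
position, which is why I would like to know which behaviour is intended.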

{quote}
1) Keep in core and improve error reporting and documentation
2) Move into "contrib" as experimental
3) Retain in core but simplify it to support only the simplest syntax (as in my Britney~ example)
4) Re-engineer the QueryParser.jj to support a formally defined syntax for acceptable "within
phrase" operators e.g. *, ~, ( )
{quote}

1 is good, but I would prefer 4 too. Documentation and throwing the right exception are necessary.
I just don't feel comfortable with the complex phrase query parser relying on the main query
parser syntax; any change to the main one could easily break the complex phrase QP. Anyway,
4 may be done in the future :)

Mark M.:

{quote}
With the new info from Mark H, how hard would it be to create a new imp for the new parser
that did a lot of this, in a more defined way? It seems you basically just want to be able
to use multiterm queries and group/or things, right? We could even relax a little if we have
to. This hasn't been released, so there is still a lot of wiggle room I think. But there does
have to be a resolution with this and the new parser at some point either way.
{quote}

Yes, I am working on the new query parser code. I recently started to read and understand
how the ComplexPhraseQP works, so that I can reproduce its behaviour using the new QP framework.
I first tried to look at this QP as a user and could not figure out what exactly I can or
cannot do with it. I think we are now hitting a big problem, which is related to documentation.
That is why I started raising these questions, because others could run into the same issues
in the future.

So, yes, I can start coding an equivalent QP using the new QP framework; I'm just questioning
and trying to understand everything before I start any coding. I don't wanna code anything
that will throw ConcurrentModificationExceptions, that's why I'm raising these issues now,
before I start moving it to the new QP.

Best Regards,
Adriano Crestani Campos


> Wildcards, ORs etc inside Phrase queries
> ----------------------------------------
>
>                 Key: LUCENE-1486
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1486
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: QueryParser
>    Affects Versions: 2.4
>            Reporter: Mark Harwood
>            Assignee: Mark Harwood
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ComplexPhraseQueryParser.java, junit_complex_phrase_qp_07_21_2009.patch,
junit_complex_phrase_qp_07_22_2009.patch, LUCENE-1486.patch, LUCENE-1486.patch, LUCENE-1486.patch,
LUCENE-1486.patch, TestComplexPhraseQuery.java
>
>
> An extension to the default QueryParser that overrides the parsing of PhraseQueries to
allow more complex syntax e.g. wildcards in phrase queries.
> The implementation feels a little hacky - this is arguably better handled in QueryParser
itself. This works as a proof of concept  for much of the query parser syntax. Examples from
the Junit test include:
> 		checkMatches("\"j*   smyth~\"", "1,2"); //wildcards and fuzzies are OK in phrases
> 		checkMatches("\"(jo* -john)  smith\"", "2"); // boolean logic works
> 		checkMatches("\"jo*  smith\"~2", "1,2,3"); // position logic works.
> 		
> 		checkBadQuery("\"jo*  id:1 smith\""); //mixing fields in a phrase is bad
> 		checkBadQuery("\"jo* \"smith\" \""); //phrases inside phrases is bad
> 		checkBadQuery("\"jo* [sma TO smZ]\" \""); //range queries inside phrases not supported
> Code plus Junit test to follow...

