Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Message-ID: <27514625.71641274025162609.JavaMail.jira@thor>
Date: Sun, 16 May 2010 11:52:42 -0400 (EDT)
From: "Robert Muir (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-2465) QueryParser should ignore
 double-quotes if mid-word
In-Reply-To: <2148872.66971273951782729.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868001#action_12868001 ]

Robert Muir commented on LUCENE-2465:
-------------------------------------

bq. Also, GERSHAYIM is simply not a valid argument - users cannot type Unicode, they type text.

I am suggesting we follow the rules of Unicode, for a few reasons.
# This is not unique to Hebrew gershayim. The same problem is found in numerous other languages, where query parser syntax overlaps with "incorrect Unicode" text in those languages. I have this same issue with the conflation of : and Bengali ঃ, and in some other charsets there is only one glyph for both.
# Adding a heuristic that does not obey the rules of Unicode may seem perfectly harmless, but it risks breaking other languages. This is like what happens to Chinese text today.
# Disambiguating when a " should be a gershayim is really app-dependent, just like disambiguating when : should be ঃ. It's a subproblem of character set conversion (which is not always lossless and exact), and charset conversion doesn't belong in the query parser.
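
To make the distinction concrete, here is a minimal Python sketch (not Lucene code). It shows that the look-alike characters above are separate code points, and how an application that knows its input is Hebrew could do the substitution itself before the text reaches the query parser. The helper names (describe, is_hebrew_letter, to_gershayim) and the "between two Hebrew letters" rule are assumptions made up for this illustration; whether such a rule is right depends on the application.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


def is_hebrew_letter(ch):
    # Assumption for this sketch: only the basic letters alef..tav count.
    return "\u05D0" <= ch <= "\u05EA"


def to_gershayim(text):
    """Hypothetical app-side step, run before query parsing.

    Replaces an ASCII double quote that sits directly between two Hebrew
    letters with U+05F4 (HEBREW PUNCTUATION GERSHAYIM). Everything else,
    including quotes next to non-Hebrew text, is left untouched.
    """
    out = []
    for i, ch in enumerate(text):
        if (ch == '"' and 0 < i < len(text) - 1
                and is_hebrew_letter(text[i - 1])
                and is_hebrew_letter(text[i + 1])):
            out.append("\u05F4")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # The two look-alike pairs mentioned above.
    for ch in ['"', "\u05F4", ":", "\u0983"]:
        print(describe(ch))
```

With this in place, a Hebrew acronym typed with an ASCII quote between its letters comes out with U+05F4 instead, while a string such as Foo"bar test" passes through unchanged, so the query parser itself needs no special case.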

So, adding some of the heuristics I see here will change phrase queries, for example, for languages that don't use spaces between words, like Thai. Trying to base it on Unicode properties is very risky: ultimately it will probably break some language, because words aren't just sequences of letters separated by whitespace in all languages.
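
The effect can be shown with a simplified model in Python. This is not the JavaCC grammar and not how Lucene tokenizes; find_phrases and its require_whitespace flag are hypothetical names for this sketch. It only extracts quoted spans, either treating every double quote as a delimiter or applying the proposed whitespace rule.

```python
def find_phrases(query, require_whitespace):
    """Return the quoted phrases found in `query` (simplified model).

    With require_whitespace=False, every double quote opens or closes a
    phrase. With require_whitespace=True, an opening quote counts only at
    the start of the input or right after whitespace, and a closing quote
    only at the end of the input or right before whitespace.
    """
    phrases = []
    start = None  # index of the currently open quote, if any
    for i, ch in enumerate(query):
        if ch != '"':
            continue
        if start is None:
            opens = i == 0 or query[i - 1].isspace()
            if opens or not require_whitespace:
                start = i
        else:
            closes = i == len(query) - 1 or query[i + 1].isspace()
            if closes or not require_whitespace:
                phrases.append(query[start + 1:i])
                start = None
    return phrases


if __name__ == "__main__":
    # Thai text is written without spaces, so the quotes sit mid-text.
    thai = 'ฉันรัก"ภาษาไทย"มาก'
    print(find_phrases(thai, require_whitespace=False))
    print(find_phrases(thai, require_whitespace=True))
```

In this model the Thai input yields one phrase when quotes are always delimiters and no phrase at all under the whitespace rule, even though the user clearly quoted a phrase. A space-separated query such as foo "bar test" gives the same result either way, which is why the heuristic looks harmless when tested only on such languages.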

Furthermore, by following Unicode, we keep QP simpler, and it won't unintentionally or unknowingly break for any existing or future languages (such as ones not even in Unicode yet).


> QueryParser should ignore double-quotes if mid-word
> ---------------------------------------------------
>
>                 Key: LUCENE-2465
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2465
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: QueryParser
>    Affects Versions: 1.9, 2.0.0, 2.1, 2.2, 2.3, 2.3.1, 2.3.2, 2.3.3, 2.4, 2.4.1, 2.4.2, 2.9, 2.9.1, 2.9.2, 2.9.3, 3.0, Flex Branch, 3.0.1, 3.0.2, 3.1, 4.0
>            Reporter: Itamar Syn-Hershko
>
> The current implementation of Lucene's QueryParser identifies a phrase in the query when it hits a double-quote char, even if it is mid-word. For example, the string ' Foo"bar test" ' will produce a BooleanQuery, holding one term and one PhraseQuery ("bar test"). This behavior is somewhat flawed; a Phrase is a group of words surrounded by double quotes as defined by http://lucene.apache.org/java/2_4_0/queryparsersyntax.html, but nowhere does it say double-quotes will also tokenize the input. Arguably, a phrase should only be identified as such when it is also surrounded by whitespace.
> Besides being logically incorrect, this makes parsing of Hebrew acronyms impossible. Hebrew acronyms contain one double-quote char in the middle of a word (for example, MNK"L), which causes the QP to throw a syntax exception: it expects another double-quote to close a phrase query, essentially splitting the acronym in two.
> The solution to this is pretty simple - change the JavaCC syntax to check whether whitespace precedes the double-quote when a phrase opening is expected, or peek to see whether whitespace follows the double-quote when a phrase closing is expected.
> This will both eliminate a logically incorrect behavior, which shouldn't be relied on anyway, and allow Hebrew queries to be parsed correctly even when they contain acronyms.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org