Message-ID: <18412785.1196405623055.JavaMail.jira@brutus>
Date: Thu, 29 Nov 2007 22:53:43 -0800 (PST)
From: "Shai Erera (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
In-Reply-To: <29758935.1196164483461.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-3.patch

The previous
patch I posted was incorrect, since it would still break existing applications. The current patch does the following:

1. Introduces a new type, ACRONYM_DEP, which is deprecated and recognizes the old ACRONYM format.
2. Fixes ACRONYM to recognize {LETTER} "." ({LETTER} ".")+.
3. Adds a public member, replaceDepAcronym, to StandardTokenizer and StandardAnalyzer. It can be set if the application would like the deprecated acronym format to be treated as ACRONYM or HOST. The default behavior, if it is not set, is to recognize the old ACRONYM as HOST.

This is how it should be used:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public static void main(String[] args) throws Exception {
        parse("www.abc.com.", false);
        parse("www.abc.com.", true);
        parse("www.abc.com", true);
        parse("I.B.M.", true);
    }

    // Tokenizes the text with StandardAnalyzer and prints every token.
    public static void parse(String text, boolean replaceDepAcronym) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Public member added by this patch (see item 3 above).
        analyzer.replaceDepAcronym = replaceDepAcronym;
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }
    }

And here is the output:

    (wwwabccom,0,12,type=)
    (www.abc.com.,0,12,type=)
    (www.abc.com,0,11,type=)
    (ibm,0,6,type=)

The member is marked deprecated so we can remove it in the next release. Applications that want the new behavior need to do nothing, and therefore will not be affected once we remove that member. Applications that want the old behavior need to set it explicitly, and then remove that setting in the next major release.

I think that solves it. How should I proceed?
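
For anyone comparing the old and new ACRONYM definitions (item 2 above), the difference can be sketched with java.util.regex instead of JFlex. This is only an illustration and not code from the patch: the class and method names are made up, LETTER is narrowed to ASCII letters, and ALPHA is assumed to mean one or more letters. It also checks whole strings, whereas the real tokenizer scans for the longest match within a stream.

```java
import java.util.regex.Pattern;

public class AcronymPatternSketch {
    // Assumption: ALPHA is "one or more letters", so the old rule
    // ALPHA "." (ALPHA ".")+ accepts dotted words of any length.
    static final Pattern OLD_ACRONYM =
            Pattern.compile("[A-Za-z]+\\.(?:[A-Za-z]+\\.)+");

    // New rule: LETTER "." (LETTER ".")+ accepts single letters only.
    // Assumption: LETTER is narrowed to ASCII here for brevity.
    static final Pattern NEW_ACRONYM =
            Pattern.compile("[A-Za-z]\\.(?:[A-Za-z]\\.)+");

    // True if the whole string fits the old (deprecated) acronym shape.
    static boolean isOldAcronym(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    // True if the whole string fits the corrected acronym shape.
    static boolean isNewAcronym(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"www.abc.com.", "www.abc.com", "I.B.M."};
        for (String text : samples) {
            System.out.println(text
                    + " old=" + isOldAcronym(text)
                    + " new=" + isNewAcronym(text));
        }
    }
}
```

Under these assumptions, "www.abc.com." fits only the old pattern, "I.B.M." fits both, and "www.abc.com" fits neither, which is consistent with the behavior described in the issue.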

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>         Attachments: StandardTokenizerImpl-2.patch, StandardTokenizerImpl-3.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>     Analyzer analyzer = new StandardAnalyzer();
>     TokenStream ts = analyzer.tokenStream("content", new StringReader(""));
>     Token t;
>     while ((t = ts.next()) != null) {
>         System.out.println(t);
>     }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It effectively prevents you from putting URLs at the end of a sentence, which is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
>     // acronyms: U.S.A., I.B.M., etc.
>     // use a post-filter to remove dots
>     ACRONYM = {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment describes acronyms as U.S.A. and I.B.M., and not anything else. I changed the definition to
>     ACRONYM = {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org