Message-ID: <10309643.1196343463276.JavaMail.jira@brutus>
Date: Thu, 29 Nov 2007 05:37:43 -0800 (PST)
From: "Shai Erera (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
In-Reply-To: <29758935.1196164483461.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shai Erera updated LUCENE-1068:
-------------------------------

    Attachment: StandardTokenizerImpl-2.patch

I've
found a way to do it (I think): I've added a new type called ACRONYM_DEP that identifies the old ACRONYMs, and fixed the current ACRONYM to identify proper ones. I also marked ACRONYM_DEP as deprecated.

I added code to StandardTokenizer to set the type of a token to HOST if the type returned is ACRONYM_DEP. This behavior can be changed if you think the type should be set to ACRONYM instead, in case there are applications that count on the Token type.

I wrote this short test to verify it works:

    public static void main(String[] args) throws Exception {
        // A host name with a trailing dot, the same host name without it, and a real acronym.
        parse("www.abc.com.");
        parse("www.abc.com");
        parse("I.B.M.");
    }

    // Runs the text through StandardAnalyzer and prints every token it produces.
    public static void parse(String text) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        Token t;
        while ((t = ts.next()) != null) {
            System.out.println(t);
        }
    }

And the output is:

    (www.abc.com.,0,12,type=)
    (www.abc.com,0,11,type=)
    (ibm,0,6,type=)

> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
>                 Key: LUCENE-1068
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1068
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Shai Erera
>         Attachments: StandardTokenizerImpl-2.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>         Analyzer analyzer = new StandardAnalyzer();
>         TokenStream ts = analyzer.tokenStream("content", new StringReader(""));
>         Token t;
>         while ((t = ts.next()) != null) {
>             System.out.println(t);
>         }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=) (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It effectively prevents you from putting URLs at the end of a sentence, which is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM    =  {ALPHA} "." ({ALPHA} ".")+
> Notice how the comment refers to acronyms such as U.S.A. and I.B.M., and not to anything else. I changed the definition to
> ACRONYM    =  {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
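
For anyone who wants to see the difference between the two ACRONYM definitions without regenerating the tokenizer, here is a rough sketch that uses java.util.regex as a stand-in for the JFlex rules. It is not the actual grammar: it assumes that ALPHA means "one or more letters" and LETTER means "a single letter" (approximated here with \p{L}), and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

public class AcronymRuleSketch {
    // Assumption: {ALPHA} "." ({ALPHA} ".")+ with ALPHA = one or more letters.
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Assumption: {LETTER} "." ({LETTER} ".")+ with LETTER = exactly one letter.
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    // True if the whole string matches the old (ALPHA-based) rule.
    static boolean matchesOld(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    // True if the whole string matches the proposed (LETTER-based) rule.
    static boolean matchesNew(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"www.abc.com.", "www.abc.com", "I.B.M."};
        for (String s : samples) {
            System.out.println(s + " old=" + matchesOld(s) + " new=" + matchesNew(s));
        }
    }
}
```

Under these assumptions, "www.abc.com." matches only the old rule, "I.B.M." matches both, and "www.abc.com" matches neither, which is consistent with the behavior described in the issue.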