From: "Uwe Schindler (JIRA)"
To: dev@lucene.apache.org
Date: Thu, 27 May 2010 03:21:55 -0400 (EDT)
Message-ID: <5938853.17391274944915055.JavaMail.jira@thor>
Subject: [jira] Issue Comment Edited: (LUCENE-2167) Implement StandardTokenizer with the UAX#29 Standard

    [ https://issues.apache.org/jira/browse/LUCENE-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12872124#action_12872124 ]

Uwe Schindler edited comment on LUCENE-2167 at 5/27/10 3:21 AM:
----------------------------------------------------------------

Hi Steven,

looks cool, I have some suggestions:

- Must it be a Maven plugin? From what I see, the same thing could be done as a simple Java class with main(), like Robert's ICU converter. The external dependency on httpclient can be replaced by plain java.net.HttpURLConnection and the URL itself (you can even set the no-cache directives). It is much easier to invoke a Java method from Ant as a build step. So why not refactor a little bit to use a main() method that accepts the target directory?

- You use the HTML root zone database from IANA. The format of this file is hard to parse and may change suddenly. BIND administrators know that the root zone file is also available for BIND in the standardized named format @ [http://www.internic.net/zones/root.zone] (ASCII only, as DNS is ASCII only). You just have to use all rows that are not comments and contain "NS" as second token. The nameservers that follow are not needed; just use the DNS name in front. This should be much easier to do. A Python script may also work well.

- You can also write the date from the HTTP Last-Modified header (HttpURLConnection.getLastModified()) into the generated file.

- The database only contains the Punycode-encoded DNS names. But users use the non-encoded variants, so you should decode Punycode as well [we need ICU for that :( ] and create patterns for those, too.

- About changes in analyzer syntax because of regeneration: This should not be a problem, as the IANA only *adds* new zones to the file and very seldom removes some (like the old Yugoslavian zones). As e-mail and web addresses should *not* appear in tokenized text *before* their zone is in the zone file, it is no problem that they are suddenly marked as "URL/eMail" later (as they cannot appear before). So in my opinion we can update the zone database even in minor Lucene releases without breaking analyzers.

Fine idea!
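The root zone suggestion above could look roughly like the following. This is only a minimal sketch, not code from the attached patches: the class name RootZoneTlds and the methods extractTlds() and isNsRecord() are hypothetical. Instead of relying on "NS" being exactly the second token, it treats the first field after the owner name that is neither a numeric TTL nor the class "IN" as the record type, which is an assumption about the file layout. It also assumes Java 6 or later so that java.net.IDN can stand in for ICU when decoding Punycode, and it prints to stdout instead of writing into a target directory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.IDN;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Date;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the LUCENE-2167 patches.
final class RootZoneTlds {

  /** Returns the top-level domains that own an NS record, in encoded and decoded form. */
  static SortedSet<String> extractTlds(Reader zone) {
    SortedSet<String> tlds = new TreeSet<>();
    new BufferedReader(zone).lines().forEach(line -> {
      // Skip blank lines, comments (';'), and lines without an owner name in column one.
      if (line.isEmpty() || line.charAt(0) == ';' || Character.isWhitespace(line.charAt(0))) {
        return;
      }
      String[] tokens = line.split("\\s+");
      if (!isNsRecord(tokens)) {
        return;
      }
      String owner = tokens[0].toLowerCase(Locale.ROOT);
      if (owner.endsWith(".")) {
        owner = owner.substring(0, owner.length() - 1); // drop the trailing root dot
      }
      // Ignore the root zone itself and any name below the top level.
      if (owner.isEmpty() || owner.indexOf('.') >= 0) {
        return;
      }
      tlds.add(owner);                // ASCII form, e.g. "xn--p1ai"
      tlds.add(IDN.toUnicode(owner)); // decoded form; identical for plain ASCII names
    });
    return tlds;
  }

  /** Assumption: the record type is the first field after the owner that is not a TTL or "IN". */
  static boolean isNsRecord(String[] tokens) {
    for (int i = 1; i < tokens.length; i++) {
      String t = tokens[i];
      if (t.matches("\\d+") || t.equalsIgnoreCase("IN")) {
        continue;
      }
      return t.equalsIgnoreCase("NS");
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    URL url = new URL("http://www.internic.net/zones/root.zone");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setUseCaches(false);
    conn.setRequestProperty("Cache-Control", "no-cache");
    try (Reader in = new InputStreamReader(conn.getInputStream(), StandardCharsets.US_ASCII)) {
      SortedSet<String> tlds = extractTlds(in);
      // Record where and when the data came from, as suggested above.
      System.out.println("// Generated from " + url
          + ", Last-Modified: " + new Date(conn.getLastModified()));
      for (String tld : tlds) {
        System.out.println(tld);
      }
    } finally {
      conn.disconnect();
    }
  }
}
```

Since extractTlds() takes a plain Reader, it can be tried on a few hand-written zone lines without network access; only main() touches the URL.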
> Implement StandardTokenizer with the UAX#29 Standard
> ----------------------------------------------------
>
>                 Key: LUCENE-2167
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2167
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Shyamal Prasad
>            Assignee: Steven Rowe
>            Priority: Minor
>         Attachments: LUCENE-2167-lucene-buildhelper-maven-plugin.patch, LUCENE-2167.benchmark.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch, LUCENE-2167.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> It would be really nice for StandardTokenizer to adhere straight to the standard as much as we can with JFlex. Then its name would actually make sense.
> Such a transition would involve renaming the old StandardTokenizer to EuropeanTokenizer, as its javadoc claims:
> bq. This should be a good tokenizer for most European-language documents
> The new StandardTokenizer could then say
> bq. This should be a good tokenizer for most languages.
> All the English/Euro-centric stuff like the acronym/company/apostrophe stuff can stay with that EuropeanTokenizer, and it could be used by the European analyzers.