Date: Sat, 12 Nov 2011 01:56:51 +0000 (UTC)
From: "Robert Muir (Commented) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <145002696.23293.1321063011712.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <318605414.5210.1310455200127.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

[
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148933#comment-13148933 ]

Robert Muir commented on LUCENE-3305:
-------------------------------------

Looks like we want to add the Lucene analyzer/tokenizer and Solr factories from kuromoji-solr-0.5.3-asf.tar.gz.

I'd say once we get stuff going, maybe just download the dictionary, build it, and when committing, commit the built dictionary under the resources/ folder (this is where the script puts it).

I think for this kind of feature it might be hard to iterate with patches; we should maybe try to get it in SVN (trunk) initially and iterate with smaller issues. The code looks pretty clean to me already.

The produced jar file is somewhat large, but I think it's still reasonable, so I think we should look past this for now? From working with Sen before, I know some ways we can shrink this a lot, but that would be best on a future issue.

Some Java 6 APIs are here (e.g. Unicode normalization). Christian, can you confirm this is only for the dictionary-build stage? It looked to me like it's only needed for ipadic/unidic parsing, but not custom dictionary support.

If it's only for the build stage, personally I think that's fine for 3.x too, because I'm suggesting we commit a 'built' dictionary and tell people that if they want to compile the dictionary themselves they need Java 6. We could put the dictionary-building under a tools/ directory that's Java 6-only, or we could depend on ICU for just the tools/ piece (I think we already have such hacks for generating jflex rules for StandardTokenizer) and be fine on Java 5.
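
For reference, the Java 6 API in question is presumably java.text.Normalizer, which was added to the JDK in Java 6. A minimal sketch of what a normalization step in a dictionary build could look like follows. The choice of NFKC and the names NormalizationSketch and nfkc are assumptions for illustration, not taken from the Kuromoji sources.

```java
import java.text.Normalizer;

public class NormalizationSketch {
    // Hypothetical helper: NFKC folds compatibility variants such as
    // full-width Latin letters and half-width katakana into composed forms.
    // Whether the Kuromoji build uses NFKC specifically is an assumption.
    static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width "Lucene", written with escapes to keep the source ASCII-only.
        String fullWidthLatin = "\uFF2C\uFF55\uFF43\uFF45\uFF4E\uFF45";
        // Half-width katakana KA followed by the half-width voiced sound mark.
        String halfWidthGa = "\uFF76\uFF9E";

        System.out.println(nfkc(fullWidthLatin));
        // The two half-width code points compose into the single katakana GA.
        System.out.println(nfkc(halfWidthGa).equals("\u30AC"));
    }
}
```

On Java 5 this class does not exist, which is why ICU (whose Normalizer classes cover the same ground) is the alternative mentioned for the tools/ piece.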
+1 for the GraphVizFormatter...

> Kuromoji code donation - a new Japanese morphological analyzer
> --------------------------------------------------------------
>
>                 Key: LUCENE-3305
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3305
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Assignee: Simon Willnauer
>         Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese morphological analyzer to the Apache Software Foundation in the hope that it will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality, actively maintained and easy-to-use Java-based Japanese morphological analyzers, and these became many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, which we hope will interest Lucene and Solr users. Compound nouns, such as 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are segmented as one token with most analyzers. As a result, a search for 空港 (airport) or 新聞 (newspaper) will not give you a hit for these words. Kuromoji can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what you would want for search, and you'll get a hit.
> We also wanted to make sure the technology has a license that makes it compatible with other Apache Software Foundation software to maximize its usefulness. Kuromoji has an Apache License 2.0 and all code is currently owned by Atilika Inc. The software has been developed by my good friend and ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very much like to start the code grant process. I'm also happy to provide patches to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this. Thank you.
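
To make the search-mode point in the issue description concrete, here is a toy sketch of why splitting compounds changes what a term lookup finds. It is not Kuromoji's algorithm: the lookup table DECOMPOSITIONS and the methods tokenize and matches are hypothetical stand-ins, and the real analyzer derives its segmentation from the dictionary/statistical model rather than a hand-written map. The Japanese strings are written as Unicode escapes so the source stays ASCII-only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SearchSegmentationSketch {
    // Kansai International Airport, Nikkei Newspaper, airport, newspaper.
    static final String KANSAI_AIRPORT = "\u95A2\u897F\u56FD\u969B\u7A7A\u6E2F";
    static final String NIKKEI = "\u65E5\u672C\u7D4C\u6E08\u65B0\u805E";
    static final String AIRPORT = "\u7A7A\u6E2F";
    static final String NEWSPAPER = "\u65B0\u805E";

    // Hypothetical hand-written decompositions (assumption for illustration).
    static final Map<String, List<String>> DECOMPOSITIONS = new HashMap<>();
    static {
        DECOMPOSITIONS.put(KANSAI_AIRPORT,
                Arrays.asList("\u95A2\u897F", "\u56FD\u969B", AIRPORT));
        DECOMPOSITIONS.put(NIKKEI,
                Arrays.asList("\u65E5\u672C", "\u7D4C\u6E08", NEWSPAPER));
    }

    // Without search mode the compound stays one token; with it, the
    // compound is replaced by its parts when a decomposition is known.
    static List<String> tokenize(String compound, boolean searchMode) {
        if (searchMode && DECOMPOSITIONS.containsKey(compound)) {
            return DECOMPOSITIONS.get(compound);
        }
        return Collections.singletonList(compound);
    }

    // An inverted index matches whole terms, so a query term hits only
    // if it equals one of the indexed tokens exactly.
    static boolean matches(List<String> indexedTokens, String queryTerm) {
        return indexedTokens.contains(queryTerm);
    }

    public static void main(String[] args) {
        System.out.println("single token, query airport: "
                + matches(tokenize(KANSAI_AIRPORT, false), AIRPORT));
        System.out.println("search mode, query airport: "
                + matches(tokenize(KANSAI_AIRPORT, true), AIRPORT));
        System.out.println("search mode, query newspaper: "
                + matches(tokenize(NIKKEI, true), NEWSPAPER));
    }
}
```

The sketch replaces the compound with its parts for simplicity; whether an analyzer also keeps the full compound as a token is a separate design choice that this example does not model.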