Message-ID: <20021218025537.91321.qmail@web12701.mail.yahoo.com>
Date: Tue, 17 Dec 2002 18:55:37 -0800 (PST)
From: Otis Gospodnetic
Subject: RE: Analyzers for various languages
To: Lucene Developers List
In-Reply-To: <187D6D956106D84E9D8B280F6458FE140F5AE1@merc12.na.sas.com>

Eric,

I'd rather keep it in the sandbox, at least during development.
We can always move it to core Lucene repository later on.
Lucene's build.xml can get some new targets and some optional
properties in *build.properties files that point to the location where
the source for different analyzers is, so one could compile it using
Lucene's build.xml as well, by simply setting the appropriate
property(-ies).
If people like this so much that the core starts looking like a better
place to keep this, we can move it.

Those are my thoughts....
Otis

--- Eric Isakson wrote:
> Hi Otis,
>
> I was thinking I'd stay clear of the word "lang" in these as some are
> suitable for multiple languages (like the ispell based Analyzer).
>
> I was also thinking the core would be a better place for these so
> there would be no manual intervention to make them part of the
> release (just my opinion, wondering what others think about this).
> How about we add an analyzers subdir parallel to the src/java dir in
> core. This follows the same pattern used for the demo and test code.
> So the source tree would look something like:
>
> src/
>   analyzers/
>     org/
>       apache/
>         lucene/
>           analysis/
>             ru/
>             standard/
>             de/
>             ispell/
>             ...
>   demo/
>     ... demo packages here ...
>   java/
>     ... core packages here ...
>   test/
>     ... test packages here ...
>
>
> I would then modify build.xml to compile these and create appropriate
> jars for each analyzer and add the appropriate targets to the
> "package" target's depends list. So we would end up with something
> like:
>
>   lucene-1.3-dev1.jar
>   lucene-demos-1.3-dev1.jar
>   lucene-analyzers-1.3-dev1.jar (all analyzers for convenience)
>   lucene-analyzer-ru-1.3-dev1.jar
>   lucene-analyzer-standard-1.3-dev1.jar
>   lucene-analyzer-de-1.3-dev1.jar
>   lucene-analyzer-ispell-1.3-dev1.jar
>   ...
>
> So, someone who just wants all the analyzers need only get the
> lucene-analyzers-1.3-dev1.jar (and the core) and people that want a
> small distribution, just get the core and specific analyzers they
> need to distribute.
>
> Would you still prefer these were moved to the sandbox? I could
> certainly work on this either way once a direction is set by those
> that vote on such things.
>
> I have no analyzers I've developed myself yet. I was thinking I would
> contact those with contributions that haven't been committed yet and
> (with permission) help with some of the grunt work of adding the
> appropriate license headers and creating the necessary patches for
> review and inclusion.
>
> Eric
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Saturday, December 14, 2002 11:17 AM
> To: Lucene Developers List
> Subject: Re: Analyzers for various languages
>
>
> Hello Eric,
>
> Thanks for volunteering. I agree with your suggestions in the last
> paragraph of your email. I think a suitable place for the
> language-specific code would be Lucene Sandbox, at least for now.
> I think a single project dir for all of them would suffice, organized
> like this:
>
> jakarta-lucene-sandbox/projects/lang/
>
> Underneath the code could be in:
>
> src/java/org/apache/lucene/analysis/de
> src/java/org/apache/lucene/analysis/en
> src/java/org/apache/lucene/analysis/fr
> src/java/org/apache/lucene/analysis/ru
> ....
> which is just like it is in the core Lucene now.
> (Maybe something with 'lang' in it would be better, not sure...?)
>
> If you want I can create the structure and you can post your code to
> lucene-dev for review before committing it in the CVS.
>
> As far as building these, and making them easily available to users
> of core Lucene, I think it would be nice to have a build file that can
> build each language separately and put it in a small well named Jar
> (e.g. lucene-lang-de.jar) of its own, or build all languages into a
> single Jar (e.g. lucene-lang.jar).
>
> I am not sure about the best way to let Lucene users access these
> Jars, other than (manually?) including them in the release directory
> whenever Lucene is released.
> But we don't need to know this right now, getting started with the
> code is more important.
>
> Also, this would require that existing language-specific code (de,
> ru, default English) eventually be moved out of Lucene core into this
> language pack. This should probably include English, too, if other
> code doesn't have dependencies on it, and compiles without it.
>
> I wonder what other developers' thoughts are...
>
> Otis
>
>
> --- Eric Isakson wrote:
> > Hi All,
> >
> > I want to volunteer to help get language modules organized into the
> > CVS and builds.
> >
> > I've been lurking on the lists here for a couple months and working
> > with and getting familiar with Lucene. I'm investigating the use of
> > lucene to support our help system's fulltext search requirements. I
> > have to build indices for multiple languages. I just poked around
> > the CVS archives and found only the German, Russian and
> > standard(English) analyzers in the core and nothing in the sandbox.
> > In the list archives I've found many references to folks using
> > Lucene for several other languages. I did find the CJKTokenizer,
> > Dutch and French analyzers and have put those into my tests. Is
> > there somewhere these analyzers are organized that I might get a
> > hold of the sources for other languages to build into my toolset?
> > There were a couple mentioned that several of you appear to be using
> > that I can't find the sources for (most notably
> > http://www.halyava.ru/do/org.apache.lucene.analysis.zip
> > which gives a "Cannot find server" error).
> >
> > In order to meet the requirements for my product these are the
> > languages I have to support:
> >
> > Must Support
> > ------------
> > English
> > Japanese
> > Chinese
> > Korean
> > French
> > German
> > Italian
> > Polish
> >
> > Not Sure Yet
> > ------------
> > Czech
> > Danish
> > Hebrew
> > Hungarian
> > Russian
> > Spanish
> > Swedish
> >
> > I understand the issues that were raised about putting language
> > modules in the core and then not being able to support them, but it
> > seems they have not been put anywhere. I would be willing to try
> > and get them into a central place that people can access them or
> > help someone that is already working on that.
> > I can't commit today to being able to maintain or bugfix
> > contributions, but should my company adopt Lucene as our search
> > engine (which seems likely at this point) I'll do what I can to
> > contribute back any fixes we make. I also have a personal interest
> > in the project since I've found Lucene quite interesting to be
> > working with and I've enjoyed learning about internationalizing
> > java apps.
> >
> > I'll volunteer to help gather and organize these somewhere if I
> > were given committer rights to the appropriate area and folks would
> > be willing to send me their language modules.
> >
> > I recall some discussion about moving language modules out of the
> > core, but I don't think any decisions were made about where to put
> > them (perhaps this is why they aren't in the CVS at all). I was
> > thinking perhaps give each language a sandbox project or create
> > language packages in the core build that could be enabled via
> > settings in the build.properties file. Using the build.properties
> > file could allow us to create a jar for each language during the
> > core build so folks could install just the language modules they
> > want and if a language module starts breaking due to changes in the
> > core it could easily be turned off until fixes were made to that
> > module. I can start working on a setup like this in my local source
> > tree next week using the existing language modules in the core if
> > you all think this would be a good approach. If not, does anyone
> > have a proposal for where these belong so we can get some movement
> > on getting them committed to CVS?
> >
> > Regards,
> > Eric
> > --
> > Eric D. Isakson        SAS Institute Inc.
> > Application Developer  SAS Campus Drive
> > XML Technologies       Cary, NC 27513
> > (919) 531-3639         http://www.sas.com
> >
>
> __________________________________________________
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.