lucene-java-user mailing list archives

From "Eric Isakson" <Eric.Isak...@sas.com>
Subject RE: languages supported by lucene 1.2.1 in eclipse help system
Date Tue, 27 Apr 2004 22:42:13 GMT
I'm assuming what you have is an eclipse plugin that is making use of the eclipse help system.
If what you are doing is relying on the lucene eclipse plugin, you may want to look at the
help system anyway since it will give you an example of an eclipse plugin that is using the
lucene plugin.

The eclipse help system uses lucene but they have their own Analyzer class that uses BreakIterator
to identify tokens for languages other than english and german. The lucene eclipse plugin
just exports the lucene jar and the html parser so that any plugin that depends on the lucene
plugin (like the help system) will have those jars in the classpath of their plugin.

For english they use the PorterStemFilter with a StopAnalyzer and a stopword list. For german,
they use the GermanAnalyzer supplied by the lucene jar.

In the latest CVS at :pserver:anonymous@dev.eclipse.org:/home/eclipse

see the project in org.eclipse.help.base/src/org/eclipse/help/internal/search
in older eclipse versions see the R2_1_maintenance branch of org.eclipse.help/src/org/eclipse/help/internal/search

the class DefaultAnalyzer is the analyzer implementation for languages other than english
and german and WordTokenStream is where they use BreakIterator to break the content from the
reader into individual tokens.
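
As a rough illustration of the BreakIterator approach, here is a minimal sketch that uses only
the JDK. It is not the actual Eclipse code: the class name WordBreakSketch, the lower-casing,
and the letter-or-digit filter are my own assumptions about what such a tokenizer might do.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; this is NOT Eclipse's WordTokenStream.
class WordBreakSketch {

    // Splits text into lower-cased word tokens using the locale's word-boundary rules.
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
                end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String candidate = text.substring(start, end);
            // The iterator also reports runs of whitespace and punctuation;
            // keep only segments that begin with a letter or digit (an assumption).
            if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
                tokens.add(candidate.toLowerCase(locale));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! 42", Locale.ENGLISH));
    }
}
```

A real analyzer would wrap something like this in a Lucene TokenStream and read from a Reader
rather than a String. Because the boundary rules come from the locale passed in, the same code
path can serve any language the JDK has break rules for.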

The default Eclipse help system sets these extensions in the org.eclipse.help.base plugin:

<!-- Text Analyzers for search -->
   <extension
         id="org.eclipse.help.base.Analyzer_en"
         point="org.eclipse.help.base.luceneAnalyzer">
      <analyzer
            locale="en"
            class="org.eclipse.help.internal.search.Analyzer_en">
      </analyzer>
   </extension>
   <extension
         id="org.eclipse.help.base.Analyzer_de"
         point="org.eclipse.help.base.luceneAnalyzer">
      <analyzer
            locale="de"
            class="org.apache.lucene.analysis.de.GermanAnalyzer">
      </analyzer>
   </extension>

Look at the extension point schema in http://dev.eclipse.org/viewcvs/index.cgi/~checkout~/org.eclipse.help.base/schema/luceneAnalyzer.exsd?rev=HEAD&content-type=text/plain
for how to declare your own analyzer extensions. Beware though, I read that this affects all
help searches in that language, not just the ones for your plugin.

Also, since WordTokenStream is in a package with "internal" in its path, other plugins aren't
supposed to use that class. If you wanted your own analyzer based on that class and a stop
list, you would first need to talk the eclipse help developers into moving it out of the
internal package.

Most of this has been around for a while, so it is probably the same or very similar in previous
eclipse versions. You may need to poke around at the extension point schema in your eclipse
plugins directory to verify that the extension point works the same way in your version of
eclipse. I haven't used it in versions prior to 3.0M8.

Hope this is useful to you,
Eric

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Saturday, April 24, 2004 10:18 AM
To: Lucene Users List
Subject: Re: languages supported by lucene 1.2.1 in eclipse help system


That's no myth :)
Core Lucene (even the current version) does not include classes that know how to analyze/tokenize
text in languages other than English, Russian, and German.  However, take a look at the Snowball
contributions in Lucene Sandbox, where a few more analyzers are available, including those
for the CJK group of languages.

Otis


--- Jason Elliott <jason.elliott@peregrine.com> wrote:
> We have a plugin in our eclipse project named org.apache.lucene_1.2.1.
> It works quite well in that help system.
>  
> I've been notified that this particular version of the lucene search 
> analyzer searches well in German and English (GE), but not so well in 
> the rest of the languages on this planet.
>  
> I have several questions
> 1.	If it does not search very "well" in French, Italian and Japanese
> (FIJ), what does that really mean to a user conducting searches?
> a.	If this is a myth and the searches work the same in EFIG-J, please
> let me know that.
> b.	If this is not a myth, are there plugins that enable the search
> to work well in FIJ?
>  
> Thanks
> jason
>  
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org