lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Clas Rydergren" <cl...@hotmail.com>
Subject Re: Modify the StandardAnalyzer
Date Sat, 06 Sep 2003 11:15:09 GMT

Hi again,

incze I am not sure I follow you. Do you mean that I should index with the 
SimpleAnalyzer and searching with the StandardAnalyzer? That is not an 
option for me since SimpleAnalyzer do not index number/digits the way I 
would like to (however, the StandardAnalyzer does!)


I found that modifying the StandardTokenizer.jj would be possible in order 
to modify the StandardAnalyzer to my prefered behaviour. When finished 
modifying the .jj-file, I run it trough JavaCC to get the .java-files. 
However those .java-files are not possible to compile together with the 
current distribution because of problems with the exception handling (see  
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg01403.html )

Next try was to compile Lucene (nightly snapshot) from scratch using Ant. 
That failed with errors with the "class collosions":

    [javac] Compiling 128 source files to 
/home/clryd/lucene/jakarta-lucene/bin/classes
    [javac] 
/home/clryd/lucene/jakarta-lucene/bin/src/org/apache/lucene/analysis/standard/StandardTokenizer.java:14:

duplicate class: org.apache.lucene.analysis.standard.StandardTokenizer
    [javac] public class StandardTokenizer extends 
org.apache.lucene.analysis.Tokenizer implements StandardTokenizerConstants {
    [javac]        ^
    [javac] 
/home/clryd/lucene/jakarta-lucene/bin/src/org/apache/lucene/analysis/standard/StandardTokenizerTokenManager.java:5:

duplicate class: 
org.apache.lucene.analysis.standard.StandardTokenizerTokenManager
    [javac] public class StandardTokenizerTokenManager implements 
StandardTokenizerConstants
    [javac]        ^

So now my question is, where do I find a Lucene-verion which is possible to 
compile? Where do I find the source CVS för Lucene 1.2 r3, or something 
similar?

cheers
/Clas


> > Hi,
> >
> > I have been experimenting with Lucene for a few hours, and now I'm 
>looking
> > for a solution to this:
> >
> > When using the SimpleAnalyzer for indexing text, data like 
>www.hotmail.com
> > seem to be indexed as www, hotmail and com which mean that a search for
> > "hotmail" will return a record. This is the behavior I am looking for!
> > However, since SimpleAnalyzer do not index numbers by default, I would 
>like
> > to use the StandardAnalyzer. But, Standardanalyzer do not split the 
>input
> > stream at ".".
> >
> > Ideally I should propably make my own analyser, but that seems to be a 
>bit
> > complicated to me :(. Which is the simplest possible modification that I
> > need to make to the Lucene source to make the StandardAnalyzer split, 
>for
> > example web-addresses, at "." into separately indexed words?
> >
> > Can this be made by modifications to the StandardTokenizer.jj? How? What 
>is
> > the easiest way of getting such modification into the "compiled" Lucene? 
>Is
> > there a need for recompiling everything?
> >
> > Appreciate all help!
> >
> > regards
> > clas
>
>You can stack up the two analyzers, first run the simple then the standard
>on the poutput.
>
>incze
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>

_________________________________________________________________
Tired of spam? Get advanced junk mail protection with MSN 8. 
http://join.msn.com/?page=features/junkmail


Mime
View raw message