lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Weiwei Wang <ww.wang...@gmail.com>
Subject Re: Recover special terms from StandardTokenizer
Date Sun, 13 Dec 2009 13:38:02 GMT
Thank you very much, Uwe, I found the problem.

2009/12/13 Uwe Schindler <uwe@thetaphi.de>

> MappingCharFilter definitely preserves the offsets from the original
> reader.
> Yo can verify that for your case with Lucene’s testcase
> TestMappingCharFilter in the source distribution @
> /src/test/org/apache/lucene/analysis/TestMappingCharFilter.java:
>
> public void test2to4() throws Exception {
>  CharStream cs = new MappingCharFilter( normMap, new StringReader( "ll" )
> );
>  TokenStream ts = new WhitespaceTokenizer( cs );
>  assertTokenStreamContents(ts, new String[]{"llll"}, new int[]{0}, new
> int[]
> {2});
> }
>
> So there is everything correct. I tried this test also with
> StandrdTokenizer
> instead of WhiteSpaceTokenizer - it works and asserts the correct offsets.
> You should debug through the incrementToken()/CharFilter calls and verify
> where your offsets change.
>
> I cannot help more.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > Sent: Sunday, December 13, 2009 12:51 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Recover special terms from StandardTokenizer
> >
> > LowercaseCharFilter is necessary, as in the MappingCharFilter we need to
> > provide a NormalizeCharMap. We lowercase the stream so as we only provide
> > lowercase maps in the NormalizeCharMap, e.g. we provide map
> > (c++-->cplusplus) instead of (c++-->cplusplus) and (C++-->cplusplus).
> >
> > C++ is only an example we want to fix, in the future we may add more such
> > special terms
> >
> > the code for LowercaseCharFilter is as follows:
> > package analysis;
> >
> > import java.io.IOException;
> > import java.io.Reader;
> >
> > import org.apache.lucene.analysis.BaseCharFilter;
> > import org.apache.lucene.analysis.CharReader;
> > import org.apache.lucene.analysis.CharStream;
> >
> >
> > public class LowercaseCharFilter extends BaseCharFilter
> > {
> >
> >     public LowercaseCharFilter(CharStream in)
> >     {
> >     super(in);
> >     }
> >
> >     public LowercaseCharFilter(Reader in)
> >     {
> >     super(CharReader.get(in));
> >     }
> >     @Override
> >     public int read() throws IOException
> >     {
> >     return Character.toLowerCase(input.read());
> >     }
> >     @Override
> >     public int read(char[] cbuf, int off, int len) throws IOException {
> >     int ret = input.read(cbuf, off, len);
> >     if(ret!=-1)
> >     {
> >         for(int i=off; i<off+ret; i++)
> >         cbuf[i] = Character.toLowerCase(cbuf[i]);
> >     }
> >     return ret;
> >     }
> > }
> >
> >
> > Currently RosaMappingCharFilter is inherited from MappingCharFilter and
> > nothing is changed(i was planning to override addOffCorrectMap to fix my
> > problem, but it didn't work)
> >
> >
> > 2009/12/13 Uwe Schindler <uwe@thetaphi.de>
> >
> > > I think your problem is theLowercaseCharFilter that does not pass
> > > correctOffset() to the underying CharFilter. Does it work better
> without
> > > your LowerCaseCharFilter (which is duplicate because there is already a
> > > LowerCaseFilter in the Tokenizer chain).
> > >
> > > As you are only looking for "c++", just also add a mapping for "C++"
> and
> > > you
> > > are done, why lowercasing all because of one char?
> > >
> > > And what's RosaMappingCharFilter? A pink one? *g*
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: uwe@thetaphi.de
> > >
> > > > -----Original Message-----
> > > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > > Sent: Sunday, December 13, 2009 12:23 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Re: Recover special terms from StandardTokenizer
> > > >
> > > > thanks, Uwe.
> > > > Maybe i was not very clear. My situation is like this:
> > > > Analyzer:
> > > >    NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > >    RECOVERY_MAP.add("c++","cplusplus$");
> > > >     CharFilter filter = new LowercaseCharFilter(reader);
> > > >     filter = new RosaMappingCharFilter(RECOVERY_MAP,filter);
> > > >     StandardTokenizer tokenStream = new
> > > > StandardTokenizer(Version.LUCENE_30,
> > > > filter);
> > > >     tokenStream.setMaxTokenLength(maxTokenLength);
> > > >     TokenStream result = new StandardFilter(tokenStream);
> > > >     result = getStopFilter(result);
> > > >     result = new SnowballFilter(result, STEMMER);
> > > > Analyze c++c++, return
> > > > (0,9)  [cplusplus]
> > > > (10,19)  [cplusplus]
> > > > the two numbers in th**e brackets are offsets.
> > > >
> > > > So in the searching process when i want to hight the search keyword
> > c++
> > > > with
> > > > the same analyzer, exception will be thrown because the string i
> > stored
> > > > are
> > > > c++c++ not cpluspluscplusplus(actually, i should not change the
> > original
> > > > string when storing them, otherwise it will confuse the users).
> > > >
> > > > I hope the analyzer can give result like this
> > > > (0,3) [cplusplus]
> > > > (3,6) [cplusplus]
> > > > then the Hilighter will works fine.
> > > >
> > > > So how can I achieve this result?
> > > >
> > > > 2009/12/13 Uwe Schindler <uwe@thetaphi.de>
> > > >
> > > > > MappingCharFilter preserves the offsets in the stream *before*
> > > > filtering.
> > > > > So
> > > > > if you store the original string (without c++ replaced) in a stored
> > > > field
> > > > > you can highlight using the given offstes. The highlighter must use
> > > > again
> > > > > the same analyzer or use FastVectorHighlighter.
> > > > >
> > > > > -----
> > > > > Uwe Schindler
> > > > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > > > http://www.thetaphi.de
> > > > > eMail: uwe@thetaphi.de
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Weiwei Wang [mailto:ww.wang.cs@gmail.com]
> > > > > > Sent: Sunday, December 13, 2009 11:43 AM
> > > > > > To: java-user@lucene.apache.org
> > > > > > Subject: Re: Recover special terms from StandardTokenizer
> > > > > >
> > > > > > Problem solved. Now another problem comes.
> > > > > >
> > > > > >
> > > > > > As I want to use Highlighter in my system, the token offset
is
> > > > incorrect
> > > > > > after the MappingCharFilter is used.
> > > > > >
> > > > > > Koji, do you known how to fix the offset problem?
> > > > > >
> > > > > > On Sun, Dec 13, 2009 at 11:12 AM, Weiwei Wang
> > <ww.wang.cs@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > I use Luke to check the result and find only c exists as
a
> term,
> > no
> > > > > > > cplusplus found in the index
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Dec 13, 2009 at 10:34 AM, Weiwei Wang
> > > > > > <ww.wang.cs@gmail.com>wrote:
> > > > > > >
> > > > > > >> Thanks, Koji, I followed your advice and change my
analyzer as
> > > > shown
> > > > > > >> below:
> > > > > > >> NormalizeCharMap RECOVERY_MAP = new NormalizeCharMap();
> > > > > > >> RECOVERY_MAP.add("c++","cplusplus$");
> > > > > > >> CharFilter filter = new LowercaseCharFilter(reader);
> > > > > > >> filter = new MappingCharFilter(RECOVERY_MAP,filter);
> > > > > > >> StandardTokenizer tokenStream = new
> > > > > > StandardTokenizer(Version.LUCENE_30,
> > > > > > >> filter);
> > > > > > >> tokenStream.setMaxTokenLength(maxTokenLength);
> > > > > > >> TokenStream result = new StandardFilter(tokenStream);
> > > > > > >> result = new LowerCaseFilter(result);
> > > > > > >> result = new StopFilter(enableStopPositionIncrements,
result,
> > > > > stopSet);
> > > > > > >> result = new SnowballFilter(result, STEMMER);
> > > > > > >>
> > > > > > >> I use the same analyzer in the search side. As you
know, this
> > > > analyzer
> > > > > > can
> > > > > > >> token c++ as cplusplus, for this reason, it seems I
can search
> > c++
> > > > > with
> > > > > > >> the same analyzer because it is also tokenized as cplusplus.
> > > > > > >>
> > > > > > >> I tested it on as string c++c++, however, when i search
c++ on
> > the
> > > > > > built
> > > > > > >> index, nothing is returned.
> > > > > > >>
> > > > > > >>  I do not know what's wrong with my code. Waiting for
your
> > replay
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On Fri, Dec 11, 2009 at 9:43 PM, Weiwei Wang
> > > > > > <ww.wang.cs@gmail.com>wrote:
> > > > > > >>
> > > > > > >>> Thanks, Koji
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On Fri, Dec 11, 2009 at 7:59 PM, Koji Sekiguchi
> > > > > > <koji@r.email.ne.jp>wrote:
> > > > > > >>>
> > > > > > >>>> MappingCharFilter can be used to convert c++
to cplusplus.
> > > > > > >>>>
> > > > > > >>>> Koji
> > > > > > >>>>
> > > > > > >>>> --
> > > > > > >>>> http://www.rondhuit.com/en/
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> Anshum wrote:
> > > > > > >>>>
> > > > > > >>>>> How about getting the original token stream
and then
> > converting
> > > > c++
> > > > > > to
> > > > > > >>>>> cplusplus or anyother such transform. Or
perhaps you might
> > look
> > > > at
> > > > > > >>>>> using/extending(in the non java sense)
some other
> tokenized!
> > > > > > >>>>>
> > > > > > >>>>> --
> > > > > > >>>>> Anshum Gupta
> > > > > > >>>>> Naukri Labs!
> > > > > > >>>>> http://ai-cafe.blogspot.com
> > > > > > >>>>>
> > > > > > >>>>> The facts expressed here belong to everybody,
the opinions
> > to
> > > > me.
> > > > > > The
> > > > > > >>>>> distinction is yours to draw............
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> On Fri, Dec 11, 2009 at 11:00 AM, Weiwei
Wang <
> > > > > ww.wang.cs@gmail.com>
> > > > > > >>>>> wrote:
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>> Hi, all,
> > > > > > >>>>>>    I designed a ftp search engine based
on Lucene. I did a
> > few
> > > > > > >>>>>> modifications to the StandardTokenizer.
> > > > > > >>>>>> My problem is:
> > > > > > >>>>>>  C++ is tokenized as c from StandardTokenizer
and I want
> to
> > > > > recover
> > > > > > it
> > > > > > >>>>>> from
> > > > > > >>>>>> the TokenStream from StandardTokenizer
> > > > > > >>>>>>
> > > > > > >>>>>> What should I do?
> > > > > > >>>>>>
> > > > > > >>>>>> --
> > > > > > >>>>>> Weiwei Wang
> > > > > > >>>>>> Alex Wang
> > > > > > >>>>>> 王巍巍
> > > > > > >>>>>> Room 403, Mengmin Wei Building
> > > > > > >>>>>> Computer Science Department
> > > > > > >>>>>> Gulou Campus of Nanjing University
> > > > > > >>>>>> Nanjing, P.R.China, 210093
> > > > > > >>>>>>
> > > > > > >>>>>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > >
> --------------------------------------------------------------------
> > -
> > > > > > >>>> To unsubscribe, e-mail: java-user-
> > unsubscribe@lucene.apache.org
> > > > > > >>>> For additional commands, e-mail:
> > > java-user-help@lucene.apache.org
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Weiwei Wang
> > > > > > >>> Alex Wang
> > > > > > >>> 王巍巍
> > > > > > >>> Room 403, Mengmin Wei Building
> > > > > > >>> Computer Science Department
> > > > > > >>> Gulou Campus of Nanjing University
> > > > > > >>> Nanjing, P.R.China, 210093
> > > > > > >>>
> > > > > > >>> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Weiwei Wang
> > > > > > >> Alex Wang
> > > > > > >> 王巍巍
> > > > > > >> Room 403, Mengmin Wei Building
> > > > > > >> Computer Science Department
> > > > > > >> Gulou Campus of Nanjing University
> > > > > > >> Nanjing, P.R.China, 210093
> > > > > > >>
> > > > > > >> Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Weiwei Wang
> > > > > > > Alex Wang
> > > > > > > 王巍巍
> > > > > > > Room 403, Mengmin Wei Building
> > > > > > > Computer Science Department
> > > > > > > Gulou Campus of Nanjing University
> > > > > > > Nanjing, P.R.China, 210093
> > > > > > >
> > > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Weiwei Wang
> > > > > > Alex Wang
> > > > > > 王巍巍
> > > > > > Room 403, Mengmin Wei Building
> > > > > > Computer Science Department
> > > > > > Gulou Campus of Nanjing University
> > > > > > Nanjing, P.R.China, 210093
> > > > > >
> > > > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > > > >
> > > > >
> > > > >
> --------------------------------------------------------------------
> > -
> > > > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Weiwei Wang
> > > > Alex Wang
> > > > 王巍巍
> > > > Room 403, Mengmin Wei Building
> > > > Computer Science Department
> > > > Gulou Campus of Nanjing University
> > > > Nanjing, P.R.China, 210093
> > > >
> > > > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Weiwei Wang
> > Alex Wang
> > 王巍巍
> > Room 403, Mengmin Wei Building
> > Computer Science Department
> > Gulou Campus of Nanjing University
> > Nanjing, P.R.China, 210093
> >
> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


-- 
Weiwei Wang
Alex Wang
王巍巍
Room 403, Mengmin Wei Building
Computer Science Department
Gulou Campus of Nanjing University
Nanjing, P.R.China, 210093

Homepage: http://cs.nju.edu.cn/rl/weiweiwang

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message