lucene-java-user mailing list archives

From "Koji Sekiguchi" <koji.sekigu...@m4.dion.ne.jp>
Subject RE: Highlighter apply to Japanese
Date Tue, 06 Sep 2005 09:16:26 GMT
Hi Chris,

Thank you for your info.
With CJKAnalyzer, the diagnostics are as follows:

	pos	start	end
	Inc	Ofst	Ofst
[Aa]	1	0	2
[aa]	1	1	3
[aB]	1	2	4
[BC]	1	3	5
[Cc]	1	4	6
[cD]	1	5	7
[Dd]	1	6	8
[dE]	1	7	9
[EF]	1	8	10
[FG]	1	9	11
[Gg]	1	10	12
[gH]	1	11	13
[Hh]	1	12	14
[hI]	1	13	15
[Ii]	1	14	16
[iJ]	1	15	17
[JK]	1	16	18
[Kk]	1	17	19
[kL]	1	18	20
[LM]	1	19	21
[Mm]	1	20	22
[mN]	1	21	23

<B>AaaBCcDdEFGgHhIiJKkLMmN</B>

CJKAnalyzer produces a TokenStream in which all tokens overlap,
as Mark pointed out.
But JapaneseAnalyzer produces a stream of tokens that
do not overlap, as I showed in my previous mail.
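
To illustrate the overlap, here is a minimal plain-Java sketch. It does not
use Lucene; bigramOffsets and group are hypothetical helpers written only for
this mail. It assumes the highlighter treats tokens whose offsets overlap as
one group, which is how I read Mark's point, so treat it as an illustration
rather than the Highlighter's actual code.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSketch {

    // Emit overlapping two-character tokens as {startOffset, endOffset}
    // pairs, mimicking the bigram listing above (not CJKAnalyzer itself).
    static List<int[]> bigramOffsets(String text) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            out.add(new int[] {i, i + 2});
        }
        return out;
    }

    // Assumed grouping rule: a token joins the current group when it starts
    // before the group's end offset; otherwise it opens a new group.
    static List<int[]> group(List<int[]> tokens) {
        List<int[]> groups = new ArrayList<>();
        for (int[] t : tokens) {
            int[] last = groups.isEmpty() ? null : groups.get(groups.size() - 1);
            if (last != null && t[0] < last[1]) {
                last[1] = Math.max(last[1], t[1]);
            } else {
                groups.add(new int[] {t[0], t[1]});
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        String text = "AaaBCcDdEFGgHhIiJKkLMmN";
        for (int[] g : group(bigramOffsets(text))) {
            // Prints a single line: 0-23 AaaBCcDdEFGgHhIiJKkLMmN
            System.out.println(g[0] + "-" + g[1] + " " + text.substring(g[0], g[1]));
        }
    }
}
```

Under that assumption every bigram starts inside the previous one, so the
whole text collapses into one group (offsets 0 to 23), which matches the
output below the listing. Tokens that merely touch, such as 0-1 followed
by 1-8, stay in separate groups.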

BTW, I couldn't find CJKHighlighter or CJKHighlighterAnalyzer in
the sandbox...

Koji

> -----Original Message-----
> From: Chris Lu [mailto:chris.lu@gmail.com] 
> Sent: Tuesday, September 06, 2005 3:53 PM
> To: java-user@lucene.apache.org
> Subject: Re: Highlighter apply to Japanese
> 
> 
> Hi, Koji,
> 
> I had the same problem as you. This is because CJK n-gram analysis
> differs from single-character analysis.
> 
> My workaround is to use CJKHighlighter and
> CJKHighlightAnalyzer in the sandbox.
> 
> -- 
> Chris Lu
> ------------
> Lucene Search RAD on Any Database
> http://www.dbsight.net
> 
> 
> On 9/5/05, Koji Sekiguchi <koji.sekiguchi@m4.dion.ne.jp> wrote:
> > Hi again,
> > 
> > I'm using Highlighter to highlight terms in Japanese text,
> > but I cannot get the output I want.
> > 
> > If I use StandardAnalyzer or SnowballAnalyzer with English,
> > getBestFragment() returns the expected output:
> > 
> > Sample: (SnowballAnalyzer)
> > Text: A meeting will be held in the City Hall
> > TokenStream:
> > [a][meet][will][be][held][in][the][citi][hall]
> > Query Text: meet
> > Output: A <B>meeting</B> will be held in the City Hall
> > 
> > But if I use JapaneseAnalyzer, the most popular Analyzer in Japan
> > for getting a TokenStream from Japanese text, to highlight
> > Japanese text with Highlighter, the whole text is highlighted:
> > 
> > Sample: (JapaneseAnalyzer)
> > Text: AMeetingWillBeHeldInTheCityHall
> > TokenStream:
> > [A][Meeting][Will][Be][Held][In][The][City][Hall]
> > Query Text: Meeting
> > Output: <B>AMeetingWillBeHeldInTheCityHall</B>
> > 
> > Please note that I use the Latin alphabet for the Text in the second
> > sample because most readers of this mailing list can read it; in reality,
> > I used Japanese characters for the Text. As you can see,
> > JapaneseAnalyzer, which uses a Japanese dictionary in the background
> > to extract tokens from the text stream, recognizes the tokens and
> > produces a TokenStream.
> > But highlighter.getBestFragment() highlighted the whole text.
> > 
> > Do I need to implement a Fragmenter to highlight tokens correctly
> > for Japanese text?
> > 
> > Thanks in advance,
> > 
> > Koji
> > 
> > 
> > 
> > 
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> > 
> >
> 
> 




