lucene-java-user mailing list archives

From mchaput
Subject Bigram search (help!)
Date Mon, 21 Apr 2003 20:56:09 GMT
Hi all,

Well, I got around my previous problem by switching to a different HTML 

Now I have an even more subtle and frustrating problem! :(

I'm using Che Dong's CJKTokenizer/CJKAnalyzer to do bigram tokenizing of 
Japanese text, with an unpatched Lucene 1.3RC1.

The tokenizer is working; here's the debug output of the tokens as they 
go by (using a WinDVD help file as a test):


The terms are showing up properly in the index (dumping the terms from 
the index shows the character pairs are there).

When I create a query with search string \u30ba\u30fc\u30e0 I get 
something reasonable:

contents:"\u30ba\u30fc \u30fc\u30e0 "

So far so good, *BUT*, searching for this query gives no results! As you 
can see from the token stream above, this query SHOULD work, but it doesn't.
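
For reference, the overlapping-bigram split that turns the three-character search string into the two terms of the phrase query can be sketched in plain Java, independent of Lucene. This is only an illustration of the splitting rule: the class and method names are hypothetical, it is not CJKTokenizer's actual code, and it ignores the tokenizer's handling of ASCII and mixed text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Lucene or CJKTokenizer.
class BigramSketch {

    // Splits text into overlapping two-character terms: C1C2, C2C3, ...
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<String>();
        if (text.length() == 1) {
            // Assumption: a lone character is emitted as a single term.
            terms.add(text);
            return terms;
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // The search string from the message, written with Unicode escapes.
        for (String term : bigrams("\u30ba\u30fc\u30e0")) {
            System.out.println(term);
        }
    }
}
```

For the search string \u30ba\u30fc\u30e0 this yields the two terms \u30ba\u30fc and \u30fc\u30e0, the same pair that appears in the parsed query.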

I'm at a loss. Can anyone think of what might be going wrong?

Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7    |   (416) 874-8268
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter
