lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Isakson" <Eric.Isak...@sas.com>
Subject RE: Bigram search (help!)
Date Wed, 23 Apr 2003 15:20:53 GMT
I had a similar problem but hadn't gotten around to tracking down the root cause. It only affected
queries with default operator AND. Are you doing a default AND on the query?

The CJKTokenizer is emitting a token at the end of the stream with an empty string for term
text sometimes. I worked around the problem by adding a StopFilter to my analyzer.

    public TokenStream tokenStream( String fieldName, Reader reader )
    {
        TokenStream result = new CJKTokenizer( reader );
		result = new StopFilter(result, new String[] {""}); // CJKTokenizer emitts a "" sometimes,
haven't been able to figure it out, so this is a workaround
        return result;
    }

Just curious what HTML parser did you switch to? I'm having trouble with the demo HTML parser
on Japanese content also.

Eric

-----Original Message-----
From: mchaput [mailto:mchaput@aw.sgi.com] 
Sent: Monday, April 21, 2003 4:56 PM
To: Lucene Users List
Subject: Bigram search (help!)


Hi all,

Well, I got around my previous problem by switching to a different HTML 
parser.

Now I have an even more subtle and frustrating problem! :(

I'm using Che Dong's CJKTokenizer/CJKAnalyzer to do bigram tokenizing of 
Japanese text, with an unpatched Lucene 1.3RC1.

The tokenizer is working, here's the debug output of the tokens as they 
go by (using a WinDVD help file as a test):

\u30ba\u30fc
\u30fc\u30e0
windvd
\u3067\u306f
\u4efb\u610f
\u610f\u306e
\u306e\u9078

The terms are showing up properly in the index (dumping the terms from 
the index shows the character pairs are there).

When I create a query with search string \u30ba\u30fc\u30e0 I get 
something reasonable:

contents:"\u30ba\u30fc \u30fc\u30e0 "
(class org.apache.lucene.search.PhraseQuery)

So far so good, *BUT*, searching for this query gives no results! As you 
can see from the token stream above, this query SHOULD work, but it doesn't.

I'm at a loss. Can anyone think of what might be going wrong?


-- 
                       |
Matt Chaput           |   A l i a s | W a v e f r o n t
Information Designer  |   210 King St. E. Toronto, ON, Canada M5A 1J7
mchaput@aw.sgi.com    |   (416) 874-8268
                       |
"A goddamned ray of sunshine all the goddamned time" --Sparkle Hayter


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message