lucene-dev mailing list archives

From Joey Lawrance <>
Subject CJK Support for HTMLParser.jj
Date Mon, 23 Aug 2004 11:46:13 GMT
So, I needed the ability to parse Japanese HTML documents using 
lucene-ja for my job. I was frustrated when I got HTML parser errors on 
valid Japanese HTML. I dug a little, and I was excited to see that the 
StandardTokenizer.jj grammar already had CJK ranges defined in it. I 
copied the CJK ranges from StandardTokenizer.jj into HTMLParser.jj, 
added CJK as a token type, and voilà! I can now parse Japanese HTML 
documents using lucene-ja. Believe me, lucene-ja is very crippled 
without this ability!
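
To illustrate what "CJK ranges" means in practice, here is a minimal Java 
sketch. It is not the grammar change itself and does not reproduce the 
ranges from StandardTokenizer.jj: the class CjkRangeSketch and its methods 
are hypothetical, and the check relies on the standard 
Character.UnicodeBlock classification for Hiragana, Katakana, and CJK 
Unified Ideographs as a rough stand-in for a character-range test.

```java
public class CjkRangeSketch {

    // Hypothetical helper: true if the character belongs to one of three
    // common Japanese-relevant Unicode blocks. The real grammar lists
    // explicit character ranges; this only approximates that idea.
    public static boolean isCjk(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    // Counts the characters in the text that isCjk() accepts.
    public static int countCjk(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isCjk(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "\u65e5\u672c\u8a9e" is the word for "Japanese language",
        // written with escapes so the source file stays ASCII-only.
        String sample = "<p>\u65e5\u672c\u8a9e text</p>";
        System.out.println(countCjk(sample)); // prints 3
    }
}
```

A parser whose word token only admits ASCII letters would have no token 
to match those three characters, which is the kind of failure described 
above.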

I've attached the HTMLParser.jj file that successfully parses Japanese 
HTML for indexing. It is derived from the lucene-1.4-final version of 
HTMLParser.jj, and I've attached a patch (against lucene-1.4-final). 
Obviously, I don't have CVS commit access (and I'm not requesting it), 
but I'd like to contribute this patch back to Lucene as it has been 
absolutely invaluable for my work, and this is my way of saying "thank 
you!" Let me know if a patch against CVS would be more convenient, or if 
this patch is even worthy of being included in Lucene. I certainly think 
it is. :-)

