lucene-java-user mailing list archives

From Erik Hatcher <e...@hatcher.net>
Subject Re: demo IndexHTML parser breaks unicode?
Date Sat, 25 Sep 2004 09:48:24 GMT
As for alternative HTML parsers, there are a few notable ones:

NekoHTML - Nutch uses it

JTidy - My <index> Ant task in the sandbox uses it

and HTMLParser

All of the above are surely far more battle-tested in production than 
Lucene's demo parser, and I'd be surprised if they did not correctly 
handle Unicode.

	Erik
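
A small sketch of the encoding mismatch discussed below, using only the
standard library (no Lucene or HTML parser involved). It assumes a modern JDK
(StandardCharsets and try-with-resources did not exist in 2004), and the
class name EncodingSketch and the helper readAll are made up for
illustration. It decodes the same UTF-8 bytes once as UTF-8 and once as
UTF-16, to show why wrapping a UTF-8 file in a UTF-16 reader produces text
that no longer matches the original:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the Lucene demo.
class EncodingSketch {

    // Reads the whole stream into a String, decoding bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample title with non-ASCII characters, encoded as UTF-8 bytes
        // to stand in for the contents of a UTF-8 HTML file.
        String title = "<title>Caf\u00e9 M\u00fcnchen</title>";
        byte[] utf8Bytes = title.getBytes(StandardCharsets.UTF_8);

        // Decode the same bytes with the matching and a mismatched charset.
        String asUtf8 = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String asUtf16 = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_16);

        System.out.println("round-trips as UTF-8:  " + title.equals(asUtf8));
        System.out.println("round-trips as UTF-16: " + title.equals(asUtf16));
    }
}
```

The point is only that the charset passed to the reader has to match the
bytes on disk. Whether the demo parser then handles the decoded characters
correctly is a separate question that this sketch does not address.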


On Sep 24, 2004, at 11:01 PM, Fred Toth wrote:

> Hi,
>
> Thanks for the tip, but that didn't work in my case. Presumably
> with this patch, and the changes in CVS, this makes the parser
> work with UTF-16. I can't really tell because the index appears
> now to be completely UTF-16 and I can't search for anything.
>
> My input is actually UTF-8 anyway, and if I patch all the streams
> to use UTF-8 instead of UTF-16, I get parser errors.
>
> So I'm stuck.
>
> Thanks for your help,
>
> Fred
>
> At 09:46 PM 9/24/2004, wallen@Cyveillance.com wrote:
>> In org.apache.lucene.demo.HTMLDocument you need to change the input stream
>> to use a different encoding.  Replace the fis with this:
>>
>> fis = new InputStreamReader(new FileInputStream(f), "UTF-16");
>>
>> -----Original Message-----
>> From: Fred Toth [mailto:ftoth@synernet.com]
>> Sent: Friday, September 24, 2004 9:25 PM
>> To: Lucene Users List
>> Subject: Re: demo IndexHTML parser breaks unicode?
>>
>>
>> Sorry, that didn't cure it.
>>
>> Again, anyone want to point me to the quickest replacement
>> HTML parser (that's unicode clean)?
>>
>> Thanks,
>>
>> Fred
>>
>> At 03:17 PM 9/24/2004, you wrote:
>> >On Friday 24 September 2004 19:58, Fred Toth wrote:
>> >
>> > > I've got unicode in my source HTML. In particular, within meta tags,
>> > > and it's getting broken by the indexer. Note that I'm not trying to
>> > > query on any of this, just store and retrieve document titles with
>> > > unicode characters.
>> >
>> >Please try again with the code from CVS, Christoph Goller committed a fix
>> >for this problem (at least I think it was this problem) 1-3 weeks ago.
>> >
>> >Regards
>> >  Daniel
>> >
>> >--
>> >http://www.danielnaber.de
>> >
>> >---------------------------------------------------------------------
>> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org

