lucene-java-user mailing list archives

From "Kevin A. Burton" <>
Subject Re: How to index Windows' Compiled HTML Help (CHM) Format
Date Sun, 12 Dec 2004 02:10:48 GMT
Tom wrote:

>Does anybody know how to index chm-files? 
>A possible solution I know is to convert chm-files to pdf-files (there are
>converters available for this job) and then use the known tools (e.g.
>PDFBox) to index the content of the pdf files (which contain the content of
>the chm-files). Are there any tools which can directly grab the textual
>content out of the (binary) chm-files?
>I think chm-file indexing-support is really a big missing piece in the
>currently supported indexable filetype-collection (XML, HTML, PDF,
>MSWord-DOC, RTF, Plaintext). 
I believe it's just a Microsoft .cab file with an index.html inside it... 
am I right?

Just uncompress it.

The problem is that the HTML within them isn't anywhere NEAR standard, and 
you can't really give it to the user in the UI...
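
For what it's worth, here is a rough sketch of the text-extraction step in 
Python. It assumes the .chm has already been unpacked into ordinary .html 
files by some external tool (that step is not shown), and the names 
TextExtractor and html_to_text are made up for illustration. It leans on the 
standard library's lenient HTML parser, which tolerates unclosed and 
upper-case tags, and returns plain text you could then put into an indexed 
field. Treat it as a starting point, not a tested CHM solution.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes from sloppy HTML, skipping script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag names, so <SCRIPT> matches too.
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    sample = "<HTML><BODY><P>Hello <B>world<P>unclosed</BODY>"
    print(html_to_text(sample))
```

Note that chunks are joined with a space, so a word split by an inline tag 
(e.g. "wor<b>ld</b>") comes out as two tokens. That is usually acceptable 
for indexing, but it is a design shortcut, not a requirement.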



Kevin A. Burton, Location - San Francisco, CA
       AIM/YIM - sfburtonator,  Web -
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
