lucene-java-user mailing list archives

From "Kevin A. Burton" <bur...@newsmonster.org>
Subject Re: How to index Windows' Compiled HTML Help (CHM) Format
Date Sun, 12 Dec 2004 02:10:48 GMT
Tom wrote:

>Hi,
>
>Does anybody know how to index chm-files? 
>A possible solution I know is to convert chm-files to pdf-files (there are
>converters available for this job) and then use the known tools (e.g.
>PDFBox) to index the content of the pdf files (which contain the content of
>the chm-files). Are there any tools which can directly grab the textual
>content out of the (binary) chm-files?
>
>I think chm-file indexing-support is really a big missing piece in the
>currently supported indexable filetype-collection (XML, HTML, PDF,
>MSWord-DOC, RTF, Plaintext). 
>  
>
I believe it's just a Microsoft .cab file with an index.html inside it...
am I right?

Just uncompress it.

The problem is that the HTML within them isn't anywhere NEAR standard, and
you can't really present it to the user in the UI...
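
For what it's worth, here is a rough sketch of the "uncompress, then index the HTML" idea. It assumes the .chm has already been unpacked into a directory by some external tool (that step is not shown), and it only pulls plain text out of the sloppy HTML so an indexer could consume it. The names TextExtractor, html_to_text and texts_from_extracted_chm are made up for illustration, and the cp1252 encoding is a guess about typical Windows help content, not something the format guarantees.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <SCRIPT> matches too.
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    """Return whitespace-normalized text from possibly malformed HTML."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered after the last tag
    return " ".join(" ".join(parser.chunks).split())


def texts_from_extracted_chm(root):
    """Yield (path, text) for each HTML page under an unpacked CHM directory.

    Assumption: pages are decoded as cp1252; undecodable bytes are replaced.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".htm", ".html"}:
            raw = path.read_bytes().decode("cp1252", errors="replace")
            yield path, html_to_text(raw)
```

Each (path, text) pair could then be handed to whatever indexer is in use, with the path stored so a hit can point back at the original page. The stdlib parser does not validate anything; it just tolerates unclosed and uppercase tags, which is all that is needed to get the text out.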

Kevin

-- 

    
Kevin A. Burton, Location - San Francisco, CA
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412

