lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Antony Bowesman <...@teamware.com>
Subject Re: index word files ( doc )
Date Mon, 26 Mar 2007 07:08:44 GMT
Ryan Ackley wrote:
> Yes I do have plans for adding fast save support and support for more
> file formats. The time frame for this happening is the next couple of
> months.

That would be good when it comes.  It would be nice if it could handle a 'brute 
force' mode where in the event of problems, it will just allow the text it can 
find to be extracted.  Currently if there is an Exception, I just run a raw 
strings parser on the file to fetch what I can.

One problem I found is that files not padded to 512 byte blocks cannot be 
parsed, but Words reads them happily.  They seem to be valid in other respects, 
i.e. they have the 1Table, Root Entry and other recognisable parts.  Padding the 
file to 512 byte boundary with nulls parses OK.

Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message