lucene-java-user mailing list archives

From "Ryan Ackley" <ryanack...@gmail.com>
Subject Re: index word files ( doc )
Date Mon, 26 Mar 2007 09:43:06 GMT
I think the 512-byte requirement is a limitation of POIFS, though I could
be wrong. Have you tried opening the file with POIFS alone?
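
For reference, the padding workaround described in the quoted message below can be sketched in plain Java. This is only a rough illustration, not part of POI or any text extractor: the class name PadToBlock and the method padToBlock are made up for this example, and it assumes the only problem with the file is that its length is not a multiple of 512 bytes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: pads a file with trailing nulls up to the next
// 512-byte boundary before it is handed to an OLE2 parser.
public class PadToBlock {
    // Block size mentioned in the thread; assumed fixed here.
    static final int BLOCK_SIZE = 512;

    /** Returns the input unchanged if already aligned, otherwise a zero-padded copy. */
    static byte[] padToBlock(byte[] data) {
        int remainder = data.length % BLOCK_SIZE;
        if (remainder == 0) {
            return data;
        }
        // Arrays.copyOf fills the extra space with zero bytes.
        return Arrays.copyOf(data, data.length + (BLOCK_SIZE - remainder));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PadToBlock <input.doc> <output.doc>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), padToBlock(original));
    }
}
```

The padded copy is written to a separate file so the original is left untouched, and the result can then be passed to the parser as usual.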

On 3/26/07, Antony Bowesman <adb@teamware.com> wrote:
> Ryan Ackley wrote:
> > Yes I do have plans for adding fast save support and support for more
> > file formats. The time frame for this happening is the next couple of
> > months.
>
> That would be good when it comes.  It would be nice if it could handle a 'brute
> force' mode where in the event of problems, it will just allow the text it can
> find to be extracted.  Currently if there is an Exception, I just run a raw
> strings parser on the file to fetch what I can.
>
> One problem I found is that files not padded to 512 byte blocks cannot be
> parsed, but Word reads them happily.  They seem to be valid in other respects,
> i.e. they have the 1Table, Root Entry and other recognisable parts.  Padding the
> file to 512 byte boundary with nulls parses OK.
>
> Antony
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
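
The "raw strings parser" fallback mentioned in the quoted message could look roughly like the sketch below. It is not Antony's actual code. The class name RawStrings, the method extract, and the minimum run length of 4 are assumptions made for illustration, and it only recognises runs of printable 8-bit ASCII, so text stored as UTF-16 inside the file would be missed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical fallback: collect runs of printable ASCII from a binary file,
// similar in spirit to the Unix "strings" tool.
public class RawStrings {
    static boolean isPrintable(int b) {
        return (b >= 0x20 && b <= 0x7E) || b == '\t';
    }

    /** Returns every run of printable bytes that is at least minLength long. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte value : data) {
            int b = value & 0xFF;
            if (isPrintable(b)) {
                current.append((char) b);
            } else {
                // A non-printable byte ends the current run.
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);
            }
        }
        // Flush a run that reaches the end of the data.
        if (current.length() >= minLength) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: RawStrings <file>");
            return;
        }
        for (String run : extract(Files.readAllBytes(Paths.get(args[0])), 4)) {
            System.out.println(run);
        }
    }
}
```

In an indexing pipeline this would only run inside the catch block for the parser's exception, so that well-formed files still go through the proper extractor.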
