poi-dev mailing list archives

From Phil Varner <philvar...@gmail.com>
Subject Re: excel text extraction
Date Tue, 05 Jan 2010 19:41:53 GMT
Thanks for the suggestions. Comments inline.

>> 1) ExtractorFactory uses the ExcelExtractor rather than the
>> EventBasedExcelExtractor, which causes it to OOM for very large workbooks.
>>  I was wondering why this was and if it would be reasonable to change it.
>
> The default is to use the UserModel based ones, as they tend to be more
> accurate and more configurable. However, I don't see why we couldn't add a
> "boolean preferEventBased" flag to toggle this.
>
> That said, iirc we only have an event based extractor for .xls, so it might
> not make all that much difference given that all other files you throw at it
> will take loads of memory again :/

Yes, the event-based extractor is only for xls.  However, I think the
difference is that an Excel doc has the potential to be very large (in
my case, 65535x10), since it can be generated from another data source,
whereas ppt and doc files are usually human-created and much smaller. I
think it's probably less common to have a Word doc that takes up
multiple gigs in memory, but it's easy to do with Excel.  Anecdotally,
my customer has several XLS files that cause OOMs and no ppt or doc
files that do so. No easy answer here.

>> 2) Without an event-based extractor for OOXML workbooks, you can never
>> extract text from very large workbooks.  I implemented a hacky workaround to
>> read only the shared strings xml doc, but I was wondering if there was a
>> better way to do this or if there was any interest in polishing this into
>> something that could be part of POI.
>
> You could probably base something on XLSX2CSV which is largely event based

I got an OOM just loading the Package, so I don't think XLSX2CSV will work.
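
A minimal sketch of the "read only the shared strings" idea from point 2,
offered as an illustration and not as the actual workaround. It uses only
the JDK (java.util.zip and StAX), so the Package is never loaded. The class
name SharedStringsText and its method names are hypothetical, and the part
path xl/sharedStrings.xml is an assumption: it is the conventional
location, but strictly it is defined by the workbook's relationships.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: streams the shared string table of an .xlsx file
// without building any in-memory model of the workbook.
final class SharedStringsText {

    // Assumed conventional location of the shared string table part.
    private static final String SHARED_STRINGS = "xl/sharedStrings.xml";

    // Opens the .xlsx as a plain zip and streams only the shared strings part.
    static void extract(File xlsx, Appendable out)
            throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(xlsx)) {
            ZipEntry entry = zip.getEntry(SHARED_STRINGS);
            if (entry == null) {
                return; // no shared strings part at the assumed path
            }
            try (InputStream in = zip.getInputStream(entry)) {
                extractFromXml(in, out);
            }
        }
    }

    // Writes one line per <si> item, concatenating its <t> runs and
    // skipping phonetic runs (<rPh>).
    static void extractFromXml(InputStream xml, Appendable out)
            throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        boolean inText = false;
        int phoneticDepth = 0;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("rPh".equals(name)) {
                        phoneticDepth++;
                    } else if ("t".equals(name) && phoneticDepth == 0) {
                        inText = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("t".equals(name)) {
                        inText = false;
                    } else if ("rPh".equals(name)) {
                        phoneticDepth--;
                    } else if ("si".equals(name)) {
                        out.append('\n'); // end of one shared string item
                    }
                } else if (inText && (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA
                        || event == XMLStreamConstants.SPACE)) {
                    out.append(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: SharedStringsText <file.xlsx>");
            return;
        }
        extract(new File(args[0]), System.out);
    }
}
```

The trade-off is that this yields only text stored in the shared string
table, in table order and not in cell order; numbers, formula results and
inline strings are not included.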

>> 3) QuickButCruddyTextExtractor doesn't extend POIOLE2TextExtractor,
>> and I was wondering if there was a reason why.
>
> It predates the extractor interface by quite a bit, so I'm guessing it was
> forgotten :/

I tried making that change a few weeks ago, but there's a reason (which
I've now forgotten) that QBCTE can't implement POTE.  I also ran into a
bug with QBCTE and switched back to PowerpointTextExtractor, which has
worked fine.

>
> If you do fancy knocking up some patches for any of this, that'd be very
> much appreciated :)

Will do, once I'm sure they're stable.

--Phil


-- 

Machines might be interesting, but people are fascinating. -- K.P.


