poi-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Fisher <dfis...@jmlafferty.com>
Subject Re: POI 3.7 beta 2 "new XSSFWorkbook" very slow
Date Fri, 01 Oct 2010 21:32:36 GMT
Hi Ryan,

Back in May we had some discussions about performance in the POI-Dev list. Bryce Alcock analyzed
the situation and had a conversation with the XMLBeans about the issue. It has to do with
some inefficient code within the XMLBeans. http://www.mail-archive.com/dev@xmlbeans.apache.org/msg02207.html

Basically there are places where XMLBeans causes POI to be O(N**2). XMLBeans group suggests
XMLCursor but I am not sure that is an option.

Here is Yegor's comment within Bryce's POI-Dev thread. http://www.mail-archive.com/dev@poi.apache.org/msg05922.html
- there are many emails in the thread.

This really is something I am very interested in improving, I just haven't found the time.
If people have some resources to apply to the problem we will find time to help.

Regards,
Dave

On Sep 30, 2010, at 11:55 AM, Ryan Skow wrote:

> 
> Converting to XLS results in the workbook being opened in less than a second using POI
3.7b3 so that is probably what I'll do as a workaround.  Thank you very much for that idea.
> 
> To hopefully help get this problem solved for others and for future reference, I have
included answers to the rest of the questions that were asked.
> 
> General information related to the conversion and file size:
> POI Version  File Format  File Size(bytes)  Open Time(seconds)
> 3.6             xlsx        1,501,533               28
> 3.6             xls         4,951,040                1
> 3.7b3           xlsx        1,501,533              600
> 3.7b3           xls         4,951,040                1
> 
> The workbook contains 8 sheets and the (columns,rows) are as follows:
> Sheet 0: (19,123) (mostly raw numeric/text values - inputs)
> Sheet 1: (13,124) (almost all formulas)
> Sheet 2: (33,1459) (almost all formulas)
> Sheet 3: (20,1459) (almost all formulas)
> Sheet 4: (19,131) (almost all formulas)
> Sheet 5: (28, 775) (almost all raw numeric values)
> Sheet 6: (11, 109) (almost all raw numeric values)
> Sheet 7: (33, 89) (mostly tables used in VLOOKUP)
> 
> Attributes of the Workbook:
>  * There are LOTS of VLOOKUP function calls.
>  * There are LOTS regular multiply/divide/add/subtract which use back referenced cells
in their calculations.
>  * There are LOTS of nested cell references.
>  * There are quite a few 'named' cells which are referred to throughout the various calculations.
>  * There are NO references to the 'SUM' function or other array related calculations
from what I can tell.
> 
> You are correct that the performance issue has nothing to do with INDIRECT - I removed
those references and the timings were still the same.
> 
> As a side note, I did figure out what was wrong with the formula values.  It turned out
to be a problem in MathX.java and its implementation of the MOD function.  It incorrectly
reported mod(13,12) to be 0.9999999999999991 instead of simply 1.  This threw off some of
the VLOOKUP calls since the MODs were being used as the index of the VLOOKUP table.  I have
submitted a patch including test case through Bugzilla:
> https://issues.apache.org/bugzilla/show_bug.cgi?id=50033
> 
> If additional information would be helpful, please let me know and I'll track it down.
> 
> Thanks!
> 
> On 9/30/2010 6:33 AM, Yegor Kozlov wrote:
>> How big is the problem workbook? Can you tell us its size in MB, the
>> number of sheets and the size of data matrix on each sheet. Every little
>> bit of information helps.
>> 
>> Can you obfuscate sensitive data? Without a sample file we can't do much
>> to help you.
>> 
>> I suspect the slowness may be related to the support for array formulas
>> which was introduced right after 3.6-FINAL. Does your workbook contain
>> array formulas? They look something like SUM(C11:C12*D11:D12).
>> Evaluation of array formulas is not yet fully implemented and it
>> explains why you got incorrect formula results.
>> 
>> Indirect evaluation of defined names is unlikely to affect performance.
>> It was an isolated fix in the implementation of INDIRECT.
>> 
>> Try to save the file as .xls and run the same code against HSSF. How
>> much is it faster?
>> 
>> Regards,
>> Yegor
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Mime
View raw message