poi-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Fisher <dfis...@jmlafferty.com>
Subject Re: POI 3.7 beta 2 "new XSSFWorkbook" very slow (MOD function)
Date Tue, 05 Oct 2010 19:53:13 GMT
Hi David,

> But, right or wrong; I was really just trying to ask the question:
> "shouldn't poi be trying to get the same result as Excel too?"


Now we are getting deep because for IEEE 754 it all depends. For example, POI runs on Java
and not all hardware even on the same OS produces the same results. For example, on Solaris
you do get differing results in C and Fortran when you are running on Sparc vs i386 CPUs.

http://developers.sun.com/solaris/articles/fp_errors.html

So, there are speed vs. precision vs. acceptable error tradeoffs here. Similar issue about
why COBOL and why FORTRAN.

We have a thread on the developer list where we are discussing an interface for plugging User
Defined Functions into POI's formula evaluation engine.

Maybe, we ought to consider alternative versions of standard functions. Or allow some type
of precision preference. Would you care to come over to the developer list and discuss the
thread titled "how to add functions to POI"

Regards,
Dave

On Oct 5, 2010, at 9:25 AM, David Law wrote:

> Hi Yegor,
> 
> I think I probably didn't express myself quite clearly there...
> ...I should have put that "right" in quotes, meaning
> that OpenOffice gets the /same/ result as Excel.
> 
> Its only "right" in the sense that its the same as Excel,
> but the Excel result may well be "wrong".
> 
> But, right or wrong; I was really just trying to ask the question:
> "shouldn't poi be trying to get the same result as Excel too?"
> 
> Aside:
> /"Confused? You won't be, after this week's episode of...Soap"
> /There, showing my age!!  Sorry, couldn't resist that. :-)
> 
> Best regards,
> DaveLaw
> 
> On 05.10.2010 17:29, Yegor Kozlov wrote:
>> Different products address rounding errors differently. The question is what to consider
the "correct" result.
>> 
>> Consider the following formula:
>>    =(0.5-0.4-0.1)
>> 
>> You may think it evaluates to 0.00 but it does not!
>> 
>> MS Office and POI evaluate it as -2.78E-17 and this is a correct result for 'strict'
floating-point math.
>> 
>> However, OpenOffice and Google Spreadsheet evaluate it to 0.00.
>> 
>> The equivalent formula SUM(0.5,-0.4,-0.1) results in the same values: -2.78E-17 in
MS Office and 0.00 in OpenOffice.
>> 
>> Some information about Excel's floating-point arithmetic can found at http://support.microsoft.com/kb/78113
and http://support.microsoft.com/kb/214118
>> 
>> Yegor
>> 
>>> Hi,
>>> 
>>> sorry its taken so long to post this, I've been "mulling over"...
>>> ...and I see the patch has been applied.
>>> 
>>> As Yegor mentions in the Bug report...
>>> mod(-3.4,2.0) being 0.60...01
>>> ...is due to binary arithmetic.
>>> 
>>> Now I just wondered how OpenOffice got it right and stumbled over
>>> some code to round to 15 digits in what seems to be the mod function.
>>> 
>>> Also, surely the aim here is to produce the same result as Excel,
>>> which I tried playing with a bit & seems to round an awful lot of
>>> things to just 14 digits, although it seems to be able to deal with
>>> values MUCH smaller than that.
>>> 
>>> Anyway, this was just intended to get a bit of discussion going on
>>> the matter, if there's anything to discuss that is.  :-)
>>> 
>>> Best regards,
>>> DaveLaw
>>> 
>>> Ryan Skow wrote:
>>>> 
>>>> Converting to XLS results in the workbook being opened in less than a second
using POI 3.7b3 so that is probably what I'll do as a workaround.  Thank you very much for
that idea.
>>>> 
>>>> To hopefully help get this problem solved for others and for future reference,
I have included answers to the rest of the questions that were asked.
>>>> 
>>>> General information related to the conversion and file size:
>>>> POI Version  File Format  File Size(bytes)  Open Time(seconds)
>>>> 3.6             xlsx        1,501,533               28
>>>> 3.6             xls         4,951,040                1
>>>> 3.7b3           xlsx        1,501,533              600
>>>> 3.7b3           xls         4,951,040                1
>>>> 
>>>> The workbook contains 8 sheets and the (columns,rows) are as follows:
>>>> Sheet 0: (19,123) (mostly raw numeric/text values - inputs)
>>>> Sheet 1: (13,124) (almost all formulas)
>>>> Sheet 2: (33,1459) (almost all formulas)
>>>> Sheet 3: (20,1459) (almost all formulas)
>>>> Sheet 4: (19,131) (almost all formulas)
>>>> Sheet 5: (28, 775) (almost all raw numeric values)
>>>> Sheet 6: (11, 109) (almost all raw numeric values)
>>>> Sheet 7: (33, 89) (mostly tables used in VLOOKUP)
>>>> 
>>>> Attributes of the Workbook:
>>>>  * There are LOTS of VLOOKUP function calls.
>>>>  * There are LOTS regular multiply/divide/add/subtract which use back referenced
cells in their calculations.
>>>>  * There are LOTS of nested cell references.
>>>>  * There are quite a few 'named' cells which are referred to throughout the
various calculations.
>>>>  * There are NO references to the 'SUM' function or other array related calculations
from what I can tell.
>>>> 
>>>> You are correct that the performance issue has nothing to do with INDIRECT
- I removed those references and the timings were still the same.
>>>> 
>>>> As a side note, I did figure out what was wrong with the formula values.
 It turned out to be a problem in MathX.java and its implementation of the MOD function. 
It incorrectly reported mod(13,12) to be 0.9999999999999991 instead of simply 1.  This threw
off some of the VLOOKUP calls since the MODs were being used as the index of the VLOOKUP table.
 I have submitted a patch including test case through Bugzilla:
>>>> https://issues.apache.org/bugzilla/show_bug.cgi?id=50033
>>>> 
>>>> If additional information would be helpful, please let me know and I'll track
it down.
>>>> 
>>>> Thanks!
>>>> 
>>>> On 9/30/2010 6:33 AM, Yegor Kozlov wrote:
>>>>> How big is the problem workbook? Can you tell us its size in MB, the
>>>>> number of sheets and the size of data matrix on each sheet. Every little
>>>>> bit of information helps.
>>>>> 
>>>>> Can you obfuscate sensitive data? Without a sample file we can't do much
>>>>> to help you.
>>>>> 
>>>>> I suspect the slowness may be related to the support for array formulas
>>>>> which was introduced right after 3.6-FINAL. Does your workbook contain
>>>>> array formulas? They look something like SUM(C11:C12*D11:D12).
>>>>> Evaluation of array formulas is not yet fully implemented and it
>>>>> explains why you got incorrect formula results.
>>>>> 
>>>>> Indirect evaluation of defined names is unlikely to affect performance.
>>>>> It was an isolated fix in the implementation of INDIRECT.
>>>>> 
>>>>> Try to save the file as .xls and run the same code against HSSF. How
>>>>> much is it faster?
>>>>> 
>>>>> Regards,
>>>>> Yegor
>>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>>>> For additional commands, e-mail: user-help@poi.apache.org
>>>> 
>>>> 
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>>> For additional commands, e-mail: user-help@poi.apache.org
>>> 
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> For additional commands, e-mail: user-help@poi.apache.org
>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Mime
View raw message