poi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From bugzi...@apache.org
Subject [Bug 57840] [PATCH] Support for structured references with Excel tables.
Date Fri, 10 Jun 2016 01:24:02 GMT
https://bz.apache.org/bugzilla/show_bug.cgi?id=57840

--- Comment #12 from Greg Woolsey <greg.woolsey@gmail.com> ---
(In reply to Javen O'Neal from comment #11)
> Moved getTable cache from XSSFWorkbook to XSSFEvaluationWorkbook to reduce
> the frequency of problems with a stale cache in r1747615-r1747616.

This kills performance - the reason I moved it to XSSFWorkbook in the first
place was performance.

Parsing and evaluating formulas on a moderately sized workbook (one table with
6 columns and 50,00 rows, one computed column) and a sheet with a small table
of formulas referencing the data table (VLOOKUPs mostly) took over an hour on a
P5 with 2GB allocated to the VM.

Moving it to XSSFWorkbook reduced this to 6 minutes, which is still awful, the
the problem at that point was the auto-boxing of rownums in all the row methods
in XSSFSheet (15% of total time spent in Integer.compareTo).  Lots of
references via Google for performance tests with auto-boxing vs. primitives vs.
explicitly referenced and retained primitive wrappers (Integer/Long).

The table lookup was performed hundreds of thousands of times, as every cell
evaluation created a new instance of XSSFEvaluationWorkbook, causing the cache
to be rebuilt.  Even just doing a linear search, if there are 3 or 4 tables,
regardless of size, is hugely expensive because the XMLBeans package is
terribly slow for repeated lookups.

I tried to balance the inconvenience/problem of a stale cache with performance.

I think, but haven't verified yet, that the real performance issue is that when
evaluating all the cell formulas in a workbook, if multiple cells reference the
same expression, the expression isn't cached, but recalculated over and over
for all those cells.  This is probably worst with range functions, when a
column of cells do the same VLOOKUP over and over with different inputs against
the same range.  Especially if that range contains formula cells.

All that to say, moving this back is going to have an extreme performance cost,
and the real issues are deep.  They probably deserve their own separate issues:

1. formula evaluations that require access to XML elements/attributes
2. XSSFSheet rownum auto-boxing
3. Formula evaluation intermediate result caching

Eventually I'll try to capture just how many table lookups by name are done vs.
how many formulas reference the table, to see how bad it is.

-- 
You are receiving this mail because:
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Mime
View raw message