poi-user mailing list archives

From "Javen O'Neal" <javenon...@gmail.com>
Subject Re: SSPerformanceTest: Is the FAQ still accurate?
Date Tue, 12 Apr 2016 12:02:58 GMT
Memory consumption and performance are a balancing act. POI adds data
structures on top of the XML beans that make lookups faster, but at
the cost of duplicating the memory across multiple data structures.
Until we can read in an OOXML file, write data structures that can
fully capture the XML content, free the XML beans, and recreate the
XML beans on write, and do so without corrupting or losing
information, POI will be a high-memory consumer. Additionally, we're
using XMLBeans 2.6, an older (discontinued) library that may not be as
efficient as other XML libraries.
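
To illustrate the trade-off described above, here is a toy sketch in
plain Java. It is not POI's or XMLBeans' actual code: the class
IndexedRows and its RawRow type are hypothetical stand-ins for parsed
XML nodes. It only shows how an index layered on top of the raw nodes
buys fast lookups at the price of a second structure in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model: raw "XML-like" rows plus a lookup index layered on top. */
public class IndexedRows {
    /** Hypothetical stand-in for a parsed XML row node (not a POI type). */
    public static final class RawRow {
        public final int rowNum;
        public final String[] cells;

        public RawRow(int rowNum, String[] cells) {
            this.rowNum = rowNum;
            this.cells = cells;
        }
    }

    // Rows in document order, as a parser would produce them.
    private final List<RawRow> raw = new ArrayList<>();
    // Extra structure kept only to make lookups by row number fast.
    private final TreeMap<Integer, RawRow> byRowNum = new TreeMap<>();

    public void add(int rowNum, String... cells) {
        RawRow row = new RawRow(rowNum, cells);
        raw.add(row);              // first structure
        byRowNum.put(rowNum, row); // second structure: more memory per row
    }

    /** Lookup without the index: linear scan over the raw rows. */
    public RawRow scan(int rowNum) {
        for (RawRow row : raw) {
            if (row.rowNum == rowNum) {
                return row;
            }
        }
        return null;
    }

    /** Lookup with the index: logarithmic in the number of rows. */
    public RawRow lookup(int rowNum) {
        return byRowNum.get(rowNum);
    }

    public int rawSize() {
        return raw.size();
    }

    public int indexSize() {
        return byRowNum.size();
    }
}
```

Both lookups return the same row; the index is faster for large sheets,
but every row now costs a map entry in addition to the node itself.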

Also consider that other libraries that can read and write Microsoft
Office files support a different set of features and are
performance-optimized (with auxiliary data structures) in certain
cases.

I hope that clears up some of the questions/concerns you had. Feel free
to use memory profiling tools for the HotSpot JVM to figure out where
all the memory is going (a lot of strings stored in the XML nodes, if I
remember correctly) and submit patches where you think we could be
doing a better job on memory consumption without sacrificing
performance.
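
Before reaching for a full profiler, a rough first measurement can be
taken with the JDK alone. The sketch below is only an assumption about
how one might do it: HeapProbe and measure are hypothetical names, and
the byte array in main is a placeholder for the real load, which could
be a call such as WorkbookFactory.create(file). The number is
approximate, because a garbage collection request is only a hint to the
JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.function.Supplier;

/** Rough heap and time measurement around a single load operation. */
public class HeapProbe {
    /** The loaded object plus the heap growth and time observed. */
    public static final class Result<T> {
        public final T value;
        public final long heapDeltaBytes;
        public final long elapsedMillis;

        public Result(T value, long heapDeltaBytes, long elapsedMillis) {
            this.value = value;
            this.heapDeltaBytes = heapDeltaBytes;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Requests a GC, then reads the used heap. Approximate by nature. */
    private static long usedHeap() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        memory.gc();
        return memory.getHeapMemoryUsage().getUsed();
    }

    /** Runs the loader once and reports heap growth while its result is alive. */
    public static <T> Result<T> measure(Supplier<T> loader) {
        long before = usedHeap();
        long start = System.nanoTime();
        T value = loader.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        // 'value' is still reachable here, so it counts toward the used heap.
        long after = usedHeap();
        return new Result<>(value, after - before, elapsedMillis);
    }

    public static void main(String[] args) {
        // Placeholder load: replace with the real workbook load to measure it.
        Result<byte[]> r = measure(() -> new byte[16 * 1024 * 1024]);
        System.out.printf("heap delta: %d MB, elapsed: %d ms%n",
                r.heapDeltaBytes / (1024 * 1024), r.elapsedMillis);
    }
}
```

This only says how much memory is retained, not where it goes; a heap
dump or profiler is still needed to find the largest consumers.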

On Tue, Apr 12, 2016 at 4:36 AM, Jack of Shadows
<somerandomlogin@gmail.com> wrote:
> Yes, that is understandable. However, in my tests memory usage to parse a
> file with 55000 rows is 1.5 GB -- isn't that a bit too high?
> I've tested LibXL with the same file -- memory usage is just 240 MB.
>
> On Tue, Apr 12, 2016 at 2:09 PM, Murphy, Mark <murphymdev@metalexmfg.com>
> wrote:
>
>> XSSF is an XML document. Given that XML is generally about 70-80% overhead
>> vs. data, it is not surprising that binary spreadsheets (which can be
>> optimized, and have very little overhead) are more memory efficient. In
>> addition, XML must be parsed, but binary documents can frequently be
>> accessed using pointers and data structures. That gives the binary formats
>> a performance edge, which can be significant. I'm not sure how Microsoft
>> handles spreadsheets internally, but maybe they keep an internal binary
>> format, and then write it to whatever format is requested on save rather
>> than using an internal XML representation for an XML spreadsheet, which is
>> what POI is doing.
>>
>> -----Original Message-----
>> From: Jack of Shadows [mailto:somerandomlogin@gmail.com]
>> Sent: Monday, April 11, 2016 7:46 AM
>> To: POI Users List
>> Subject: Re: SSPerformanceTest: Is the FAQ still accurate?
>>
>> XSSF is basically unusable. 25000 or 50000 isn't that many rows. Memory
>> consumption is pretty high too.
>> That's really confusing, I wouldn't have been surprised if HSSF performed
>> poorly -- but it actually works better.
>> Ohh well, whatever, I guess I'd have to use SXSSF instead.
>>
>> On Mon, Apr 11, 2016 at 12:04 AM, Dominik Stadler <dominik.stadler@gmx.at>
>> wrote:
>>
>> > Hi,
>> >
>> > Not sure which exact machine spec the information in the FAQ is based
>> > on, maybe there is something that can have quite a big influence on
>> > runtime of this sample for XSSF, e.g. which actual JDK is used,
>> > Linux/Windows, ... ?!
>> >
>> > I did a quick run of it across various versions of POI to see if we
>> > degraded performance at some point, but for me it has always been
>> > this way, i.e. HSSF very quick, SXSSF fairly quick (although very
>> > slow in early releases) and XSSF quite a bit slower. Maybe we need to
>> > adjust the FAQ entry some more here to set correct expectations?
>> >
>> > (Exact numbers here are not that relevant as I used my 6+ year old
>> > laptop where I was doing other things at the same time, albeit no CPU
>> > intensive things, JVM was Sun 6.0, Linux Ubuntu, 25000 rows, 25 cols)
>> >
>> >
>> > latest-2016-04-10:
>> >
>> > Elapsed 2 seconds
>> > Elapsed 15 seconds
>> > Elapsed 5 seconds
>> >
>> >
>> > 2014-03-22 (the FAQ-Entry was added)
>> >
>> > Elapsed 1 seconds
>> > Elapsed 14 seconds
>> > Elapsed 3 seconds
>> >
>> >
>> > 3.10:
>> >
>> > Elapsed 2 seconds
>> > Elapsed 14 seconds
>> > Elapsed 3 seconds
>> >
>> >
>> > 3.9:
>> >
>> > Elapsed 1 seconds
>> > Elapsed 12 seconds
>> > Elapsed 3 seconds
>> >
>> >
>> > 3.8:
>> >
>> > Elapsed 2 seconds
>> > Elapsed 15 seconds
>> > Elapsed 3 seconds
>> >
>> >
>> > initial checkin of SSPerformanceTest:
>> >
>> > Elapsed 1 seconds
>> > Elapsed 14 seconds
>> > Elapsed 47 seconds
>> >
>> >
>> > Dominik.
>> >
>> >
>> >
>> >
>> > On Sun, Apr 10, 2016 at 5:59 PM, Jack <somerandomlogin@gmail.com> wrote:
>> >
>> > > I'm having the exact same issue, I've tracked down this message from
>> > > StackOverflow.
>> > > I've tested read performance on two XLS and XLSX with identical
>> > > content (around 75000 rows, 25 columns).
>> > > HSSF takes under 5 sec; XSSF takes 15-20 sec.
>> > >
>> > > Any idea what is the issue with XSSF performance?
>> > >
>> > >
>> > > On 15.02.2016 17:00, Drew Spencer wrote:
>> > >
>> > >> Mike DeHaan <mike <at> mikeandzoya.com> writes:
>> > >>
>> > >>> As a followup, a user has replied to my stack overflow post with
>> > >>> some information that might be helpful in tracking this issue
>> > >>> down. Here is the link to his post:
>> > >>>
>> > >>> http://stackoverflow.com/a/34266795/4471563
>> > >>>
>> > >>> I ran the same tests in my environments and came up with similar
>> > >>> numbers.
>> > >>>
>> > >>> -Mike DeHaan
>> > >>
>> > >> I have also asked the same question. Would love to get an answer to
>> > >> this either way. My similar post on StackOverflow is here:
>> > >> http://stackoverflow.com/questions/34995058/apache-poi-much-quicker-using-hssf-than-xssf-what-next
>> > >>
>> > >> I received a good answer with the link to the streaming reader,
>> > >> but unfortunately I don't think I can use it because my code runs
>> > >> on app engine.
>> > >>
>> > >> Thanks to anyone that can help.
>> > >>
>> > >> Drew Spencer
>> > >>
>> > >>
>> > >> ---------------------------------------------------------------------
>> > >> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
>> > >> For additional commands, e-mail: user-help@poi.apache.org
>> > >>
>> > >>
>> > >>
>> > >
>> > >
>> >
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

