poi-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From bugzi...@apache.org
Subject DO NOT REPLY [Bug 51160] New: [PATCH] Provide ability to produce very large xlsx files using the regular high level API with the default heap space of 64MB
Date Fri, 06 May 2011 13:26:14 GMT
https://issues.apache.org/bugzilla/show_bug.cgi?id=51160

             Bug #: 51160
           Summary: [PATCH] Provide ability to produce very large xlsx
                    files using the regular high level API with the
                    default heap space of 64MB
           Product: POI
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: XSSF
        AssignedTo: dev@poi.apache.org
        ReportedBy: ag@4js.com
    Classification: Unclassified


Created attachment 26964
  --> https://issues.apache.org/bugzilla/attachment.cgi?id=26964
Patch to apply from the root of the poi-3.6-20091214 source tree

In response to a discussion with Nick Burch in the user forum (see
http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-td4312784.html)
I am submitting an example implementation to provide support for very large
xlsx files via the regular high level API without requiring more memory that is
provided by the default heap space.

This text assumes knowledge of the thread mentioned above so that the problem
and its proposed solution are not repeated again here.

General remarks
-------------------------
1) The code is far from being ready for general availability but I am
submitting it to see if there is a chance that it may become part of the
general source tree. If yes, then I will spend more time on it and do what I
can to make it feature complete.
2) I made the patch on the 3.6 version because that is what we currently are
using. That of course is nonsense and I should have done it on the 3.8 beta2
but for having a first look, I thought that it wouldn't make a difference. If
for whatever reason that is inconvenient to you, please tell me and I will
apply the patch to the 3.8 version and make the necessary changes if any.

Description of the patch
----------------------------------
The patch adds a demo application PoiTest (This is the application mentioned in
the thread that was used to do the benchmarks) in
patched/src/examples/src/org/apache/poi/ss/examples.
This console program takes some command line arguments and produces sheets in
the formats xls. and xlsx using HSSF, XSSF and SXSSF. The SXSSF classes are new
(The 'S' in the name stands for "streaming)" and they are the main subject of
this proposal.
The program creates very simple workbooks (only one sheet, no styles, no
images, no merged regions, only numeric data) but the implementation handles
the mentioned items and string values too. I expect that many of your unit test
will fail but us this is already sufficient to create the most sophisticated
output we have.
The patch includes a shell script at the root "RunPoiTest.sh" which contains an
example invocation which creates a 300,000 x 33 sheet using SXSSF.
The patch introduces 4 source files in the
"ooxml/java/org/apache/poi/xssf/usermodel" package.
1) SXSSFWorkbook extends XSSFWorkbook.

This class overloads write() to do the "BigGridDemo" patching stuff (Had to
remove the "final" on the method in POIXMLDocument for that).
I tricked the class into creating SXSSFSheets (instead of XSSFSheets) in calls
to createSheet() and cloneSheet() by means of a virtual function
XSSFWorkbook.getWorksheetRelation(). It seems to work but it is done without
understanding your instantiation code at all.

2) SXSSFSheet extends XSSFSheet

This class overloads all methods that deal with row creation and management.
Instead of producing XSSFRow instances in calls to createRow() it returns
SXSSFRow instances.
I decided to make SXSSFRow implement the "Row" interface from scratch rather
then making it a subclass of XSSFRow. This causes the following problem:
Since XSSFSeet specializes some (or all?) functions from the interface Sheet
that return or take a parameter of type "Row" to "XSSFRow" I could not overload
those methods because SXSSFRow is not a subclass of XSSFRow as in the following
example: 
The signature of Sheet.createRow() in ss.usermodel.Sheet is "public Row
createRow(...)"
In xssf.usermodel.XSSFSheet is is "public XSSFRow createRow(..)"
Now in xssf.usermodel.SXSSFSheet I want to overload the function and write
"public Row createRow(...)" or perhaps even "public SXSSRow createRow()".
Neither is possible because you can only specialize but not generalize on
overloading a method in a sub class.

I solved the problem the wrong way. What I did was to remove the specialization
in the signatures of the function in XSSFSheet so that the signatures are
identical with the interface definitions.  As a result of this I had to add
casts in many places in classes in the xssf package. So far, this is fine
because a XSSFSheet may safely assume that "Row XSSFSheet.getRow(int index)"
will in reality return a SXSSFRow.

By the time I had worked my self through the unit test replacing XSSFRow
row=sheet.createRow() with "Row row=sheet.createRow()" and started on the
examples I realized that this had not been the smartest idea. If this were
version 1.0 of POI then that would have been fine but asking everybody to
change from "XSSFRow" to "Row" is perhaps not such a good idea. 
Nevertheless I did it that way for the moment it and we can't solve it any
other way, we can always resort to composition (implement SXSSFWorkbook and
SXSSFSheet form scratch) which is what I did on my local version here.
Incidentally, this also solves the problem that "final void write()" cannot be
overloaded.

The class uses an inner class "SheetDataWriter" which is inspired by the
"SpreadsheetWriter" class from the "BigGridDemo". Its functionality has been
reduced to the capability to serialize the "<sheetData>" containing the rows
only via calls to SheetDataWriter.writeRow(int rowIndex,Row row). 
Unlike the "BigGridDemo" which produces the entire file, this class produces a
document fragment file which is then injected into the "sheet" file produced by
XSSFWorkbook.write(). If you find this confusing then look at the function
SXSSFWorkbook.write(). 

3) SXSSFRow implements Row
This is a lightweight implementation of a Row with array storage for the cells. 
The serialization is done by its owning SXSSFSheet.
The function row.createCell() returns SXSSFCell instances described below.

3) SXSSFCell implements Cell
This is a lightweight implementation of a Cell with an array storage for the
cells. It can be made a lot lighter but this naive implementation is
sufficiently lean to provide a 5000 row cache in the default heap space without
any problems at all. I haven't tested the actual numbers but guessing from the
other naive models I played with earlier I would expect a cell to take per
average something between 70 to 100 byte which is close to what HSSF consumes.

I added "//TODO:" wherever there I was aware that I didn't implement something
or when I was unsure about something.

OK, I probably forgot some items but I can't think of anything else for the
moment.

-- 
Configure bugmail: https://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Mime
View raw message