Return-Path:
Mailing-List: contact poi-user-help@jakarta.apache.org; run by ezmlm
Delivered-To: mailing list poi-user@jakarta.apache.org
Received: (qmail 55581 invoked from network); 19 Jul 2003 16:04:23 -0000
Received: from unknown (HELO set.superlinksoftware.com) (66.35.175.110) by daedalus.apache.org with SMTP; 19 Jul 2003 16:04:23 -0000
Received: from rdu57-249-152.nc.rr.com ([66.57.249.152]) by set.superlinksoftware.com (JAMES SMTP Server 2.1) with SMTP ID 66; Sat, 19 Jul 2003 10:25:59 -0400 (EDT)
User-Agent: Microsoft-Entourage/10.1.0.2006
Date: Sat, 19 Jul 2003 12:04:24 -0400
Subject: HELP WANTED or The merge is on! HSSF - now with improved performance for your pleasure!
From: "Andrew C. Oliver"
To: POI Developers List
CC: POI Users List, Jakarta General List
Mime-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Everyone... take a deep breath and read the next line closely before asking "how could you":

** WE'RE TALKING ABOUT THE HEAD, NOT THE RELEASE 2.0 BRANCH **

2.0 is very near ready for release. The HEAD is basically the start of 3.0!! (Of which we'll bite off a lot less this time, but it isn't coming around for a while.)

I had planned to do the whole merge myself, but there is a lot to change, and much of it *you guys* did and I don't understand. This is project success: I can't hold it all in my head anymore, and I don't even know what some of it means at a low level anymore. Many unit tests are failing, and we need to pull together and help get it all working. I consider this success on all points :-)

So I'm merging the HSSF performance branch back into the HEAD. It's going pretty smoothly, with a few gotchas:

1. Until we get the configuration stuff (see below) working, the unit tests actually take *more memory* rather than less. You may need an -Xmx flag.
The problem is that the compile-time optimizations (a holdover from back in the day when I was doing HSSF all by myself) are presently set by default for *big sheets*. You'd have this same problem in production if you tried it.

2. These aren't implemented: shared formulas, string formulas, booleans, boolean errors.

3. I busted: shifting/cloning rows, deleting rows.

4. I need to re-add the javadoc in ValueRecordsAggregate. It was there... now it's not... I'm not yet sure how, but probably in one of the times/places it was merged I just deleted it... doh! I'll probably have to rewrite it, because my Linux desktop needs a new fan and I'm afraid to boot it. (I assume grinding noises are bad for heat.)

==== The Plan ====

1. The ValueRecordsAggregate moving to primitive parallel arrays is the first step in the overall plan, which I've tried to explain in several emails but never concisely in one place.

2. One of the first things we need to do is come up with a generic configuration mechanism: a "default" config system (cached properties file?) for all threads, which can be overridden via a thread-specific model. I suggest also a "guesser" which can figure out roughly how to optimize based on file length or maybe Workbook stream length. (Not a science, I know, but for 90% of everyone this will be good enough with the defaults and maybe the guesser.) This will control the size of collection structures. Meaning: if we have a file with no more than 10 columns per row, we can use that (not a biggy, but it helps). If we know it will be 5 rows or 5000 rows, we can use that. When we dimension the rows array, we'll give it x elements. This doesn't mean that a sheet with 5 rows won't work if we configure for 5000 (and vice versa), but it determines how much memory will be used and how effectively. The 5-row sheet will waste 4995 references with the wrong setting. The 5000-row sheet will cause thousands of array copies. Either situation should be avoided/avoidable.
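The "guesser" idea above could look something like the following. This is a minimal sketch, not actual POI code: the class name, method name, and the bytes-per-row constant are all invented for illustration, and the heuristic itself is a pure assumption.

```java
// Hypothetical sketch of a row-capacity "guesser" -- NOT real POI code.
// Picks an initial size for the parallel row arrays from the Workbook
// stream length, so small sheets don't waste references and big sheets
// avoid repeated array copies.
public class RowCapacityGuesser {

    // Assumed average bytes of workbook stream per populated row
    // (made up for illustration; a real guesser would be tuned).
    private static final int BYTES_PER_ROW_ESTIMATE = 30;
    private static final int MIN_ROWS = 16;
    private static final int MAX_ROWS = 65536; // BIFF8 row limit

    public static int guessInitialRows(long streamLength) {
        long guess = streamLength / BYTES_PER_ROW_ESTIMATE;
        // Clamp so tiny files still get a sane capacity and huge
        // files never exceed the sheet's row limit.
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, guess));
    }

    public static void main(String[] args) {
        System.out.println(guessInitialRows(600));       // small sheet -> 20
        System.out.println(guessInitialRows(3_000_000)); // big sheet -> clamped to 65536
    }
}
```

With the defaults for most users and the guesser for everyone else, only the unusual cases would need an explicit per-workbook override.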
The main issue is how to get the properties into memory without reading a file every time (properties file / thread local / static, maybe). The config should be overridable per HSSFWorkbook instance for smaller or larger workbooks. How do we keep this straightforward across several places in the code and give users the ability to change it without hassle?

3. Once we get all the unit tests working and the config done, we should continue to remove our object creation. I did it at the lowest level first -- we still create the objects, we just don't hang on to them. We should work not to create them in the first place! This will be a lot of work, because it will touch the very way "record" works. We can eliminate them going in and out! Then we'll fly!

4. Once we get the object creations killed, we should work on a memory/file map system. For this to work we have to touch the lowest level (the POIFS code). This will require someone to become a very high-caliber low-level developer. I can't take this one on myself. We need someone who wants to step up to just about the most challenging thing. Unfortunately, it will be hard to get funding for something here (just laying my cards out), as I can't explain it to clients very well (they touch HSSF, not POIFS). This won't be usable in all situations, and you're going to be disappointed with the performance (but you can think I'm wrong for now), but it will be more efficient with memory... It's also WAY more work than you think. Once POIFS has memory/file mapping, we can do it on up the line for HSSF/HWPF/etc.

5. Tetsuya had some ideas that he sent me, but I've never had a chance to look at them (yet). I'm hoping he'll pipe up on the list now that he's more comfortable, discuss them with everyone, and I can digest them more from the conversation. He's shy, but I'm trying to bring him out ;-)

==== how it works ====

ValueRecordsAggregate no longer actually holds the value records.
It ditches them immediately and puts them into a structure of parallel arrays. These arrays are sized as needed.

    private final static int DEFAULT_ROWS = 10000;
    private final static int DEFAULT_COLS = 256;

    List celltype = null;         // array of HSSFCell.CELL_TYPE_XXX; tells
                                  // us which arrays to use
    List xfs = null;              // array of style types (index of XF record)
    List numericcells = null;     // numeric and shared string indices
                                  // (list of lists)
    List formulaptgs = null;      // array of arrays of PTGs
    List stringvals = null;       // array of actual string/formula string vals
    IntList populatedRows = null; // indices of populated rows
    int physCells;                // physical number of cells

The algorithm is kind of complicated... but it goes like this:

Insert cell:

    if (cell.rownum > celltype.size()) {
        add the row number to populatedRows
        grow celltype, xfs, numericcells, formulaptgs, stringvals by 1
        (each is a List of ArrayList, IntList or DoubleList with a
         length == number of rows)
    }

    ctRow = the list at element rownum of the cell to insert, from celltype
    xfRow = the list at element rownum of the cell to insert, from xfs

    if (cell.colnum > ctRow.size()) {
        grow ctRow, xfRow to size of cell.colnum + 1
    }

If it's a number:

    nmRow = numericcells.get(cell.rownum);  // (nmRow is a DoubleList)
    nmRow.set(cell.colnum, cell.value);

If it's a string:

    nmRow = numericcells.get(cell.rownum);  // (nmRow is a DoubleList)
    nmRow.set(cell.colnum, cell.labelSSTIndex);

If it's a formula:

    value goes in numericcells
    string value goes in stringvals
    ptgs go in formulaptgs

Note that I've also largely cleaned up the border between the model and the usermodel as a side effect! The great thing is that the nastiness is kept in VRA, because outside it you use the iterator (in VRA), which knows how to traverse this. You can also use getters and setters, of course, which also know how. My benchmarks, and those of others, show this is not only more efficient with memory but on average much faster.
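The parallel-array scheme above can be sketched in plain Java. This is a hedged illustration, not the real ValueRecordsAggregate: the class and method names are invented, only the numeric and string (SST index) cases are shown, and the growth strategy is simplified to the bare idea of storing cell values in per-row primitive arrays instead of record objects.

```java
// Illustrative sketch of the parallel-array idea -- NOT the actual
// ValueRecordsAggregate. One byte[] per row says what each cell is;
// one double[] per row holds either the numeric value or the index
// into the shared string table (SST).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParallelCellStore {
    static final byte CELL_TYPE_BLANK   = -1;
    static final byte CELL_TYPE_NUMERIC = 0;
    static final byte CELL_TYPE_STRING  = 1;

    private final List<byte[]> celltype = new ArrayList<>(); // per-row cell types
    private final List<double[]> values = new ArrayList<>(); // numbers or SST indices

    public void setNumeric(int row, int col, double value) {
        ensure(row, col);
        celltype.get(row)[col] = CELL_TYPE_NUMERIC;
        values.get(row)[col] = value;
    }

    public void setString(int row, int col, int sstIndex) {
        ensure(row, col);
        celltype.get(row)[col] = CELL_TYPE_STRING;
        values.get(row)[col] = sstIndex; // string lives in the SST, not here
    }

    public double getNumeric(int row, int col) {
        return values.get(row)[col];
    }

    // Grow the row list and this row's column arrays on demand --
    // this is where the configured/guessed sizing would pay off.
    private void ensure(int row, int col) {
        while (celltype.size() <= row) {
            byte[] ct = new byte[col + 1];
            Arrays.fill(ct, CELL_TYPE_BLANK);
            celltype.add(ct);
            values.add(new double[col + 1]);
        }
        byte[] ct = celltype.get(row);
        if (ct.length <= col) {
            byte[] grown = Arrays.copyOf(ct, col + 1);
            Arrays.fill(grown, ct.length, col + 1, CELL_TYPE_BLANK);
            celltype.set(row, grown);
            values.set(row, Arrays.copyOf(values.get(row), col + 1));
        }
    }

    public static void main(String[] args) {
        ParallelCellStore store = new ParallelCellStore();
        store.setNumeric(0, 0, 42.0);
        store.setString(5, 2, 17); // 17 = hypothetical SST index
        System.out.println(store.getNumeric(0, 0));       // 42.0
        System.out.println((int) store.getNumeric(5, 2)); // 17
    }
}
```

The point of the layout is that a populated sheet costs a handful of primitive arrays per row rather than one object per cell, which is what gets the garbage collector out of the way.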
I also think that once stage 2 happens we'll REALLY fly, because the garbage collector will be out of our way. I've also recently acquired more professional-grade profilers, which I'll use once the unit tests pass. I suggest the HSSFTests as the place to start (they encapsulate most of the basic functionality, and then some, into a test suite).

Please pipe up with your thoughts or suggestions. And if you are a lurker wanting to get involved, this is the best time/place. We have a very clear objective, and it's often easier to learn in a debugger with a big red bar of a unit test result than staring at the code wondering what you should do.

Thanks,

-Andy

cc: general, just in case any volunteers are interested.
cc: I'll blog this too ;-)

--
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?