Return-Path:
Mailing-List: contact poi-user-help@jakarta.apache.org; run by ezmlm
Delivered-To: mailing list poi-user@jakarta.apache.org
Received: (qmail 55581 invoked from network); 19 Jul 2003 16:04:23 -0000
Received: from unknown (HELO set.superlinksoftware.com) (66.35.175.110) by daedalus.apache.org with SMTP; 19 Jul 2003 16:04:23 -0000
Received: from rdu57-249-152.nc.rr.com ([66.57.249.152]) by set.superlinksoftware.com (JAMES SMTP Server 2.1) with SMTP ID 66; Sat, 19 Jul 2003 10:25:59 -0400 (EDT)
User-Agent: Microsoft-Entourage/10.1.0.2006
Date: Sat, 19 Jul 2003 12:04:24 -0400
Subject: HELP WANTED or The merge is on! HSSF - now with improved performance for your pleasure!
From: "Andrew C. Oliver"
To: POI Developers List
CC: POI Users List, Jakarta General List
Mime-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Everyone... take a deep breath and read the next line closely before asking "how could you":

** WE'RE TALKING ABOUT THE HEAD, NOT THE RELEASE 2.0 BRANCH **

2.0 is very near ready for release. The HEAD is basically the start of 3.0!! (Of which we'll bite off a lot less this time, but it isn't coming around for a while.)

I had planned to do the whole merge myself, but there is a lot to change, and much of it *you guys* did and I don't understand. This is project success: I can't hold it all in my head anymore, and I don't even know what some of it means at a low level anymore. Many unit tests are failing, and we need to pull together and help get it all working. I consider this success on all points :-)

So I'm merging the HSSF performance branch back into the HEAD. It's going pretty smoothly, with a few gotchas:

1. Until we get the configuration stuff (see below) working, the unit tests actually take *more memory* rather than less. You may need an -Xmx flag.
The problem is that the compile-time optimizations (a holdover from back in the day when I was doing HSSF all by myself) are presently set by default for *big sheets*. You'd have this same problem in production if you tried it.

2. These aren't implemented: shared formulas, string formulas, booleans, boolean errors.

3. I busted: shifting/cloning rows, deleting rows.

4. I need to re-add the javadoc in ValueRecordsAggregate. It was there... now it's not... I'm not yet sure how, but probably in one of the times/places it was merged I just deleted it... doh! I'll probably have to rewrite it, because my Linux desktop needs a new fan and I'm afraid to boot it. (I assume grinding noises are bad for heat.)

==== The Plan ====

1. The ValueRecordsAggregate moving to primitive parallel arrays is the first step in the overall plan, which I've tried to explain in several emails but never concisely in one place.

2. One of the first things we need to do is come up with a generic configuration mechanism: a "default" config system (cached properties file?) for all threads, which can be overridden via a thread-specific model. I suggest also a "guesser" which can figure out roughly how to optimize based on file length or maybe Workbook stream length. (Not a science, I know, but for 90% of everyone this will be good enough with the defaults and maybe the guesser.) This will control the size of collection structures. Meaning: if we have a file with no more than 10 columns per row, we can use that (not a biggy, but it helps). If we know it will be 5 rows or 5000 rows, we can use that. When we dimension the rows array, we'll give it x elements. This doesn't mean that a sheet with 5 rows won't work if we configure for 5000 (and vice versa), but it determines how much memory will be used and how effectively. The 5-row sheet will waste 4995 references with the wrong setting. The 5000-row sheet will cause thousands of array copies. Either situation should be avoided/avoidable.
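The "guesser" idea above could look something like the following. This is a minimal sketch, not actual POI code: the class name, method name, and the bytes-per-row constant are all invented for illustration, and the heuristic itself is a pure assumption.

```java
// Hypothetical sketch of a row-capacity "guesser" -- NOT real POI code.
// Picks an initial size for the parallel row arrays from the Workbook
// stream length, so small sheets don't waste references and big sheets
// avoid repeated array copies.
public class RowCapacityGuesser {

    // Assumed average bytes of workbook stream per populated row
    // (made up for illustration; a real guesser would be tuned).
    private static final int BYTES_PER_ROW_ESTIMATE = 30;
    private static final int MIN_ROWS = 16;
    private static final int MAX_ROWS = 65536; // BIFF8 row limit

    public static int guessInitialRows(long streamLength) {
        long guess = streamLength / BYTES_PER_ROW_ESTIMATE;
        // Clamp so tiny files still get a sane capacity and huge
        // files never exceed the sheet's row limit.
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, guess));
    }

    public static void main(String[] args) {
        System.out.println(guessInitialRows(600));       // small sheet -> 20
        System.out.println(guessInitialRows(3_000_000)); // big sheet -> clamped to 65536
    }
}
```

With the defaults for most users and the guesser for everyone else, only the unusual cases would need an explicit per-workbook override.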
The main issue is how to get the properties into memory without reading a file every time (properties file / thread local / static, maybe). The config should be overridable per HSSFWorkbook instance for smaller or larger workbooks. How do we keep this straightforward across several places in the code and give users the ability to change it without hassle?

3. Once we get all the unit tests working and the config done, we should continue to remove our object creation. I did it at the lowest level first -- we still create the objects, we just don't hang on to them. We should work not to create them in the first place! This will be a lot of work, because it will touch the very way "record" works. We can eliminate them going in and out! Then we'll fly!

4. Once we get the object creations killed, we should work on a memory/file map system. For this to work we have to touch the lowest level (the POIFS code). This will require someone to become a very high-caliber low-level developer. I can't take this one on myself. We need someone who wants to step up to just about the most challenging thing. Unfortunately, it will be hard to get funding for something here (just laying my cards out), as I can't explain it to clients very well (they touch HSSF, not POIFS). This won't be usable in all situations, and you're going to be disappointed with the performance (but you can think I'm wrong for now), but it will be more efficient with memory... It's also WAY more work than you think. Once POIFS has memory/file mapping, we can do it on up the line for HSSF/HWPF/etc.

5. Tetsuya had some ideas that he sent me, but I've never had a chance to look at them (yet). I'm hoping he'll pipe up on the list now that he's more comfortable, discuss them with everyone, and I can digest them more from the conversation. He's shy, but I'm trying to bring him out ;-)

==== how it works ====

ValueRecordsAggregate no longer actually holds the value records.
It ditches them immediately and puts them into a structure of parallel arrays. These arrays are sized as needed.

    private final static int DEFAULT_ROWS = 10000;
    private final static int DEFAULT_COLS = 256;

    List celltype = null;         // array of HSSFCell.CELL_TYPE_XXX; tells
                                  // us which arrays to use
    List xfs = null;              // array of style types (index of XF record)
    List numericcells = null;     // numeric and shared string indices
                                  // (list of lists)
    List formulaptgs = null;      // array of arrays of PTGs
    List stringvals = null;       // array of actual string/formula string vals
    IntList populatedRows = null; // indices of populated rows
    int physCells;                // physical number of cells

The algorithm is kind of complicated... but it goes like this:

Insert cell:

    if (cell.rownum > celltype.size()) {
        add the row number to populatedRows
        grow celltype, xfs, numericcells, formulaptgs, stringvals by 1
        (each is a List of ArrayList, IntList or DoubleList with a
         length == number of rows)
    }

    ctRow = the list at element rownum of the cell to insert, from celltype
    xfRow = the list at element rownum of the cell to insert, from xfs

    if (cell.colnum > ctRow.size()) {
        grow ctRow, xfRow to size of cell.colnum + 1
    }

If it's a number:

    nmRow = numericcells.get(cell.rownum);  // (nmRow is a DoubleList)
    nmRow.set(cell.colnum, cell.value);

If it's a string:

    nmRow = numericcells.get(cell.rownum);  // (nmRow is a DoubleList)
    nmRow.set(cell.colnum, cell.labelSSTIndex);

If it's a formula:

    value goes in numericcells
    string value goes in stringvals
    ptgs go in formulaptgs

Note that I've also largely cleaned up the border between the model and the usermodel as a side effect! The great thing is that the nastiness is kept in VRA, because outside it you use the iterator (in VRA), which knows how to traverse this. You can also use getters and setters, of course, which also know how. My benchmarks, and those of others, show this is not only more efficient with memory but on average much faster.
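The parallel-array scheme above can be sketched in plain Java. This is a hedged illustration, not the real ValueRecordsAggregate: the class and method names are invented, only the numeric and string (SST index) cases are shown, and the growth strategy is simplified to the bare idea of storing cell values in per-row primitive arrays instead of record objects.

```java
// Illustrative sketch of the parallel-array idea -- NOT the actual
// ValueRecordsAggregate. One byte[] per row says what each cell is;
// one double[] per row holds either the numeric value or the index
// into the shared string table (SST).
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ParallelCellStore {
    static final byte CELL_TYPE_BLANK   = -1;
    static final byte CELL_TYPE_NUMERIC = 0;
    static final byte CELL_TYPE_STRING  = 1;

    private final List<byte[]> celltype = new ArrayList<>(); // per-row cell types
    private final List<double[]> values = new ArrayList<>(); // numbers or SST indices

    public void setNumeric(int row, int col, double value) {
        ensure(row, col);
        celltype.get(row)[col] = CELL_TYPE_NUMERIC;
        values.get(row)[col] = value;
    }

    public void setString(int row, int col, int sstIndex) {
        ensure(row, col);
        celltype.get(row)[col] = CELL_TYPE_STRING;
        values.get(row)[col] = sstIndex; // string lives in the SST, not here
    }

    public double getNumeric(int row, int col) {
        return values.get(row)[col];
    }

    // Grow the row list and this row's column arrays on demand --
    // this is where the configured/guessed sizing would pay off.
    private void ensure(int row, int col) {
        while (celltype.size() <= row) {
            byte[] ct = new byte[col + 1];
            Arrays.fill(ct, CELL_TYPE_BLANK);
            celltype.add(ct);
            values.add(new double[col + 1]);
        }
        byte[] ct = celltype.get(row);
        if (ct.length <= col) {
            byte[] grown = Arrays.copyOf(ct, col + 1);
            Arrays.fill(grown, ct.length, col + 1, CELL_TYPE_BLANK);
            celltype.set(row, grown);
            values.set(row, Arrays.copyOf(values.get(row), col + 1));
        }
    }

    public static void main(String[] args) {
        ParallelCellStore store = new ParallelCellStore();
        store.setNumeric(0, 0, 42.0);
        store.setString(5, 2, 17); // 17 = hypothetical SST index
        System.out.println(store.getNumeric(0, 0));       // 42.0
        System.out.println((int) store.getNumeric(5, 2)); // 17
    }
}
```

The point of the layout is that a populated sheet costs a handful of primitive arrays per row rather than one object per cell, which is what gets the garbage collector out of the way.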
I also think that once stage 2 happens we'll REALLY fly, because the garbage collector will be out of our way. I've also recently acquired more professional-grade profilers, which I'll use once the unit tests pass. I suggest the HSSFTests as the place to start (they encapsulate most of the basic functionality, and then some, into a test suite).

Please pipe up with your thoughts or suggestions. And if you are a lurker wanting to get involved, this is the best time/place. We have a very clear objective, and it's often easier to learn in a debugger with a big red bar of a unit test result than staring at the code wondering what you should do.

Thanks,

-Andy

cc: general, just in case any volunteers are interested.
cc: I'll blog this too ;-)

--
Andrew C. Oliver
http://www.superlinksoftware.com/poi.jsp
Custom enhancements and Commercial Implementation for Jakarta POI

http://jakarta.apache.org/poi
For Java and Excel, Got POI?