poi-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Javen O'Neal" <javenon...@gmail.com>
Subject Re: Fill and Border efficiency for XSSF format
Date Tue, 22 Sep 2015 20:54:42 GMT
This problem isn't limited to just fill and border cell styles, but
cell styles in general. I have a similar problem when creating a
workbook with many (identical) styles
Along the same note, creating a new workbook, then calling:

for (int i=0; i<1000; i++) {
  wb.createCellStyle()
}

And saving the workbook produces a workbook with 1000 styles, even
though the styles are all duplicates (easy to check) and never used (a
little more difficult/expensive to check). Opening, saving, and
closing the 1000 style workbook in Excel will consolidate the styles.
I believe the style consolidation is done when the file is saved, but
it's possible it's done when the file is opened in Excel. I'd need to
write some VBA code to verify that

In my POI application, I want to take a cell and change just the data
format of the cell style (either 0.00 or 0.000)
Let's say I'm starting with a workbook where all numbers are formatted
as 0.00 and I want to format a few columns as 0.000.
cell.getCellStyle().setDataFormat(format) will change all cells with
the same style to 0.000, so I'm left with a workbook with all 0.000.
cell.setCellStyle(cell.getCellStyle().clone().setDataFormat(format))
will change the data format for just the one cell, but creates a new
style for every cell.
What I'm doing is temporarily changing the cell style with the new
data format, checking if there is another cell style in the styles
table that matches the modified style, and if so setting the cell's
style to the matching cell style, otherwise clone the cell style.
Finally, revert the cell style back to the original data format. It
looks something like this:

CellStyle originalStyle = cell.getCellStyle();
DataFormat originalDataFormat = originalStyle.getDataFormat();
// temporarily modify originalStyle to be the desired style
originalStyle.setDataFormat(newDataFormat);
try {
    // search for existing styles that match the desired style
    CellStyle matchingStyle = null;
    for (CellStyle style : stylesSource) {
        // styles match and do not refer to the same style
        if (originalStyle.equals(style) && originalStyle != style) {
            matchingStyle = style;
            break;
        }
    }
    // only create a new style if there are no styles that match the
desired style
    if (matchingStyle != null) {
        cell.setCellStyle(matchingStyle);
    } else {
        cell.setCellStyle(originalStyle.clone());
    }
} finally {
    // revert originalStyle back to what it was before
    originalStyle.setDataFormat(originalDataFormat);
}

This consolidation is done when the styles are created rather than
when the workbook is written to file.
This seems like something that many users write into their own POI
applications if they modify cell styles and are concerned about file
size or memory limitations, but I'm not sure how to standardize this
as part of the POI API.

I agree that calling a method to consolidate existing cell styles
should be called explicitly, as it might invalidate any existing
CellStyle variables the user has access to. Might want to only make
this change when transforming the in-memory objects into XML to write
to file, and leave the in-memory objects unmodified as a result of
this consolidation. Thus, either the workbook holds a
"consolidateCellStylesOnWrite" state or the write method is passed an
optional boolean consolidateCellStyles parameter. The former seems
cleaner, though adds state into the workbook.

I'd be happy to help you write a fix.


On Tue, Sep 22, 2015 at 12:24 PM, Dominik Stadler
<dominik.stadler@gmx.at> wrote:
> Hi,
>
> I think the main reason this is not done currently is that nobody
> spent enough time to do this cleanly.
>
> Not sure if comparing the single items of CellStyle would be a good
> idea, as this is likely to change at some point with new features
> being added, so you would probably compare the XML representation, but
> this can quickly become a performance problem if done for each
> save-operation, so I would opt for something like this being an
> optional operation that the user of POI has to call. Many users care
> more about runtime of the application than the size of the resulting
> document!
>
> If you are interesting in working on something like this, please
> create a bug in our bugtracker and post patches there fore
> review/inclusion.
>
> Dominik.
>
> On Mon, Sep 21, 2015 at 3:12 PM, Murphy, Mark <murphymdev@metalexmfg.com> wrote:
>> POI 3.12
>>
>> In looking at the styles.xml file generated by POI, It appears that a lot of borders
and fills are created that are not used. Maybe POI needs a method to trim off those unused
fills and borders on save. This method would also be able to normalize the fills and borders
to remove duplicates and system colors (i.e. fgColor = 64 or bgColor = 65). I notice that
Excel does this when it saves a spreadsheet created by POI. Here is an example of the fills,
first as created by POI, second as saved by Excel (same spreadsheet):
>>
>> POI 3.12
>>        <fills count="9">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="64" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="64" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill>
>>                            <fgColor indexed="64" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <bgColor indexed="65" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> Excel 2010 Save As (same spreadsheet)
>>                 <fills count="4">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="gray125" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> Note, not even Excel was perfect as it left a duplicate fill (0, and 2). Borders
was even more dramatic POI generated 94, Excel saved 13. And now that I look at it, POI generated
42 styles while Excel kept 23. I understand that you are keeping all the fills borders and
styles as they are specified, and re-using them as you get to a final setting, but normalizing
and trimming seem to be what you are missing. That can also be applied to fonts and formats.
>>
>> Some additional thoughts, based on not yet looking at all the code. I can see three
steps to collecting all the garbage: Normalization, Duplicate removal, Orphan removal. Normalization
would involve setting default values, and removing system colors. On the fill above, the POI
generated fill would look like this after normalization:
>>
>>        <fills count="9">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>        </fills>
>>
>> Duplicate removal would remove the many duplicates left by normalization. This would
have to be carried back to the styles themselves and potentially to individual cells in the
spreadsheet that contain fill id's. After duplicate removal you would have something like
the following in the fills:
>>
>>        <fills count="4">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="none">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> And I just happen to know that my spreadsheet never uses fill "none" - 44, so that
could d be stripped out during orphan removal to bring the fill to:
>>
>>        <fills count="3">
>>               <fill>
>>                      <patternFill patternType="none" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="darkGray" />
>>               </fill>
>>               <fill>
>>                      <patternFill patternType="solid">
>>                            <fgColor indexed="44" />
>>                      </patternFill>
>>               </fill>
>>        </fills>
>>
>> I suspect that the same process could be used for borders, fonts, formats and styles
where the order of operation would be normalization and duplicate removal for fills, borders,
fonts, formats, then styles. This followed by orphan removal for styles then fills, borders,
fonts, and formats.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Mime
View raw message