pdfbox-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-4542) Suggestion: Don't load large streams completely into memory, reference them instead
Date Mon, 20 May 2019 07:53:00 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843743#comment-16843743
] 

Jonathan edited comment on PDFBOX-4542 at 5/20/19 7:52 AM:
-----------------------------------------------------------

I do think "parse on demand" is preferable, but I reckon it's a much steeper step than just
lazily loading streams. What do you think about implementing this as "opt-in"? In order to
work this requires a file path to the pdf being parsed anyway, so one could implement that
if this reference is set explicity, then the referenced streams will be used, otherwise it
would just use the default behaviour.

This change doesn't solve all problems concerning memory consumption, depending on the pdf
the maps from the dictionaries can also play a large role in memory consumption. But in particular
for simple image-container-pdfs, such as scans, this has a large impact .

Regarding encryption, I thought about just passing the handler to the stream, so then we'd
also need to pass object number and generation. I didn't find the code regarding whether the
object has been decypted before. I still do think it'd be possible, but there is no harm in
just ignoring this for now.


was (Author: rahn2):
I do think "parse on demand" is preferable, but I reckon it's a much steeper step than just
lazily loading streams. What do you think about implementing this as "opt-in"? This requires
a file path to the pdf being parsed anyway, so one could implement that if this reference
is set explicity, then it will use the referenced streams, otherwise it would just use the
default behaviour.

This change doesn't solve all problems concerning memory consumption, depending on the pdf
the Maps from the dictionaries can also be a large part of memory consumption. But in particular
for simple image-container-pdfs, such as scans, this has a large impact .

Regarding encryption, I thought about just passing the handler to the stream, so then we'd
also need to parse object number and generation. I didn't find the code regarding whether
the object has been decypted before. I still do think it'd be possible, but there is no harm
in just ignoring this for now.

> Suggestion: Don't load large streams completely into memory, reference them instead
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4542
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4542
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, PDModel
>    Affects Versions: 2.0.14
>            Reporter: Jonathan
>            Priority: Minor
>              Labels: Memory, memory, performance
>
> As we processed large PDF files, many of which containing large image streams, we wanted
to avoid loading the entire streams into memory. Instead, we implemented a mechanism that
merely referenced their location on disk.
> We eventually did this by subclassing COSStream, and then overriding COSParser.parseCOSStream(COSDictionary)
to conditionally create our stream. Here is the code, this is currently still a work-in-progress.
I've just refactored the entire mechanism.
> {code:java}
> public class ReferencedCOSStream
>    extends COSStream
> {
>    //~ Instance members ------------------------------------------------------------------------------------------------------------------------------
>    boolean isReference = false;
>    File    reference   = null;
>    long    offset      = -1;
>    long    length      = -1;
>    //~ Constructors ----------------------------------------------------------------------------------------------------------------------------------
>    private ReferencedCOSStream(final ScratchFile scratchFile)
>    {
>       super(scratchFile);
>    }
>    //~ Methods ---------------------------------------------------------------------------------------------------------------------------------------
>    public static ReferencedCOSStream createFromCOSStream(final COSStream stream)
>    {
>       final ReferencedCOSStream out = new ReferencedCOSStream(stream.getScratchFile());
>       for (final Map.Entry<COSName, COSBase> entry : stream.entrySet())
>       {
>          out.setItem(entry.getKey(), entry.getValue());
>       }
>       return out;
>    }
>    @Override
>    public COSInputStream createInputStream(final DecodeOptions options)
>       throws IOException
>    {
>       if (this.isReference)
>       {
>          final InputStream in = new SlicedFileInputStream(this.reference, this.offset,
this.length);
>          return COSInputStream.create(getFilterList(), this, in, this.getScratchFile(),
options);
>       }
>       else
>       {
>          return super.createInputStream(options);
>       }
>    }
>    @Override
>    public InputStream createRawInputStream()
>       throws IOException
>    {
>       if (this.isReference)
>       {
>          return new SlicedFileInputStream(this.reference, this.offset, this.length);
>       }
>       else
>       {
>          return super.createRawInputStream();
>       }
>    }
>    @Override
>    public OutputStream createOutputStream(final COSBase filters)
>       throws IOException
>    {
>       this.isReference = false;
>       return super.createOutputStream(filters);
>    }
>    @Override
>    public OutputStream createRawOutputStream()
>       throws IOException
>    {
>       this.isReference = false;
>       return super.createRawOutputStream();
>    }
>    public void setReference(final File file,
>                             final long offset,
>                             final long length)
>    {
>       this.isReference = true;
>       this.reference   = file;
>       this.offset      = offset;
>       this.length      = length;
>       this.setLong(COSName.LENGTH, length);
>    }
>    //~ Inner Classes ---------------------------------------------------------------------------------------------------------------------------------
>    private class SlicedFileInputStream
>       extends FileInputStream
>    {
>       //~ Instance members ---------------------------------------------------------------------------------------------------------------------------
>       private long       index;
>       private final long length;
>       //~ Constructors -------------------------------------------------------------------------------------------------------------------------------
>       public SlicedFileInputStream(final File file,
>                                    final long offset,
>                                    final long length)
>          throws FileNotFoundException, IOException
>       {
>          super(file);
>          this.length = length;
>          this.skip(offset);
>          this.index = 0;
>       }
>       //~ Methods ------------------------------------------------------------------------------------------------------------------------------------
>       @Override
>       public int available()
>          throws IOException
>       {
>          final long remaining = length - index;
>          if (remaining < 0)
>          {
>             return 0;
>          }
>          return (int)remaining;
>       }
>       @Override
>       public int read(final byte[] b)
>          throws IOException
>       {
>          final int remaining = this.available();
>          final int len       = (remaining < b.length) ? remaining :
b.length;
>          index += len;
>          if (len > 0)
>          {
>             return super.read(b, 0, len);
>          }
>          else
>          {
>             return -1;
>          }
>       }
>       @Override
>       public int read(final byte[] b,
>                       final int    off,
>                       int          len)
>          throws IOException
>       {
>          final int remaining = this.available();
>          len   =  (remaining < len) ? remaining : len;
>          index += len;
>          if (len > 0)
>          {
>             return super.read(b, 0, len);
>          }
>          else
>          {
>             return -1;
>          }
>       }
>       @Override
>       public long skip(final long n)
>          throws IOException
>       {
>          index += n;
>          return super.skip(n);
>       }
>       @Override
>       public FileChannel getChannel()
>       {
>          throw new UnsupportedOperationException("Obtaining a FileChannel is
not supported because a correct offset cannot be ensured.");
>       }
>    }
> }
> {code}
> {code:java}
>    @Override
>    protected COSStream parseCOSStream(final COSDictionary dic)
>       throws IOException
>    {
>       /*
>        * This needs to be dic.getItem because when we are parsing, the underlying
object might still be null.
>        */
>       final COSNumber streamLengthObj = getLength(dic.getItem(COSName.LENGTH), dic.getCOSName(COSName.TYPE));
>       COSStream       stream          = document.createCOSStream(dic);
>       // read 'stream'; this was already tested in parseObjectsDynamically()
>       readString();
>       skipWhiteSpaces();
>       if (streamLengthObj == null)
>       {
>          if (isLenient)
>          {
>             LOG.warn("The stream doesn't provide any stream length, using
fallback readUntilEnd, at offset " + source.getPosition());
>          }
>          else
>          {
>             throw new IOException("Missing length for stream.");
>          }
>       }
>       if ((streamLengthObj != null) && (streamLengthObj.longValue() >=
1024))
>       {
>          final long                streamBegPos = source.getPosition();
>          final ReferencedCOSStream refStream    = ReferencedCOSStream.createFromCOSStream(stream);
>          try
>          {
>             readValidStream(null, streamLengthObj);
>          }
>          finally
>          {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>          }
>          refStream.setReference(new File(reference), streamBegPos, source.getPosition()
- streamBegPos);
>          stream = refStream;
>       }
>       else
>       {
>          try(final OutputStream out = stream.createRawOutputStream())
>          {
>             if ((streamLengthObj != null) && validateStreamLength(streamLengthObj.longValue()))
>             {
>                readValidStream(out, streamLengthObj);
>             }
>             else
>             {
>                readUntilEndStream(new EndstreamOutputStream(out));
>             }
>          }
>          finally
>          {
>             stream.setItem(COSName.LENGTH, streamLengthObj);
>          }
>       }
>       final String endStream = readString();
>       if (endStream.equals("endobj") && isLenient)
>       {
>          LOG.warn("stream ends with 'endobj' instead of 'endstream' at offset
" + source.getPosition());
>          // avoid follow-up warning about missing endobj
>          source.rewind(ENDOBJ.length);
>       }
>       else if ((endStream.length() > 9) && isLenient && endStream.substring(0,
9).equals(ENDSTREAM_STRING))
>       {
>          LOG.warn("stream ends with '" + endStream + "' instead of 'endstream'
at offset " + source.getPosition());
>          // unread the "extra" bytes
>          source.rewind(endStream.substring(9).getBytes(ISO_8859_1).length);
>       }
>       else if (!endStream.equals(ENDSTREAM_STRING))
>       {
>          throw new IOException("Error reading stream, expected='endstream' actual='"
+ endStream + "' at offset " + source.getPosition());
>       }
>       return stream;
>    }
> {code}
> The class ReferencedCOSStream exposes the underlying data in exactly the same way as
it does COSStream, but instead of keeping the storage in memory, it always opens a FileInputStream
to retrieve the content. SlicedFileInputStream basically wraps around a FileInputStream and
tries to imitate the behaviour of an InputStream for this specific chunk of data.
> I needed to expose some APIs for these classes, the method ReferencedCOSStream.createFromCOSStream(COSStream)
would better be located in PDDocument and create the stream directly, I just didn't want to
also modify PDDocument.
> Right now, encrypted streams are currently loaded into memory by the SecurityHandler
directly after creation. If you want to accept this proposal, it might make sense to move
the decryption handling also into COSStream and ReferencedCOSStream and perform it upon request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Mime
View raw message