hadoop-common-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Compressor setInput input permanence
Date Mon, 05 Dec 2011 06:51:21 GMT
Hi Tim,

My guess is that this contract isn't explicitly documented anywhere.
But the good news is that the set of implementors and users of this
API is fairly well contained.

I'd propose you do the following:
- Look for any dependent projects which use the Compressor API
directly. I know HBase does. I believe Avro does. Hive and Pig might.
Accumulo probably does. A Google Code Search or GitHub search for
"import org.apache.hadoop.io.compress" would probably give a pretty
exhaustive list.
- Throughout those, look to see if they all maintain the buffer
(that is, leave it unmodified) between setInput and compress.
- If so, file a JIRA to document this as part of the compression API
javadoc, and then we'll be more explicit about it from now on?
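
To make concrete what "maintain the buffer between setInput and compress"
means for a caller, here is a minimal sketch. It does not use Hadoop's
classes: LazyToyCompressor is a hypothetical stand-in whose setInput and
compress only mirror the shape of the real methods, and its "compression"
is an identity copy. It assumes the deferred-read behaviour described in
the quoted message below (keep the reference, read the data later), which
is why overwriting the buffer too early changes the output.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DeferredCopyDemo {

    /** Toy stand-in (NOT Hadoop's Compressor): keeps the caller's array and reads it later. */
    static final class LazyToyCompressor {
        private byte[] userBuf;
        private int userOff;
        private int userLen;

        // Assumption for illustration: only the reference, offset, and
        // length are stored; no bytes are copied at this point.
        void setInput(byte[] b, int off, int len) {
            userBuf = b;
            userOff = off;
            userLen = len;
        }

        // Identity "compression": the caller's array is first read here.
        int compress(byte[] out, int off, int len) {
            int n = Math.min(len, userLen);
            System.arraycopy(userBuf, userOff, out, off, n);
            userOff += n;
            userLen -= n;
            return n;
        }
    }

    /** Returns what the toy compressor emits, optionally overwriting the input too early. */
    static String run(boolean reuseBufferEarly) {
        byte[] buf = "hello".getBytes(StandardCharsets.US_ASCII);
        LazyToyCompressor c = new LazyToyCompressor();
        c.setInput(buf, 0, buf.length);
        if (reuseBufferEarly) {
            // Caller recycles the buffer before compress() has consumed it.
            Arrays.fill(buf, (byte) 'X');
        }
        byte[] out = new byte[buf.length];
        int n = c.compress(out, 0, out.length);
        return new String(out, 0, n, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints "hello"
        System.out.println(run(true));  // prints "XXXXX"
    }
}
```

A caller that behaves like run(true) is the kind of usage the audit above
would need to find before the contract is written into the javadoc.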


On Sat, Dec 3, 2011 at 12:18 AM, Tim Broberg <tbroberg@yahoo.com> wrote:
> The question is, how long can a Compressor count on the user buffer to
> stick around after a call to setInput()?
>
> The Compressor object has a method, setInput, whose inputs are an array
> reference, an offset, and a length.
>
> I would expect that this input would no longer be guaranteed to persist
> after the setInput call returns.
>
> ...but in ZlibCompressor and SnappyCompressor, when there is no buffer
> room for len bytes, the Compressor makes a copy of the reference, offset,
> and length, clears the needsInput condition, and returns waiting for a
> call to compress() to unload the buffers through the compressor. The
> Compressor implementations count on the data to persist after setInput
> returns until compress() is called.
>
> So, the data persist after the call. Does all such data persist?
>
> In theory, could a Compressor avoid a copy by just collecting references
> to each input user buffer passed in and then sending all these references
> to the compression library when compress() is called?
>
> ...or do these user buffers get reused before that time?
>
> By keeping references to these buffers, am I preventing them from getting
> garbage collected and potentially soaking up large amounts of memory?
>
> Where is the persistence of the contents of these user buffers supposed
> to be documented?
>
> TIA,
>     - Tim.

Todd Lipcon
Software Engineer, Cloudera
