hadoop-common-dev mailing list archives

From Tim Broberg <Tim.Brob...@exar.com>
Subject RE: Compressor setInput input permanence
Date Mon, 05 Dec 2011 17:35:35 GMT
Thanks for the response, Todd.

I'll crawl the trunk for compressor / decompressor references today. If you think of any other
versions that should be scanned, please chime in.

    - Tim.

________________________________________
From: Todd Lipcon [todd@cloudera.com]
Sent: Sunday, December 04, 2011 10:51 PM
To: common-dev@hadoop.apache.org; Tim Broberg
Subject: Re: Compressor setInput input permanence

Hi Tim,

My guess is that this contract isn't explicitly documented anywhere.
But the good news is that the set of implementors and users of this
API is fairly well contained.

I'd propose you do the following:
- Look for any dependent projects which use the Compressor API
directly. I know HBase does. I believe Avro does. Hive and Pig might.
Accumulo probably does. A google code search or github search for
"import org.apache.hadoop.io.compress" would probably give a pretty
exhaustive list.
- Throughout those, look and see whether they all maintain the buffer
between setInput and compress.
- If so, file a JIRA to document this as part of the compression API
javadoc, and then we'll be more explicit about it from now on?
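
For trees that are already checked out locally, the search step above could
be sketched in Java roughly as follows. This is only an illustration: the
class name CompressApiScan and its methods are made up for this message, and
the match is a plain substring test, so it also hits commented-out imports
and misses code that uses fully qualified names without an import.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop: lists .java files under a
// directory that import anything from org.apache.hadoop.io.compress.
class CompressApiScan {
    static final String NEEDLE = "import org.apache.hadoop.io.compress";

    // Pure substring check on the text of one source file.
    static boolean usesCompressApi(String source) {
        return source.contains(NEEDLE);
    }

    // Walks the tree under root and returns the matching files, sorted.
    static List<Path> findUsers(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(p -> p.toString().endsWith(".java"))
                .filter(Files::isRegularFile)
                .filter(CompressApiScan::fileUsesCompressApi)
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static boolean fileUsesCompressApi(Path file) {
        try {
            // ISO-8859-1 maps every byte to a char, so unusual file
            // encodings cannot abort the scan with a decoding error.
            byte[] bytes = Files.readAllBytes(file);
            return usesCompressApi(new String(bytes, StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findUsers(root).forEach(System.out::println);
    }
}
```

Running it with the root of a checkout as the only argument prints one
matching path per line; each hit still has to be read by hand to see how the
buffer is treated between setInput and compress.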

-Todd


On Sat, Dec 3, 2011 at 12:18 AM, Tim Broberg <tbroberg@yahoo.com> wrote:
> The question is, how long can a Compressor count on the user buffer to
> stick around after a call to setInput()?
>
> The Compressor object has a method, setInput whose inputs are an array
> reference, an offset and a length.
>
> I would expect that this input would no longer be guaranteed to persist
> after the setInput call returns.
>
> ...but in ZlibCompressor and SnappyCompressor, when there is no buffer
> room for len bytes, the Compressor makes a copy of the reference, offset,
> and length, clears the needsInput condition, and returns waiting for a
> call to compress() to unload the buffers through the compressor. The
> Compressor implementations count on the data to persist after setInput
> returns until compress() is called.
>
> So, the data persist after the call. Does all such data persist?
>
> In theory, could a Compressor avoid a copy by just collecting references
> to each input user buffer passed in and then sending all these references
> to the compression library when compress() is called?
>
> ...or do these user buffers get reused before that time?
>
> By keeping references to these buffers, am I preventing them from getting
> garbage collected and potentially soaking up large amounts of memory?
>
> Where is the persistence of the contents of these user buffers supposed
> to be documented?
>
> TIA,
>     - Tim.
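
To make the hazard in the quoted question concrete, here is a minimal,
self-contained sketch. It does not use the real Hadoop classes:
RetainingSink and CopyingSink are hypothetical stand-ins for a Compressor
whose setInput() either keeps only the reference, offset and length, or
copies the bytes, and drain() stands in for compress() without doing any
actual compression. The caller overwrites its buffer between the two calls,
which is exactly the reuse the question asks about.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only; none of these names exist in Hadoop.
class SetInputPermanenceDemo {

    // Keeps a reference to the caller's array, like the no-room path
    // described for ZlibCompressor and SnappyCompressor above.
    static final class RetainingSink {
        private byte[] userBuf;
        private int userOff;
        private int userLen;

        void setInput(byte[] b, int off, int len) {
            userBuf = b;          // no copy: caller must not touch b yet
            userOff = off;
            userLen = len;
        }

        // Stand-in for compress(): reads whatever the array holds now.
        byte[] drain() {
            byte[] out = Arrays.copyOfRange(userBuf, userOff, userOff + userLen);
            userBuf = null;       // drop the reference so it can be collected
            return out;
        }
    }

    // Copies the bytes immediately, so later reuse of the array is harmless.
    static final class CopyingSink {
        private byte[] copy = new byte[0];

        void setInput(byte[] b, int off, int len) {
            copy = Arrays.copyOfRange(b, off, off + len);
        }

        byte[] drain() {
            byte[] out = copy;
            copy = new byte[0];
            return out;
        }
    }

    // Overwrites buf in place, as a caller recycling its buffer would.
    private static void reuse(byte[] buf) {
        byte[] next = "OTHER".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(next, 0, buf, 0, buf.length);
    }

    static String retainingResult() {
        byte[] buf = "first".getBytes(StandardCharsets.US_ASCII);
        RetainingSink sink = new RetainingSink();
        sink.setInput(buf, 0, buf.length);
        reuse(buf);               // happens before drain()
        return new String(sink.drain(), StandardCharsets.US_ASCII);
    }

    static String copyingResult() {
        byte[] buf = "first".getBytes(StandardCharsets.US_ASCII);
        CopyingSink sink = new CopyingSink();
        sink.setInput(buf, 0, buf.length);
        reuse(buf);
        return new String(sink.drain(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("retaining: " + retainingResult());
        System.out.println("copying: " + copyingResult());
    }
}
```

The retaining variant prints "retaining: OTHER" and the copying variant
prints "copying: first". A reference-collecting Compressor is therefore only
safe if every caller leaves its buffer alone until compress() has consumed
it, and each retained reference keeps its array reachable until it is
dropped.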



--
Todd Lipcon
Software Engineer, Cloudera

