crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Preventing Cleanup of PCollections
Date Sun, 04 Oct 2015 03:34:24 GMT
On Saturday, October 3, 2015, Everett Anderson <everett@nuna.com> wrote:

>
>
> On Thu, Oct 1, 2015 at 9:28 PM, Josh Wills <jwills@cloudera.com> wrote:
>
>> So that approach (hacky as it is) will work, and is really the only
>> obvious way that the planner can know which PCollections should be kept
>> around and which ones are okay to delete. I would expect it to work
>> indefinitely in future versions, and I'm always open to API enhancements
>> that make this sort of logic easier to express.
>>
>
> Two more questions --
>
> 1) In general in the Crunch programming model, should references to
> collections remain viable across calls to run()?
>

Yes, although some recomputation may happen depending on which PCollections
were materialized/written on the previous run.
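
As a rough illustration of that behavior, here is a toy model in plain Java.
It is not Crunch code: ToyPipeline, checkpoint(), get(), and timesComputed()
are names made up for this sketch, and it only assumes what is described in
this thread, namely that cleanup() clears temporary outputs while explicitly
written outputs survive.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Toy model only: none of these names are Crunch APIs.
class ToyPipeline {
    // Outputs held in the pipeline "temp directory"; wiped by cleanup().
    private final Map<String, Integer> tempOutputs = new HashMap<>();
    // Outputs explicitly written to a target; cleanup() leaves them alone.
    private final Map<String, Integer> writtenOutputs = new HashMap<>();
    // Names of collections that should be written explicitly when computed.
    private final Set<String> checkpointed = new HashSet<>();
    // How many times each collection has been computed from scratch.
    private final Map<String, Integer> computeCount = new HashMap<>();

    // Mark a collection so that its output is written outside the temp area.
    void checkpoint(String name) {
        checkpointed.add(name);
    }

    // Return the collection's value, reusing a stored output when one exists
    // and recomputing (and counting the recomputation) otherwise.
    int get(String name, Supplier<Integer> compute) {
        if (writtenOutputs.containsKey(name)) {
            return writtenOutputs.get(name);
        }
        if (tempOutputs.containsKey(name)) {
            return tempOutputs.get(name);
        }
        computeCount.merge(name, 1, Integer::sum);
        int value = compute.get();
        if (checkpointed.contains(name)) {
            writtenOutputs.put(name, value);
        } else {
            tempOutputs.put(name, value);
        }
        return value;
    }

    // Drop every temporary output; written outputs are untouched.
    void cleanup() {
        tempOutputs.clear();
    }

    int timesComputed(String name) {
        return computeCount.getOrDefault(name, 0);
    }

    public static void main(String[] args) {
        ToyPipeline p = new ToyPipeline();
        p.checkpoint("kept");
        p.get("kept", () -> 1);
        p.get("scratch", () -> 2);
        p.cleanup();
        // Both references are still usable after cleanup().
        p.get("kept", () -> 1);
        p.get("scratch", () -> 2);
        System.out.println("kept computed " + p.timesComputed("kept") + " time(s)");
        System.out.println("scratch computed " + p.timesComputed("scratch") + " time(s)");
    }
}
```

In this model both references stay usable after cleanup(), but "scratch" is
computed twice while "kept" is computed once, since only "kept" had its
output written somewhere that cleanup() does not touch.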


>
> 2) How does this solution relate to something like
>     table.cache(CachingOptions.builder().useDisk(true).build());
>
> ?
>
> Using cache() seems natural here, but in the current MRPipeline I think
> cache() has maybe 3 branches depending on the input table, and one of them
> results in an intermediate output in the regular temp directory.
>

Yeah, cache() in MR is really a shorthand for materialize(). The
CachingOptions only kick in when there is some flexibility in the caching
mechanism (e.g., for Spark.)


>>
>> J
>>
>> On Thu, Oct 1, 2015 at 3:28 PM, Everett Anderson <everett@nuna.com> wrote:
>>
>>> (Context: This is related to the 'LeaseExpiredExceptions and temp side
>>> effect files' thread.)
>>>
>>> In particular, the workaround would mean that we'd keep using the same
>>> PCollection/PTable references after a call to run()/cleanup(), which feels
>>> weird.
>>>
>>> Example:
>>>
>>> PTable liveTable = ...;
>>> liveTable = liveTable.parallelDo(...);
>>>
>>> // Write the table somewhere we know won't get cleaned up,
>>> // which changes its internal Target.
>>> liveTable.write(To.sequenceFile(tempPath),
>>>                 Target.WriteMode.CHECKPOINT);
>>>
>>> // Call run() and cleanup() to flush old temporary data.
>>> pipeline.run();
>>> pipeline.cleanup(false);
>>>
>>> // Keep using liveTable since we know it'll work under the
>>> // covers because its Target is a sequence file that wasn't
>>> // cleaned up.
>>> liveTable = liveTable.parallelDo(...);
>>>
>>> On Thu, Oct 1, 2015 at 10:54 AM, Jeff Quinn <jeff@nuna.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Our crunch pipeline has suffered from ballooning HDFS usage which
>>>> spikes during the course of the job. Our solution has been to call
>>>> Pipeline.run() and Pipeline.cleanup() between the major operations, hoping
>>>> to achieve periodic "garbage collection" of the temporary outputs that are
>>>> produced during the course of the pipeline.
>>>>
>>>> The problem is some PCollections from one operation will need to be
>>>> used as input to subsequent operations, and cleanup() seems to blow away
>>>> ALL PCollections that have not been explicitly written to a target (from
>>>> reading the source, it seems to just blow away the pipeline temp directory).
>>>>
>>>> Our workaround has been to explicitly call .write on the PCollections
>>>> we know we will need across calls to run()/cleanup(). This seems to work
>>>> as far as I can tell, but it feels hacky. Is there a better or more supported
>>>> way to handle this, and is this approach likely to fail in future crunch
>>>> versions?
>>>>
>>>> Thanks!
>>>>
>>>> Jeff
>>>>
>>>> *DISCLAIMER:* The contents of this email, including any attachments,
>>>> may contain information that is confidential, proprietary in nature,
>>>> protected health information (PHI), or otherwise protected by law from
>>>> disclosure, and is solely for the use of the intended recipient(s). If you
>>>> are not the intended recipient, you are hereby notified that any use,
>>>> disclosure or copying of this email, including any attachments, is
>>>> unauthorized and strictly prohibited. If you have received this email in
>>>> error, please notify the sender of this email. Please delete this and all
>>>> copies of this email from your system. Any opinions either expressed or
>>>> implied in this email and all attachments, are those of its author only,
>>>> and do not necessarily reflect those of Nuna Health, Inc.
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
