Subject: Re: Distributed Cache For 100MB+ Data Structure
From: Kyle Moses <kmoses@cs.duke.edu>
Date: Sat, 13 Oct 2012 10:46:46 -0400
To: user@hadoop.apache.org

Chris,
Thanks for the suggestion on serializing the radix tree and your thoughts on the memory issue.  I'm planning to test a few different solutions and will post another reply if the results prove interesting.

Kyle

On 10/11/2012 1:52 PM, Chris Nauroth wrote:
Hello Kyle,

Regarding the setup time of the radix tree, is it possible to precompute the radix tree before job submission time, then create a serialized representation (perhaps just Java object serialization), and send the serialized form through distributed cache?  Then, each reducer would just need to deserialize during setup() instead of recomputing the full radix tree for every reducer task.  That might save time.
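A rough sketch of what I mean, using plain Java object serialization (Ipv6RadixTree stands in for your class and would need to implement java.io.Serializable; the HDFS path is arbitrary):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    // At submission time: build the tree once, write it to HDFS, and
    // register the file with the distributed cache.
    public static void shipTree(Job job, Ipv6RadixTree tree) throws IOException {
      FileSystem fs = FileSystem.get(job.getConfiguration());
      Path cached = new Path("/tmp/radix-tree.ser");
      ObjectOutputStream out = new ObjectOutputStream(fs.create(cached));
      out.writeObject(tree);
      out.close();
      DistributedCache.addCacheFile(cached.toUri(), job.getConfiguration());
    }

    // In the reducer: deserialize once per task instead of rebuilding.
    @Override
    protected void setup(Context context) throws IOException {
      Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      ObjectInputStream in = new ObjectInputStream(
          new FileInputStream(local[0].toString()));
      try {
        tree = (Ipv6RadixTree) in.readObject();
      } catch (ClassNotFoundException e) {
        throw new IOException("could not deserialize radix tree", e);
      } finally {
        in.close();
      }
    }

Deserializing should be considerably cheaper than re-inserting every prefix, though it's worth measuring before committing to it.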

Regarding the memory consumption, when I've run into a situation like this, I've generally solved it by caching the data in a separate process and using some kind of IPC from the reducers to access it.  memcached is one example, though that's probably not an ideal fit for this data structure.  I'm not aware of an equivalent solution directly in Hadoop and would be curious to hear from others on the topic.
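If you did go that route, the awkward part is that a key/value store can't answer the longest-prefix-match queries a radix tree exists for.  One workaround is to flatten the tree into per-prefix entries and probe successively shorter prefixes; a sketch with the spymemcached client (prefixKey and the whole key scheme are made up for illustration):

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    MemcachedClient cache =
        new MemcachedClient(new InetSocketAddress("localhost", 11211));

    // Longest-prefix match over a flat key/value store: probe from the
    // most specific prefix (/128) down to the least specific (/0).
    Object lookup(String addr) {
      for (int len = 128; len >= 0; len--) {
        Object hit = cache.get(prefixKey(addr, len));  // prefixKey: hypothetical
        if (hit != null) {
          return hit;
        }
      }
      return null;
    }

That's up to 129 round trips per lookup in the worst case, which is a large part of why I say it's probably not an ideal fit here.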

Thanks,
--Chris

On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <kmoses@cs.duke.edu> wrote:
Problem Background:
I have a Hadoop MapReduce program that uses an IPv6 radix tree to provide auxiliary input during the reduce phase of the second job in its workflow, but doesn't need the data at any other point.
It seems pretty straightforward to use the distributed cache to build this data structure inside each reducer in the setup() method (sketched below).
This solution is functional, but it ends up using a large amount of memory when 3 or more reducers run on the same node, and the setup time of the radix tree is non-trivial.
Additionally, the IPv6 version of the structure is quite a bit larger in memory.
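For reference, my current setup() looks roughly like this (Ipv6RadixTree is my own class, and the cache file format, one prefix per line, is mine as well):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    @Override
    protected void setup(Context context) throws IOException {
      Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      BufferedReader reader = new BufferedReader(new FileReader(local[0].toString()));
      tree = new Ipv6RadixTree();
      String line;
      while ((line = reader.readLine()) != null) {
        tree.insert(line);  // rebuilds the whole tree in every reduce task
      }
      reader.close();
    }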

Question:
Is there a "good" way to share this data structure across all reducers on the same node within the Hadoop framework?

Initial Thoughts:
It seems like this might be possible by altering the task JVM reuse parameters, but from what I have read, this would also affect map tasks, and I'm concerned about drawbacks and side effects.
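Concretely, the knob I was looking at is the per-job JVM reuse setting, combined with a static field so that tasks running back-to-back in a reused JVM can skip the rebuild (buildTreeFromCache stands in for the setup logic above):

    // -1 = reuse each task JVM for an unlimited number of this job's tasks.
    // Note that this affects map tasks as well as reduce tasks.
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

    // In the reducer, a static reference survives across tasks that run
    // sequentially in the same reused JVM:
    private static Ipv6RadixTree sharedTree;

    @Override
    protected void setup(Context context) throws IOException {
      if (sharedTree == null) {
        sharedTree = buildTreeFromCache(context);  // hypothetical helper
      }
      tree = sharedTree;
    }

Even then, concurrently running reducer slots still get separate JVMs, so this would only help across tasks that share a slot.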

Thanks for your help!

