Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of michael_segel@hotmail.com
 designates 65.55.111.107 as permitted sender)
Message-ID: <BLU0-SMTP160267424EB12E6A6FFB75C8F730@phx.gbl>
From: Michael Segel <michael_segel@hotmail.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_2B059174-793E-48BE-8D00-FED668C1316F"
MIME-Version: 1.0 (Mac OS X Mail 6.1 \(1498\))
Subject: Re: Distributed Cache For 100MB+ Data Structure
Date: Sat, 13 Oct 2012 12:53:32 -0500
References: <5076FE16.5040001@cs.duke.edu>
 <CABCYYb9CZPKQwnvYh+k8yr9CT_D+WREQurjjsOeQKEUCMrgLAg@mail.gmail.com>
 <50797ED6.2040209@cs.duke.edu>
To: user@hadoop.apache.org
In-Reply-To: <50797ED6.2040209@cs.duke.edu>

--Apple-Mail=_2B059174-793E-48BE-8D00-FED668C1316F
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="iso-8859-1"

Build and store the tree in some sort of globally accessible space?=20

Like HBase, or HDFS?

On Oct 13, 2012, at 9:46 AM, Kyle Moses <kmoses@cs.duke.edu> wrote:

> Chris,
> Thanks for the suggestion on serializing the radix tree and your =
thoughts on the memory issue.  I'm planning to test a few different =
solutions and will post another reply if the results prove interesting.
>=20
> Kyle
>=20
> On 10/11/2012 1:52 PM, Chris Nauroth wrote:
>> Hello Kyle,
>>=20
>> Regarding the setup time of the radix tree, is it possible to =
precompute the radix tree before job submission time, then create a =
serialized representation (perhaps just Java object serialization), and =
send the serialized form through distributed cache?  Then, each reducer =
would just need to deserialize during setup() instead of recomputing the =
full radix tree for every reducer task.  That might save time.
>>=20
>> Regarding the memory consumption, when I've run into a situation like =
this, I've generally solved it by caching the data in a separate process =
and using some kind of IPC from the reducers to access it.  memcache is =
one example, though that's probably not an ideal fit for this data =
structure.  I'm aware of no equivalent solution directly in Hadoop and =
would be curious to hear from others on the topic.
>>=20
>> Thanks,
>> --Chris
>>=20
>> On Thu, Oct 11, 2012 at 10:12 AM, Kyle Moses <kmoses@cs.duke.edu> =
wrote:
>> Problem Background:
>> I have a Hadoop MapReduce program that uses an IPv6 radix tree to =
provide auxiliary input during the reduce phase of the second job in =
its workflow, but doesn't need the data at any other point.
>> It seems pretty straightforward to use the distributed cache to =
build this data structure inside each reducer in the setup() method.
>> This solution is functional, but ends up using a large amount of =
memory if I have 3 or more reducers running on the same node, and the =
setup time of the radix tree is non-trivial.
>> Additionally, the IPv6 version of the structure is quite a bit larger =
in memory.
>>=20
>> Question:
>> Is there a "good" way to share this data structure across all =
reducers on the same node within the Hadoop framework?
>>=20
>> Initial Thoughts:
>> It seems like this might be possible by altering the Task JVM Reuse =
parameters, but from what I have read this would also affect map tasks =
and I'm concerned about drawbacks/side-effects.
>>=20
>> Thanks for your help!
>>=20
>=20
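
A minimal sketch of the precompute-and-serialize idea Chris describes
above, using plain Java object serialization. BitTrie is a hypothetical
stand-in for the radix tree (a simple binary trie over address bits),
and save()/load() are illustrative names, not Hadoop APIs. In a real
job, the file passed to load() would be the local copy that the
distributed cache places on each node, read from the reducer's setup().

```java
import java.io.*;
import java.net.InetAddress;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the radix tree: a binary trie on bits. */
final class BitTrie implements Serializable {
    private static final long serialVersionUID = 1L;

    private static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        Node zero, one;
        String value; // payload at the end of a prefix, or null
    }

    private final Node root = new Node();

    /** Stores value under the first prefixLen bits of addr. */
    void put(byte[] addr, int prefixLen, String value) {
        Node n = root;
        for (int i = 0; i < prefixLen; i++) {
            boolean bit = ((addr[i / 8] >> (7 - i % 8)) & 1) == 1;
            if (bit) {
                if (n.one == null) n.one = new Node();
                n = n.one;
            } else {
                if (n.zero == null) n.zero = new Node();
                n = n.zero;
            }
        }
        n.value = value;
    }

    /** Longest-prefix match; null when no stored prefix covers addr. */
    String lookup(byte[] addr) {
        Node n = root;
        String best = n.value;
        for (int i = 0; i < addr.length * 8 && n != null; i++) {
            boolean bit = ((addr[i / 8] >> (7 - i % 8)) & 1) == 1;
            n = bit ? n.one : n.zero;
            if (n != null && n.value != null) best = n.value;
        }
        return best;
    }
}

final class TrieCacheSketch {
    /** Run once, before job submission: write the built tree out. */
    static void save(BitTrie trie, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(trie);
        }
    }

    /** What each reducer's setup() would do with the cached file. */
    static BitTrie load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (BitTrie) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        BitTrie trie = new BitTrie();
        byte[] net = InetAddress.getByName("2001:db8::").getAddress();
        trie.put(net, 32, "doc-net");

        Path file = Files.createTempFile("radix", ".ser");
        save(trie, file);
        BitTrie copy = load(file);

        byte[] host = InetAddress.getByName("2001:db8::1").getAddress();
        System.out.println(copy.lookup(host)); // prints doc-net
        Files.delete(file);
    }
}
```

This only addresses the setup time. Each reducer JVM still holds its
own deserialized copy, so the per-node memory use is unchanged.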


--Apple-Mail=_2B059174-793E-48BE-8D00-FED668C1316F
Content-Transfer-Encoding: 7bit
Content-Type: text/html; charset="iso-8859-1"

<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Build and store the tree in some sort of globally accessible space?&nbsp;<div><br></div><div>Like HBase, or HDFS?</div><div><br><div><div>On Oct 13, 2012, at 9:46 AM, Kyle Moses &lt;<a href="mailto:kmoses@cs.duke.edu">kmoses@cs.duke.edu</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite">
  
    <meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
  
  <div bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Chris,<br>
      Thanks for the suggestion on serializing the radix tree and your
      thoughts on the memory issue.&nbsp; I'm planning to test a few
      different solutions and will post another reply if the results
      prove interesting.<br>
      <br>
      Kyle<br>
      <br>
      On 10/11/2012 1:52 PM, Chris Nauroth wrote:<br>
    </div>
    <blockquote cite="mid:CABCYYb9CZPKQwnvYh+k8yr9CT_D+WREQurjjsOeQKEUCMrgLAg@mail.gmail.com" type="cite">Hello Kyle,
      <div><br>
      </div>
      <div>Regarding the setup time of the radix tree, is it possible to
        precompute the radix tree before job submission time, then
        create a serialized representation (perhaps just Java object
        serialization), and send the serialized form through distributed
        cache? &nbsp;Then, each reducer would just need to deserialize during
        setup() instead of recomputing the full radix tree for every
        reducer task. &nbsp;That might save time.</div>
      <div><br>
      </div>
      <div>Regarding the memory consumption, when I've run into a
        situation like this, I've generally solved it by caching the
        data in a separate process and using some kind of IPC from the
        reducers to access it. &nbsp;memcache is one example, though that's
        probably not an ideal fit for this data structure. &nbsp;I'm aware of
        no equivalent solution directly in Hadoop and would be curious
        to hear from others on the topic.</div>
      <div><br>
      </div>
      <div>Thanks,</div>
      <div>--Chris</div>
      <div>
        <br>
        <div class="gmail_quote">On Thu, Oct 11, 2012 at 10:12 AM, Kyle
          Moses <span dir="ltr">&lt;<a moz-do-not-send="true" href="mailto:kmoses@cs.duke.edu" target="_blank">kmoses@cs.duke.edu</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            Problem Background:<br>
            I have a Hadoop MapReduce program that uses an IPv6 radix
            tree to provide auxiliary input during the reduce phase of
            the second job in its workflow, but doesn't need the data
            at any other point.<br>
            It seems pretty straightforward to use the distributed
            cache to build this data structure inside each reducer in
            the setup() method.<br>
            This solution is functional, but ends up using a large
            amount of memory if I have 3 or more reducers running on the
            same node, and the setup time of the radix tree is
            non-trivial.<br>
            Additionally, the IPv6 version of the structure is quite a
            bit larger in memory.<br>
            <br>
            Question:<br>
            Is there a "good" way to share this data structure across
            all reducers on the same node within the Hadoop framework?<br>
            <br>
            Initial Thoughts:<br>
            It seems like this might be possible by altering the Task
            JVM Reuse parameters, but from what I have read this would
            also affect map tasks and I'm concerned about
            drawbacks/side-effects.<br>
            <br>
            Thanks for your help!<br>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </div>

</blockquote></div><br></div></body></html>
--Apple-Mail=_2B059174-793E-48BE-8D00-FED668C1316F--