Return-Path: X-Original-To: apmail-accumulo-user-archive@www.apache.org Delivered-To: apmail-accumulo-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2F4199307 for ; Thu, 12 Apr 2012 16:58:53 +0000 (UTC) Received: (qmail 70054 invoked by uid 500); 12 Apr 2012 16:58:53 -0000 Delivered-To: apmail-accumulo-user-archive@accumulo.apache.org Received: (qmail 70036 invoked by uid 500); 12 Apr 2012 16:58:53 -0000 Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@accumulo.apache.org Delivered-To: mailing list user@accumulo.apache.org Received: (qmail 70009 invoked by uid 99); 12 Apr 2012 16:58:53 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 12 Apr 2012 16:58:53 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [206.112.75.238] (HELO iron-u-a-out.osis.gov) (206.112.75.238) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 12 Apr 2012 16:58:48 +0000 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: At4EAC4Jh0+sEAbx/2dsb2JhbABDuXqBD4IJAQEBAwESAmoLCwsNLiITBR0ZGweHZ6AaCpZokX4ElWyTOQ X-IronPort-AV: E=Sophos;i="4.75,412,1330923600"; d="scan'208";a="8892271" Received: from ghost-a.center.osis.gov (HELO mail-pb0-f41.google.com) ([172.16.6.241]) by iron-u-a-in.osis.gov with ESMTP/TLS/RC4-SHA; 12 Apr 2012 12:57:35 -0400 Received: by pbcup15 with SMTP id up15so2750125pbc.0 for ; Thu, 12 Apr 2012 09:58:26 -0700 (PDT) MIME-Version: 1.0 Received: by 10.68.130.170 with SMTP id of10mr3990560pbb.156.1334249906292; Thu, 12 Apr 2012 09:58:26 -0700 (PDT) Received: by 10.68.221.168 with HTTP; Thu, 12 Apr 2012 09:58:26 -0700 (PDT) In-Reply-To: References: <1049765223.380354.1334198617508.JavaMail.root@linzimmb06o.imo.intelink.gov> Date: Thu, 12 Apr 2012 12:58:26 -0400 Message-ID: Subject: Re: recursive rollup From: Adam Fuchs To: user@accumulo.apache.org Content-Type: multipart/alternative; boundary=047d7b10cb6db4c8bd04bd7e437e X-Virus-Checked: Checked by ClamAV on apache.org --047d7b10cb6db4c8bd04bd7e437e Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Small correction: the branching factor would not have to be exactly 1, but it would be small on average (close to 1). Adam On Thu, Apr 12, 2012 at 12:50 PM, Adam Fuchs wrote: > This probably won't work, unless all node names are unique at a given > level. For example, given paths /a/c/d/e and /b/c/d/e you would have two > conflicting entries for "4:d/e childOf c/d". You might be able to use a > unique name instead with a similar scheme, but that could possibly > introduce a bottleneck. > > Another thought on this is if you actually have path depths in the > thousands then there's probably a compression scheme that could save you = a > lot of space and compute time, as the branching factor would necessarily = be > 1 for the majority of those nodes. > > Cheers, > Adam > > > On Wed, Apr 11, 2012 at 10:43 PM, Keith Turner wrote: > >> There is one other refinement to this table structure, the depth could >> be sorted in inverse order. To do this store (MAX_DEPTH - depth). >> This avoids the binary search the in filesystem example to find the >> max depth. So the tree would look like the following, assuming max >> dept is 999. >> >> 997:B/D childOf A/B >> 997:B/E childOf A/B >> 997:C/F childOf A/C >> 997:C/G childOf A/C >> 998:A/B childOf /A >> 998:A/C childOf /A >> >> Keith >> >> 2012/4/11 Perko, Ralph J : >> > Keith, thanks for the quick reply. I see the hierarchy is maintained i= n >> > the row id - would you still >> > recommend this approach for deep hierarchies, on the order of hundreds >> and >> > perhaps thousands? >> > >> > Maybe this is an accumulo newbie question - Is this approach better th= an >> > walking the tree? My current schema is something like >> > >> > A parentOf:B 1 >> > A parentOf:C 1 >> > B childOf:A 1 >> > C childOf:A 1 >> > And so on=85 >> > >> > Thanks, >> > Ralph >> > >> > >> > On 4/10/12 3:23 PM, "Keith Turner" wrote: >> > >> >>oooppsss I was not paying close attention, when it scans level 001 it >> >>will insert the following >> >> >> >> 000/A count =3D 6 (not 5) >> >> >> >>On Tue, Apr 10, 2012 at 6:21 PM, Keith Turner wrote= : >> >>> Take a look at the >> >>> org.apache.accumulo.examples.simple.dirlist.FileCount example, I >> >>> think it does what you are asking. >> >>> >> >>> The way it works it that assume your tree would be stored in accumul= o >> >>> as follows. >> >>> >> >>> 000/A >> >>> 001/A/B >> >>> 001/A/C >> >>> 002/A/B/D >> >>> 002/A/B/E >> >>> 002/A/C/F >> >>> 002/A/C/G >> >>> >> >>> The number before the path is the depth, therefore all data for a >> >>> particular depth of the tree is stored contiguously. The example >> >>> program scans each depth, starting with the highest depth, and push >> >>> counts up. This very efficient because there is not a lot of random >> >>> access, it sequentially reads data for each depth and stream updates >> >>> to the lower depth. >> >>> >> >>> For example when reading level 002, it would do the following two >> >>>inserts : >> >>> >> >>> 001/A/B count=3D2 >> >>> 001/A/C count=3D2 >> >>> >> >>> Then it would read level 001 and push the following insert : >> >>> >> >>> 000/A count =3D 5 >> >>> >> >>> Keith >> >>> >> >>> On Tue, Apr 10, 2012 at 6:02 PM, Perko, Ralph J > > >> >>>wrote: >> >>>> Hi, I wish to do a recursive rollup-count and am wondering the best >> >>>>way to >> >>>> do this. >> >>>> >> >>>> What I mean is this =AD in accumulo is a table with data that >> represents >> >>>>the >> >>>> nodes and leafs in a tree. Each node/leaf in accumulo knows it's >> >>>>parent >> >>>> and children and wether it is a node or leaf. I wish to have a >> >>>> rollup-count for any given node to know the combined total of all >> >>>> descendants. >> >>>> >> >>>> For example, given the tree: >> >>>> >> >>>> A >> >>>> | >> >>>> -------- >> >>>> | | >> >>>> B C >> >>>> | | >> >>>> ----- ----- >> >>>> | | | | >> >>>> D E F G >> >>>> >> >>>> "A" would have 6 descendants. >> >>>> >> >>>> I can use the SummingCombiner iterator to get a child count, e.g "A= " >> >>>>has 2 >> >>>> children, but I am not sure the best way to recurse down. The data >> is >> >>>> static so I do not necessarily need a dynamic, on-the-fly solution. >> >>>> >> >>>> Thanks for your help, >> >>>> Ralph >> > >> > > --047d7b10cb6db4c8bd04bd7e437e Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: quoted-printable Small correction: the branching factor would not have to be exactly 1, but = it would be small on average (close to 1).

Adam


On Thu, Apr 12, 2012 at 12:50 PM, Adam= Fuchs <adam.= p.fuchs@ugov.gov> wrote:
This probably won't work, unless all nod= e names are unique at a given level. For example, given paths /a/c/d/e and = /b/c/d/e you would have two conflicting entries for "4:d/e childOf c/d= ". You might be able to use a unique name instead with a similar schem= e, but that could possibly introduce a bottleneck.

Another thought on this is if you actually have path depths = in the thousands then there's probably a compression scheme that could = save you a lot of space and compute time, as the branching factor would nec= essarily be 1 for the majority of those nodes.

Cheers,
Adam


On Wed, Apr 11, 2012 a= t 10:43 PM, Keith Turner <keith@deenlo.com> wrote:
There is one other refinement to this table = structure, the depth could
be sorted in inverse order. =A0To do this store (MAX_DEPTH - depth).
This avoids the binary search the in filesystem example to find the
max depth. =A0So the tree would look like the following, assuming max
dept is 999.

997:B/D childOf A/B
997:B/E childOf A/B
997:C/F childOf A/C
997:C/G childOf A/C
998:A/B childOf /A
998:A/C childOf /A

Keith

2012/4/11 Perko, Ralph J <Ralph.Perko@pnnl.gov>:
> Keith, thanks for the quick reply. I see the hierarchy is maintained i= n
> the row id - would you still
> recommend this approach for deep hierarchies, on the order of hundreds= and
> perhaps thousands?
>
> Maybe this is an accumulo newbie question - Is this approach better th= an
> walking the tree? =A0My current schema is something like
>
> A parentOf:B 1
> A parentOf:C 1
> B childOf:A 1
> C childOf:A 1
> And so on=85
>
> Thanks,
> Ralph
>
>
> On 4/10/12 3:23 PM, "Keith Turner" <keith@deenlo.com> wrote:
>
>>oooppsss I was not paying close attention, =A0when it scans level 0= 01 it
>>will insert the following
>>
>> =A0000/A count =3D 6 (not 5)
>>
>>On Tue, Apr 10, 2012 at 6:21 PM, Keith Turner <keith@deenlo.com> wrote:
>>> Take a look at the
>>> org.apache.accumulo.examples.simple.dirlist.FileCount example,= =A0I
>>> think it does what you are asking.
>>>
>>> The way it works it that assume your tree would be stored in a= ccumulo
>>> as follows.
>>>
>>> 000/A
>>> 001/A/B
>>> 001/A/C
>>> 002/A/B/D
>>> 002/A/B/E
>>> 002/A/C/F
>>> 002/A/C/G
>>>
>>> The number before the path is the depth, therefore all data fo= r a
>>> particular depth of the tree is stored contiguously. =A0The ex= ample
>>> program scans each depth, starting with the highest depth, and= push
>>> counts up. =A0This very efficient because there is not a lot o= f random
>>> access, it sequentially reads data for each depth and stream u= pdates
>>> to the lower depth.
>>>
>>> For example when reading level 002, it would do the following = two
>>>inserts :
>>>
>>> =A0001/A/B count=3D2
>>> =A0001/A/C count=3D2
>>>
>>> Then it would read level 001 and push the following insert : >>>
>>> =A0000/A count =3D 5
>>>
>>> Keith
>>>
>>> On Tue, Apr 10, 2012 at 6:02 PM, Perko, Ralph J <Ralph.Perko@pnnl.gov>= ;
>>>wrote:
>>>> Hi, I wish to do a recursive rollup-count and am wondering= the best
>>>>way to
>>>> do this.
>>>>
>>>> What I mean is this =AD in accumulo is a table with data t= hat represents
>>>>the
>>>> nodes and leafs in a tree. =A0Each node/leaf in accumulo k= nows it's
>>>>parent
>>>> and children and wether it is a node or leaf. =A0I wish to= have a
>>>> rollup-count for any given node to know the combined total= of all
>>>> descendants.
>>>>
>>>> For example, given the tree:
>>>>
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0A
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0|
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 --------
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 =A0 =A0|=
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 B =A0 =A0 =A0C=
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 =A0 =A0|=
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 ----- =A0-----
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 | =A0| =A0 = =A0|
>>>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 D =A0 E =A0F =A0 = =A0G
>>>>
>>>> "A" would have 6 descendants.
>>>>
>>>> I can use the SummingCombiner iterator to get a child coun= t, e.g "A"
>>>>has 2
>>>> children, but I am not sure the best way to recurse down. = =A0The data is
>>>> static so I do not necessarily need a dynamic, on-the-fly = solution.
>>>>
>>>> Thanks for your help,
>>>> Ralph
>


--047d7b10cb6db4c8bd04bd7e437e--