IPv6 can support up to 281,474,976,710,656 (2^48) networks. Even if you only
want to group by network, that is already a potentially very large keyspace.
The *minimum* number of distinct addresses a V6 network can contain (the
smallest advertisable prefix is /48, which leaves 80 bits) is
1,208,925,819,614,629,174,706,176 (2^80). This is the bigger problem: if you
are also counting distinct addresses, then let's hope the observations you
are counting within this space are very, very sparse, or it may take a while
to calculate that aggregate.

I don't have a good answer for adjusting to the scale of IPv6, but the old V4
notion of counting distinct endpoints by address may no longer be useful.
Consider a device on a /48. It could use a unique address for every packet
and not exhaust its network space for 383,093,657,352 years at a rate of
100 Kpps. This is a pathological case (we assume malicious actors), but the
question remains: in V6, is it useful to use an address as a proxy for the
identity of a unique endpoint?

Counting by a product GUID instead would bring the size of the keyspace down
to only millions of rows. This seems a good alternative strategy. If you
don't control the endpoint and still want to count unique conversations, I
would determine the physical path between endpoints and construct an
identifier based on that. Our planet is very small compared to the
astronomical scale of V6.
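For anyone who wants to check the arithmetic, here is a small sketch in Python. The year length (a mean tropical year of 365.2422 days) is my assumption; the constant names are only illustrative.

```python
# Back-of-the-envelope IPv6 keyspace arithmetic using plain Python integers.
NETWORK_BITS = 48    # /48 taken as the smallest advertisable prefix
ADDRESS_BITS = 128   # total length of an IPv6 address

networks = 2 ** NETWORK_BITS
addresses_per_network = 2 ** (ADDRESS_BITS - NETWORK_BITS)

PACKETS_PER_SECOND = 100_000                 # 100 Kpps, one new address per packet
SECONDS_PER_YEAR = 365.2422 * 24 * 60 * 60   # assumption: mean tropical year

years_to_exhaust = addresses_per_network / PACKETS_PER_SECOND / SECONDS_PER_YEAR

print(f"networks:              {networks:,}")
print(f"addresses per /48:     {addresses_per_network:,}")
print(f"years to exhaust /48:  {years_to_exhaust:,.0f}")
```

The first two values are exact powers of two; the third is a floating-point estimate on the order of 3.8e11 years and shifts slightly with the year length you pick.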

On Sun, Jan 27, 2013 at 9:37 AM, JeanMarc Spaggiari
<jeanmarc@spaggiari.org> wrote:
> What I would like is to have a faster (direct?) access to the number
> of entries starting with "058".
>
> For IPv4 it's 0 to 255, so working fine. For IPv6, it can take a
> while to scan the full range and aggregate.
>
> JM
>
> 2013/1/27, lars hofhansl <larsh@apache.org>:
> > I might be missing something. Why not just have a counter per IP and
> > aggregate at read time?
> > If you wanted the total of the 058 group you'd start a scanner with
> > "058" as start row and "058\0" as stop row. On the client you sum up
> > the counter values.
> > Similarly for the 109.169 group. Start with "109.169" and stop
> > "109.169\0".
> >
> >  Lars
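A minimal sketch of the read-time aggregation described above, simulated in Python over a sorted list rather than a real HBase table. The counts are made up, and `prefix_stop_key` and `sum_prefix` are hypothetical helper names, not HBase API. One assumption worth stating: with plain byte ordering and an exclusive stop row, "058\0" sorts before "058.022...", so this sketch derives the stop key by incrementing the last byte of the prefix instead ("058" becomes "059").

```python
from bisect import bisect_left

def prefix_stop_key(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with `prefix`.

    Returns b"" when no such key exists (prefix is empty or all 0xff bytes).
    """
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""
    return trimmed[:-1] + bytes([trimmed[-1] + 1])

def sum_prefix(rows, prefix: bytes) -> int:
    """Sum counters for all keys in [prefix, stop), like a client-side scan.

    `rows` is a list of (key, count) tuples sorted by key.
    """
    keys = [key for key, _ in rows]
    lo = bisect_left(keys, prefix)
    stop = prefix_stop_key(prefix)
    hi = bisect_left(keys, stop) if stop else len(keys)
    return sum(count for _, count in rows[lo:hi])

# Hypothetical counters, keyed by zero-padded IP, kept in sorted order.
rows = sorted([
    (b"037.113.031.119", 3),
    (b"058.022.018.176", 5),
    (b"058.022.159.151", 2),
    (b"109.169.201.076", 7),
    (b"109.169.201.150", 1),
    (b"109.254.019.140", 4),
])

total_058 = sum_prefix(rows, b"058")          # both 058.* rows
total_109_169 = sum_prefix(rows, b"109.169")  # both 109.169.* rows
```

The cost is proportional to the number of rows under the prefix, which is the concern raised for IPv6-sized ranges.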
> >
> >
> >
> > ________________________________
> > From: JeanMarc Spaggiari <jeanmarc@spaggiari.org>
> > To: user <user@hbase.apache.org>
> > Sent: Sunday, January 27, 2013 8:51 AM
> > Subject: Tables vs CFs vs Cs
> >
> > Hi,
> >
> > Let's imagine this scenario.
> >
> > I want to store IPs with counters. And I want to have counters by
> > groups of IPs. All of that will be calculated with MR jobs and stored
> > in HBase.
> >
> > Let's take some IPs and make sure they are ordered by adding some "0"
> > when required.
> >
> > 037.113.031.119
> > 058.022.018.176
> > 058.022.159.151
> > 109.169.201.076
> > 109.169.201.150
> > 109.254.019.140
> > 122.031.039.016
> > 122.224.005.210
> > 178.137.167.041
> >
> > I want to have counters for all "levels" of those IPs, which means for
> > these groups:
> >
> > Group 1:
> > 037
> > 058
> > 109
> > 122
> > 178
> >
> > Group 2:
> >
> > 037.113
> > 058.022
> > 109.169
> > 109.254
> > 122.031
> > 122.224
> > 178.137
> >
> > Group 3:
> >
> > 037.113.031
> > 058.022.018
> > 058.022.159
> > 109.169.201
> > 109.254.019
> > 122.031.039
> > 122.224.005
> > 178.137.167
> >
> > And group 4 is the complete IPs list.
> >
> > Each time I see an IP, I will increment the required values into the 4
> > groups.
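The four-level increment described above can be sketched in Python as follows. This is only an illustration of deriving the group keys from one address, with an in-memory `Counter` standing in for the HBase counters; `padded` and `group_keys` are hypothetical names.

```python
from collections import Counter

def padded(ip: str) -> str:
    """Zero-pad each octet to three digits so keys sort lexicographically."""
    return ".".join(f"{int(octet):03d}" for octet in ip.split("."))

def group_keys(ip: str):
    """Return the keys for groups 1 to 4 of a dotted IPv4 address."""
    parts = padded(ip).split(".")
    return [".".join(parts[:n]) for n in range(1, 5)]

# In-memory stand-in for the counters; each observed IP bumps four keys.
counters = Counter()
for ip in ["58.22.18.176", "58.22.159.151", "109.169.201.76"]:
    counters.update(group_keys(ip))

print(group_keys("58.22.18.176"))
```

Because every key is a prefix of the full padded address, the keys of one group can share a table with the others or live in separate ones; the derivation is the same either way.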
> >
> > What's the best way to store that, knowing that I want to be able to
> > easily list all the entries (range-based) from one group?
> >
> > Option 1 is to have one table per group. 1CF, 1C
> > Pros: Very easy to access, retrieve, etc.
> > Cons: Will generate 4 tables
> >
> > Option 2 is to have one table, but 1 CF per group.
> > Pros: Only one table, easy access.
> > Cons: Heard that we should try to keep CFs under 3. Might have a bad
> > performance impact.
> >
> > Option 3 is to have one table, one CF and one C per group.
> > Pros: Only one table, only one CF.
> > Cons: Access is less easy than option 1 and 2.
> >
> > I think Option 2 is the worst one. Option 1 is very easy to implement.
> > And for option 3, I don't see any benefit compared to option 1.
> >
> > So I'm tempted to go with option 1, but I don't like the idea of
> > multiplying tables.
> >
> > Does anyone have any comments on which option might be the best one,
> > or want to propose another option?
> >
> > JM
>

Best regards,
 Andy
Problems worthy of attack prove their worth by hitting back.  Piet Hein
(via Tom White)
