accumulo-user mailing list archives

From Dylan Hutchison <dhutc...@stevens.edu>
Subject Re: schema examples
Date Sat, 28 Dec 2013 05:53:24 GMT
Hi Arshak,

Perhaps this might help, though I will defer to Jeremy on the finer points.
 Keep in mind that writing methods that interface with the D4M schema will
duplicate some of the Matlab D4M code.  Jeremy, is the Java source for the
D4M JARs available?

First let's weigh potential choices for row IDs:

   1. row ID = ReadingTimestamp.  The advantage is that you can easily
   query for continuous periods of time via the row, but the disadvantage is
   that all of your writes will go to the last tablet server, assuming that
   your data is streaming in live.  If your data is not streaming in live
   but is available offline, you can create table splits to distribute the
   timestamps evenly across your tablet servers.  Scanning may or may not
   bottleneck on one tablet server depending on your use case.
   2. row ID = ReadingTimestamp *reversed*.  For example, use 0005791918831
   instead of 1388191975000.  Now writes will naturally distribute across all
   your tablet servers.  However, you lose the ability to easily
   query continuous periods of time.
   3. row ID = a new unique ID assigned to each row in your table, leaving
   the ReadingTimestamp as another column.  This is a good all-around solution
   for the regular table, but what about the transpose table?  Because the
   transpose table switches rows and columns, it still suffers the
   disadvantage of option #1.
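
As an aside, the reversal in option #2 is a one-liner.  Here is a minimal
Java sketch (the class name is mine; it assumes the timestamp is kept as a
zero-padded, fixed-width string so that reversal is well defined and the
reversed IDs still sort consistently):

```java
// Illustrates option #2: reverse the digits of a fixed-width timestamp
// string so that sequential timestamps scatter across the keyspace.
public class ReversedRowId {

    // Reverse the characters of a zero-padded timestamp string.
    public static String reverse(String ts) {
        return new StringBuilder(ts).reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(reverse("1388191975000")); // prints 0005791918831
    }
}
```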

Since we don't have a clear winner, I wrote a quick implementation of
option #1.  Here's the original table, followed by the exploded table, its
transpose, and the degree table.

Original
              Load,Machine,Pool,
1388191975000,005, neptune,west,
1388191975010,009, neptune,west,
1388191975090,013, pluto,  east,

Exploded
              Load|005,Load|009,Load|013,Machine|neptune,Machine|pluto,Pool|east,Pool|west,
1388191975000,1,                         1,                                      1,
1388191975010,         1,                1,                                      1,
1388191975090,                  1,                       1,            1,



Transpose
                1388191975000,1388191975010,1388191975090,
Load|005,       1,
Load|009,                     1,
Load|013,                                   1,
Machine|neptune,1,            1,
Machine|pluto,                              1,
Pool|east,                                  1,
Pool|west,      1,            1,


Degree
                degree,
Load|005,       1,
Load|009,       1,
Load|013,       1,
Machine|neptune,2,
Machine|pluto,  1,
Pool|east,      1,
Pool|west,      2,


*1.  Should the transpose table be built as part of ingest code or as an
Accumulo combiner?*
I recommend ingest code for much greater simplicity, though it may be
possible to build a combiner that automatically ingests to a second table.
 When inserting (row,col,val) triples, do another insert to the transpose
with (col,row,val).
Use summing combiners to create the degree tables.
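
A minimal sketch of the ingest-side approach, with in-memory sorted maps
standing in for the three Accumulo tables (with the real client you would
build two Mutations and hand them to BatchWriters, and let a
SummingCombiner maintain the degree table server-side; all names here are
illustrative):

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: every (row, col, val) triple is written twice -- once to the
// main table and once, flipped, to the transpose -- and the column's
// degree is incremented, mimicking a SummingCombiner.
public class DualIngest {
    public static Map<String, Map<String, String>> table = new TreeMap<>();
    public static Map<String, Map<String, String>> transpose = new TreeMap<>();
    public static Map<String, Integer> degreeTbl = new TreeMap<>();

    public static void ingest(String row, String col, String val) {
        table.computeIfAbsent(row, k -> new TreeMap<>()).put(col, val);
        transpose.computeIfAbsent(col, k -> new TreeMap<>()).put(row, val);
        degreeTbl.merge(col, 1, Integer::sum); // what the combiner would do
    }

    public static void main(String[] args) {
        ingest("1388191975000", "Machine|neptune", "1");
        ingest("1388191975010", "Machine|neptune", "1");
        System.out.println(degreeTbl.get("Machine|neptune")); // prints 2
    }
}
```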

*2.  What does the degree table do in this example?  The paper mentions
it's useful for query optimization.  How?*
Suppose you're interested in querying the row timestamps that log a *load >
5* (query A) and come from the *east pool* (query B).  (You want A & B.)  *I
assume that the number of returned rows matching A & B is far less than the
total number of rows.*  (If they are roughly equal, you must query all the
rows anyway and a database won't help you.)  There are a few options:

   1. Scan through all the rows and return only those that match A & B.
    Sure, this will do the job, but what if you have 100M rows?  We can do
   better with our transpose table.
   2. Scan through the transpose table for the timestamps that match query
   A.  In our example, this returns 2 rows with loads > 5.  Then retain only
   the rows that also match query B and come from the east pool.
   3. Scan through the transpose table for the timestamps that match query
   B.  In our example, this returns 1 row that comes from the east pool.  Then
   retain only the rows that also match query A and have loads > 5.

How can we tell whether query strategy #2 or #3 is better?  Choose the one
that returns the fewest rows first!  We determine that by consulting the
degree table (two quick lookups for counts): there are 2 rows matching
query A and 1 row matching query B, making strategy #3 the better choice.
Of course, the difference is only 1 row in our small example, but it could
be much larger on a real database.
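
The planning step is just a comparison of two counts.  A toy sketch (class
and method names are mine, and the in-memory map stands in for degree-table
lookups):

```java
import java.util.HashMap;
import java.util.Map;

// Toy query planner: two lookups against a degree table decide which
// predicate to scan first in the transpose table.
public class DegreePlanner {
    public static Map<String, Integer> degree = new HashMap<>();

    // Return whichever predicate has the smaller degree (fewer matching
    // rows); scanning it first minimizes the candidate set to filter.
    public static String pickFirst(String predA, String predB) {
        return degree.getOrDefault(predA, 0) <= degree.getOrDefault(predB, 0)
                ? predA : predB;
    }

    public static void main(String[] args) {
        // Counts from the example: query A (load > 5) matches 2 rows,
        // query B (east pool) matches 1 row.
        degree.put("A:load>5", 2);
        degree.put("B:pool=east", 1);
        System.out.println(pickFirst("A:load>5", "B:pool=east"));
        // prints B:pool=east -- i.e., strategy #3
    }
}
```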

*3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The
wikisearch shows how the partition id is important for parallel scans of
the index.  But since Accumulo is a row store how can you do fast lookups
by row if you've used the row_id as a partition key.*
Table design is tough when optimizing for the general case.  Hopefully the
above tradeoff analysis helps present different options.


I attached the CSV file and a few lines of Matlab code to recreate the
tables.  If you don't mind reading a little Matlab, take a look at the
tutorial here:

https://github.com/denine99/d4mBB

I work out a very similar query optimization solution on baseball data in
the section "%% Find how many players weigh < 200 lb. and bat with left
hand or both hands".

Cheers,
Dylan Hutchison



On Fri, Dec 27, 2013 at 8:01 PM, Arshak Navruzyan <arshakn@gmail.com> wrote:

> Jeremy,
>
> Wow, didn't expect to get help from the author :)
>
> How about something simple like this:
>
> Machine    Pool    Load    ReadingTimestamp
> neptune    west    5       1388191975000
> neptune    west    9       1388191975010
> pluto      east    13      1388191975090
>
> These are the areas I am unclear on:
>
> 1.  Should the transpose table be built as part of ingest code or as an
> accumulo combiner?
> 2.  What does the degree table do in this example ?  The paper mentions
> it's useful for query optimization.  How?
> 3.  Does D4M accommodate "repurposing" the row_id to a partition key?  The
> wikisearch shows how the partition id is important for parallel scans of
> the index.  But since Accumulo is a row store how can you do fast lookups
> by row if you've used the row_id as a partition key.
>
> Thank you,
>
> Arshak
>
>
>
>
>
>
> On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <kepner@ll.mit.edu> wrote:
>
>> Hi Arshak,
>>   Maybe you can send a few (~3) records of data that you are familiar with
>> and we can walk you through how the D4M schema would be applied to those
>> records.
>>
>> Regards.  -Jeremy
>>
>> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>> >    Hello,
>> >    I am trying to get my head around Accumulo schema designs.  I went
>> through
>> >    a lot of trouble to get the wikisearch example running but since the
>> data
>> >    in protobuf lists, it's not that illustrative (for a newbie).
>> >    Would love to find another example that is a little simpler to
>> understand.
>> >     In particular I am interested in java/scala code that mimics the D4M
>> >    schema design (not a Matlab guy).
>> >    Thanks,
>> >    Arshak
>>
>
>


-- 
www.cs.stevens.edu/~dhutchis
