accumulo-user mailing list archives

From "Kepner, Jeremy - 0553 - MITLL" <kep...@ll.mit.edu>
Subject Re: schema examples
Date Sun, 29 Dec 2013 17:12:40 GMT
FYI, we just insert all the triples into both Tedge and TedgeTranspose using separate batch writers and let Accumulo figure out which ones belong in the same row.  This has worked well for us.
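
For illustration, here is a minimal Java sketch of that dual-insert pattern (a sketch only: connector setup and error handling are omitted, "conn" is an already-created Connector, and table names follow the schema below):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// One writer per table; Accumulo assembles entries that share a row key
// into the same row server-side, so the client needs no coordination.
BatchWriter edge = conn.createBatchWriter("Tedge", new BatchWriterConfig());
BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", new BatchWriterConfig());

// Write the triple to Tedge ...
Mutation m = new Mutation(new Text("0005791918831-neptune"));
m.put(new Text(""), new Text("Machine|neptune"), new Value("1".getBytes()));
edge.addMutation(m);

// ... and the swapped (row <-> column) form to TedgeTranspose.
Mutation mt = new Mutation(new Text("Machine|neptune"));
mt.put(new Text(""), new Text("0005791918831-neptune"), new Value("1".getBytes()));
transpose.addMutation(mt);

edge.close();
transpose.close();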

On Dec 29, 2013, at 11:57 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:

> Sorry I mixed things up.  It was in the wikisearch example:
> 
> http://accumulo.apache.org/example/wikisearch.html
> 
> "If the cardinality is small enough, it will track the set of documents by term directly."
> 
> 
> On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL <kepner@ll.mit.edu> wrote:
> Hi Arshak,
>   See interspersed below.
> Regards.  -Jeremy
> 
> On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
> 
>> Jeremy,
>> 
>> Thanks for the detailed explanation.  Just a couple of final questions:
>> 
>> 1.  What's your advice on the transpose table: is it better to repeat the indexed term (one entry per matching row id), or to store all matching row ids from Tedge in a single row in TedgeTranspose (using protobuf, for example)?  What's the performance implication of each approach?  In the paper you mentioned that if there are only a few values they should just be stored together.  Was there a cut-off point in your testing?
> 
> Can you clarify?  I am not sure what you're asking.
> 
>> 
>> 2.  You mentioned that the degrees should be calculated beforehand for high ingest rates.  Doesn't this change Accumulo from being a true database to being more of an index?  If changes to the data cause the degree table to get out of sync, it sounds like changes have to be applied elsewhere first and Accumulo has to be reloaded periodically.  Or perhaps letting the degree table get out of sync is OK since it's just an assist...
> 
> My point was a very narrow comment on optimization in very high performance situations.  I probably shouldn't have mentioned it.  If you ever have performance issues with your degree tables, that would be the time to discuss it.  You may never encounter this issue.
> 
>> Thanks,
>> 
>> Arshak
>> 
>> 
>> On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <kepner@ll.mit.edu> wrote:
>> Hi Arshak,
>>   Here is how you might do it.  We implement everything with batch writers and batch scanners.  Note: if you are doing high ingest rates, the degree table can be tricky and usually requires pre-summing prior to ingestion to reduce the pressure on the accumulator inside of Accumulo.  Feel free to ask further questions, as I would imagine there are details that still wouldn't be clear.  In particular, why we do it this way.
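>> 
>> A hypothetical sketch of the pre-summing idea (Record and explode() are placeholder names, not real API): accumulate the degree increments client-side for each batch, write one mutation per key, and let a SummingCombiner configured on TedgeDegree do the server-side addition:
>> 
>> Map<String, Long> localCounts = new HashMap<String, Long>();
>> for (Record r : batch) {                      // placeholder record type
>>   for (String colQual : r.explode()) {        // e.g. "Machine|neptune", "Pool|west"
>>     Long c = localCounts.get(colQual);
>>     localCounts.put(colQual, c == null ? 1L : c + 1L);
>>   }
>> }
>> BatchWriter degree = conn.createBatchWriter("TedgeDegree", new BatchWriterConfig());
>> for (Map.Entry<String, Long> e : localCounts.entrySet()) {
>>   Mutation m = new Mutation(new Text(e.getKey()));
>>   m.put(new Text(""), new Text("Degree"),
>>         new Value(e.getValue().toString().getBytes()));
>>   degree.addMutation(m);                      // one increment per key per batch
>> }
>> degree.close();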
>> 
>> Regards.  -Jeremy
>> 
>> Original data:
>> 
>> Machine,Pool,Load,ReadingTimestamp
>> neptune,west,5,1388191975000
>> neptune,west,9,1388191975010
>> pluto,east,13,1388191975090
>> 
>> 
>> Tedge table:
>> rowKey,columnQualifier,value
>> 
>> 0005791918831-neptune,Machine|neptune,1
>> 0005791918831-neptune,Pool|west,1
>> 0005791918831-neptune,Load|5,1
>> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
>> 0105791918831-neptune,Machine|neptune,1
>> 0105791918831-neptune,Pool|west,1
>> 0105791918831-neptune,Load|9,1
>> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
>> 0905791918831-pluto,Machine|pluto,1
>> 0905791918831-pluto,Pool|east,1
>> 0905791918831-pluto,Load|13,1
>> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
>> 
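>> The row key, if it isn't obvious from the digits, is the reading timestamp with its characters reversed, followed by the machine name; flipping the digits keeps otherwise-sequential keys from all landing on the same "latest" tablet during ingest.  A hypothetical helper (not from the original message):
>> 
>> static String rowKey(long readingTimestamp, String machine) {
>>   String digits = Long.toString(readingTimestamp);           // "1388191975000"
>>   String flipped = new StringBuilder(digits).reverse().toString();
>>   return flipped + "-" + machine;                            // "0005791918831-neptune"
>> }
>> 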
>> 
>> TedgeTranspose table:
>> rowKey,columnQualifier,value
>> 
>> Machine|neptune,0005791918831-neptune,1
>> Pool|west,0005791918831-neptune,1
>> Load|5,0005791918831-neptune,1
>> ReadingTimestamp|1388191975000,0005791918831-neptune,1
>> Machine|neptune,0105791918831-neptune,1
>> Pool|west,0105791918831-neptune,1
>> Load|9,0105791918831-neptune,1
>> ReadingTimestamp|1388191975010,0105791918831-neptune,1
>> Machine|pluto,0905791918831-pluto,1
>> Pool|east,0905791918831-pluto,1
>> Load|13,0905791918831-pluto,1
>> ReadingTimestamp|1388191975090,0905791918831-pluto,1
>> 
>> 
>> TedgeDegree table:
>> rowKey,columnQualifier,value
>> 
>> Machine|neptune,Degree,2
>> Pool|west,Degree,2
>> Load|5,Degree,1
>> ReadingTimestamp|1388191975000,Degree,1
>> Load|9,Degree,1
>> ReadingTimestamp|1388191975010,Degree,1
>> Machine|pluto,Degree,1
>> Pool|east,Degree,1
>> Load|13,Degree,1
>> ReadingTimestamp|1388191975090,Degree,1
>> 
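>> (A sketch of the query-optimization use Arshak asks about below: before intersecting several terms, look up each term's count in TedgeDegree and scan the rarest term first in TedgeTranspose, so the candidate set starts as small as possible.  Assumes the standard Scanner API:)
>> 
>> Scanner s = conn.createScanner("TedgeDegree", Authorizations.EMPTY);
>> s.setRange(Range.exact(new Text("Machine|neptune")));
>> long deg = 0;
>> for (Map.Entry<Key, Value> e : s) {
>>   deg = Long.parseLong(e.getValue().toString());   // 2 in this example
>> }
>> // Compare degrees across all query terms; fetch the lowest-degree term's
>> // row from TedgeTranspose first.
>> 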
>> 
>> TedgeText table:
>> rowKey,columnQualifier,value
>> 
>> 0005791918831-neptune,Text,< ... raw text of original log ...>
>> 0105791918831-neptune,Text,< ... raw text of original log ...>
>> 0905791918831-pluto,Text,< ... raw text of original log ...>
>> 
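>> A sketch of the full lookup path (authorizations and error handling omitted): use the transpose to find the matching row keys, then pull the raw text back with a BatchScanner:
>> 
>> Scanner t = conn.createScanner("TedgeTranspose", Authorizations.EMPTY);
>> t.setRange(Range.exact(new Text("Machine|neptune")));
>> List<Range> rows = new ArrayList<Range>();
>> for (Map.Entry<Key, Value> e : t) {
>>   // the column qualifier in the transpose is the Tedge/TedgeText row key
>>   rows.add(Range.exact(e.getKey().getColumnQualifier()));
>> }
>> BatchScanner text = conn.createBatchScanner("TedgeText", Authorizations.EMPTY, 4);
>> text.setRanges(rows);
>> for (Map.Entry<Key, Value> e : text) {
>>   System.out.println(e.getKey().getRow() + " -> " + e.getValue());
>> }
>> text.close();
>> 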
>> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>> 
>> > Jeremy,
>> >
>> > Wow, didn't expect to get help from the author :)
>> >
>> > How about something simple like this:
>> >
>> > Machine    Pool    Load    ReadingTimestamp
>> > neptune    west    5       1388191975000
>> > neptune    west    9       1388191975010
>> > pluto      east    13      1388191975090
>> >
>> > These are the areas I am unclear on:
>> >
>> > 1.  Should the transpose table be built as part of the ingest code or as an Accumulo combiner?
>> > 2.  What does the degree table do in this example?  The paper mentions it's useful for query optimization.  How?
>> > 3.  Does D4M accommodate "repurposing" the row_id as a partition key?  The wikisearch example shows how the partition id is important for parallel scans of the index.  But since Accumulo is a row store, how can you do fast lookups by row if you've used the row_id as a partition key?
>> >
>> > Thank you,
>> >
>> > Arshak
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <kepner@ll.mit.edu> wrote:
>> > Hi Arshak,
>> >   Maybe you can send a few (~3) records of data that you are familiar with
>> > and we can walk you through how the D4M schema would be applied to those records.
>> >
>> > Regards.  -Jeremy
>> >
>> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>> > >    Hello,
>> > >    I am trying to get my head around Accumulo schema designs.  I went through
>> > >    a lot of trouble to get the wikisearch example running, but since the data
>> > >    is in protobuf lists, it's not that illustrative (for a newbie).
>> > >    Would love to find another example that is a little simpler to understand.
>> > >    In particular I am interested in Java/Scala code that mimics the D4M
>> > >    schema design (not a Matlab guy).
>> > >    Thanks,
>> > >    Arshak
>> >
>> 
>> 
> 
> 

