accumulo-user mailing list archives

From Arshak Navruzyan <arsh...@gmail.com>
Subject Re: schema examples
Date Sun, 29 Dec 2013 20:10:11 GMT
Got it, thanks again Jeremy!


On Sun, Dec 29, 2013 at 9:12 AM, Kepner, Jeremy - 0553 - MITLL <
kepner@ll.mit.edu> wrote:

> FYI, we just insert all the triples into both Tedge and TedgeTranspose
> using separate batch writers and let Accumulo figure out which ones belong
> in the same row. This has worked well for us.
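>
> A minimal sketch of that pattern (assuming an Accumulo Connector named
> conn; the Triple holder is illustrative and error handling is omitted,
> so this is not our exact code):
>
> import org.apache.accumulo.core.client.BatchWriter;
> import org.apache.accumulo.core.client.BatchWriterConfig;
> import org.apache.accumulo.core.data.Mutation;
> import org.apache.accumulo.core.data.Value;
> import org.apache.hadoop.io.Text;
>
> BatchWriter edge = conn.createBatchWriter("Tedge", new BatchWriterConfig());
> BatchWriter transpose = conn.createBatchWriter("TedgeTranspose",
>                                                new BatchWriterConfig());
> for (Triple t : triples) {
>     Mutation m = new Mutation(new Text(t.row));
>     m.put(new Text(""), new Text(t.col), new Value("1".getBytes()));
>     edge.addMutation(m);           // e.g. 0005791918831-neptune, Pool|west
>
>     Mutation mt = new Mutation(new Text(t.col));  // row and column swapped
>     mt.put(new Text(""), new Text(t.row), new Value("1".getBytes()));
>     transpose.addMutation(mt);     // e.g. Pool|west, 0005791918831-neptune
> }
> edge.close();
> transpose.close();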
>
> On Dec 29, 2013, at 11:57 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>
> Sorry I mixed things up.  It was in the wikisearch example:
>
> http://accumulo.apache.org/example/wikisearch.html
>
> "If the cardinality is small enough, it will track the set of documents by
> term directly."
>
>
> On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL <
> kepner@ll.mit.edu> wrote:
>
>> Hi Arshak,
>>   See interspersed below.
>> Regards.  -Jeremy
>>
>> On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>
>> Jeremy,
>>
>> Thanks for the detailed explanation.  Just a couple of final questions:
>>
>> 1.  What's your advice on the transpose table: is it better to repeat the
>> indexed term (one entry per matching row id), or to store all matching row
>> ids from Tedge in a single row of TedgeTranspose (using protobuf, for
>> example)?  What's the performance implication of each approach?  In the
>> paper you mentioned that if there are only a few values they should just
>> be stored together.  Was there a cut-off point in your testing?
>>
>>
>> Can you clarify?  I am not sure what you're asking.
>>
>>
>> 2.  You mentioned that the degrees should be calculated beforehand for
>> high ingest rates.  Doesn't this change Accumulo from being a true database
>> to being more of an index?  If changes to the data cause the degree table
>> to get out of sync, it sounds like changes have to be applied elsewhere first
>> and Accumulo has to be reloaded periodically.  Or perhaps letting the
>> degree table get out of sync is ok since it's just an assist...
>>
>>
>> My point was a very narrow comment on optimization in very high
>> performance situations; I probably shouldn't have mentioned it.  If you
>> ever have performance issues with your degree tables, that would be the
>> time to discuss it.  You may never encounter this issue.
>>
>> Thanks,
>>
>> Arshak
>>
>>
>> On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <
>> kepner@ll.mit.edu> wrote:
>>
>>> Hi Arshak,
>>>   Here is how you might do it.  We implement everything with batch
>>> writers and batch scanners.  Note: if you are doing high ingest rates, the
>>> degree table can be tricky and usually requires pre-summing prior to
>>> ingestion to reduce the pressure on the accumulator inside of Accumulo.
>>> Feel free to ask further questions, as I would imagine there are details
>>> that still wouldn't be clear, in particular why we do it this way.
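>>>
>>> By pre-summing I mean aggregating the degree increments in memory for
>>> each batch, so every term yields one partial sum rather than one cell
>>> per triple.  A sketch (degreeWriter is a BatchWriter on the degree
>>> table; the names are illustrative, not our exact code):
>>>
>>> import java.util.HashMap;
>>> import java.util.Map;
>>> import org.apache.accumulo.core.data.Mutation;
>>> import org.apache.accumulo.core.data.Value;
>>> import org.apache.hadoop.io.Text;
>>>
>>> Map<String, Long> degrees = new HashMap<String, Long>();
>>> for (Triple t : batch) {                 // count locally first
>>>     Long c = degrees.get(t.col);
>>>     degrees.put(t.col, c == null ? 1L : c + 1L);
>>> }
>>> for (Map.Entry<String, Long> e : degrees.entrySet()) {
>>>     Mutation m = new Mutation(new Text(e.getKey()));
>>>     m.put(new Text(""), new Text("Degree"),
>>>           new Value(e.getValue().toString().getBytes()));
>>>     degreeWriter.addMutation(m);         // one partial sum per term
>>> }
>>>
>>> A combiner on the degree table then merges the partial sums server-side
>>> (see the setup sketch under the TedgeDegree table below).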
>>>
>>> Regards.  -Jeremy
>>>
>>> Original data:
>>>
>>> Machine,Pool,Load,ReadingTimestamp
>>> neptune,west,5,1388191975000
>>> neptune,west,9,1388191975010
>>> pluto,east,13,1388191975090
>>>
>>>
>>> Tedge table:
>>> rowKey,columnQualifier,value
>>>
>>> 0005791918831-neptune,Machine|neptune,1
>>> 0005791918831-neptune,Pool|west,1
>>> 0005791918831-neptune,Load|5,1
>>> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
>>> 0105791918831-neptune,Machine|neptune,1
>>> 0105791918831-neptune,Pool|west,1
>>> 0105791918831-neptune,Load|9,1
>>> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
>>> 0905791918831-pluto,Machine|pluto,1
>>> 0905791918831-pluto,Pool|east,1
>>> 0905791918831-pluto,Load|13,1
>>> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
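>>>
>>> The row keys above are the ReadingTimestamp with its digits reversed and
>>> the machine name appended.  Reversing puts the fastest-changing digit
>>> first, which helps spread otherwise-sequential keys across tablets.  The
>>> construction is just (a sketch):
>>>
>>> String ts = "1388191975000";
>>> String machine = "neptune";
>>> // digit reversal: 1388191975000 -> 0005791918831
>>> String rowKey = new StringBuilder(ts).reverse().toString() + "-" + machine;
>>> // rowKey == "0005791918831-neptune"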
>>>
>>>
>>> TedgeTranspose table:
>>> rowKey,columnQualifier,value
>>>
>>> Machine|neptune,0005791918831-neptune,1
>>> Pool|west,0005791918831-neptune,1
>>> Load|5,0005791918831-neptune,1
>>> ReadingTimestamp|1388191975000,0005791918831-neptune,1
>>> Machine|neptune,0105791918831-neptune,1
>>> Pool|west,0105791918831-neptune,1
>>> Load|9,0105791918831-neptune,1
>>> ReadingTimestamp|1388191975010,0105791918831-neptune,1
>>> Machine|pluto,0905791918831-pluto,1
>>> Pool|east,0905791918831-pluto,1
>>> Load|13,0905791918831-pluto,1
>>> ReadingTimestamp|1388191975090,0905791918831-pluto,1
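>>>
>>> The transpose table turns a lookup by value into an ordinary row scan.
>>> For example, to find every reading from the west pool (a sketch, again
>>> assuming a Connector named conn and no cell-level security):
>>>
>>> import java.util.Collections;
>>> import java.util.Map;
>>> import org.apache.accumulo.core.client.BatchScanner;
>>> import org.apache.accumulo.core.data.Key;
>>> import org.apache.accumulo.core.data.Range;
>>> import org.apache.accumulo.core.data.Value;
>>> import org.apache.accumulo.core.security.Authorizations;
>>>
>>> BatchScanner bs = conn.createBatchScanner("TedgeTranspose",
>>>                                           Authorizations.EMPTY, 4);
>>> bs.setRanges(Collections.singleton(new Range("Pool|west")));
>>> for (Map.Entry<Key, Value> e : bs) {
>>>     // each column qualifier is a Tedge row key, e.g. 0005791918831-neptune
>>>     System.out.println(e.getKey().getColumnQualifier());
>>> }
>>> bs.close();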
>>>
>>>
>>> TedgeDegree table:
>>> rowKey,columnQualifier,value
>>>
>>> Machine|neptune,Degree,2
>>> Pool|west,Degree,2
>>> Load|5,Degree,1
>>> ReadingTimestamp|1388191975000,Degree,1
>>> Load|9,Degree,1
>>> ReadingTimestamp|1388191975010,Degree,1
>>> Machine|pluto,Degree,1
>>> Pool|east,Degree,1
>>> Load|13,Degree,1
>>> ReadingTimestamp|1388191975090,Degree,1
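>>>
>>> Ingest writes partial counts into this table (a 1 per triple, or the
>>> pre-summed batch totals mentioned earlier); a SummingCombiner folds them
>>> into the degrees shown above.  One-time setup, roughly:
>>>
>>> import java.util.Collections;
>>> import org.apache.accumulo.core.client.IteratorSetting;
>>> import org.apache.accumulo.core.iterators.LongCombiner;
>>> import org.apache.accumulo.core.iterators.user.SummingCombiner;
>>>
>>> IteratorSetting is = new IteratorSetting(10, SummingCombiner.class);
>>> LongCombiner.setEncodingType(is, LongCombiner.Type.STRING);
>>> SummingCombiner.setColumns(is,
>>>     Collections.singletonList(new IteratorSetting.Column("", "Degree")));
>>> conn.tableOperations().attachIterator("TedgeDegree", is);
>>>
>>> At query time this table is what makes multi-term queries cheap: look up
>>> each term's degree first, then scan the transpose table starting with
>>> the rarest term, so the intersection touches as few entries as possible.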
>>>
>>>
>>> TedgeText table:
>>> rowKey,columnQualifier,value
>>>
>>> 0005791918831-neptune,Text,< ... raw text of original log ...>
>>> 0105791918831-neptune,Text,< ... raw text of original log ...>
>>> 0905791918831-pluto,Text,< ... raw text of original log ...>
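>>>
>>> So a typical query is two steps: collect row keys from TedgeTranspose,
>>> then pull the original records from TedgeText.  Roughly:
>>>
>>> import java.util.Map;
>>> import org.apache.accumulo.core.client.Scanner;
>>> import org.apache.accumulo.core.data.Key;
>>> import org.apache.accumulo.core.data.Range;
>>> import org.apache.accumulo.core.data.Value;
>>> import org.apache.accumulo.core.security.Authorizations;
>>> import org.apache.hadoop.io.Text;
>>>
>>> Scanner s = conn.createScanner("TedgeText", Authorizations.EMPTY);
>>> s.setRange(new Range("0005791918831-neptune"));
>>> s.fetchColumn(new Text(""), new Text("Text"));
>>> for (Map.Entry<Key, Value> e : s)
>>>     System.out.println(e.getValue());    // the raw log line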
>>>
>>> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>>
>>> > Jeremy,
>>> >
>>> > Wow, didn't expect to get help from the author :)
>>> >
>>> > How about something simple like this:
>>> >
>>> > Machine    Pool    Load    ReadingTimestamp
>>> > neptune    west    5       1388191975000
>>> > neptune    west    9       1388191975010
>>> > pluto      east    13      1388191975090
>>> >
>>> > These are the areas I am unclear on:
>>> >
>>> > 1.  Should the transpose table be built as part of the ingest code or
>>> > as an Accumulo combiner?
>>> > 2.  What does the degree table do in this example?  The paper mentions
>>> > it's useful for query optimization.  How?
>>> > 3.  Does D4M accommodate "repurposing" the row_id as a partition key?
>>> > The wikisearch example shows how the partition id is important for
>>> > parallel scans of the index.  But since Accumulo is a row store, how
>>> > can you do fast lookups by row if you've used the row_id as a partition
>>> > key?
>>> >
>>> > Thank you,
>>> >
>>> > Arshak
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <kepner@ll.mit.edu>
>>> wrote:
>>> > Hi Arshak,
>>> >   Maybe you can send a few (~3) records of data that you are familiar
>>> > with, and we can walk you through how the D4M schema would be applied
>>> > to those records.
>>> >
>>> > Regards.  -Jeremy
>>> >
>>> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>>> > >    Hello,
>>> > >    I am trying to get my head around Accumulo schema designs.  I went
>>> > >    through a lot of trouble to get the wikisearch example running,
>>> > >    but since the data is in protobuf lists, it's not that
>>> > >    illustrative (for a newbie).  Would love to find another example
>>> > >    that is a little simpler to understand.  In particular, I am
>>> > >    interested in Java/Scala code that mimics the D4M schema design
>>> > >    (not a Matlab guy).
>>> > >    Thanks,
>>> > >    Arshak
>>> >
>>>
>>>
>>
>>
>
>
