accumulo-user mailing list archives

From Arshak Navruzyan <arsh...@gmail.com>
Subject Re: schema examples
Date Sun, 29 Dec 2013 22:45:28 GMT
Josh, I am still a little stuck on how this would work in a transactional app
(i.e. a mixed workload of reads and writes).

I definitely see the power of using a serialized structure in order to
minimize the number of records, but what happens when rows get deleted
from the main table (or mutated)?  In the bloated model I could see some
referential integrity code zapping the corresponding index entries as well.  In the
serialized structure design it seems pretty complex to go and update every
serialized array that referenced that row.
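
To make the question concrete, here is roughly how I picture the delete path
in the expanded model, just as a sketch (the "index" table name and "docs"
column family are placeholders, not from either example):

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.io.Text;

public class IndexCleanup {
  // Expanded ("bloated") model: one index entry per (term, doc id), so a
  // delete just issues one putDelete per term the document contained.
  public static void deleteDocFromIndex(Connector conn, String docId,
      Iterable<String> terms) throws Exception {
    BatchWriter bw = conn.createBatchWriter("index", new BatchWriterConfig());
    try {
      for (String term : terms) {
        Mutation m = new Mutation(term);                 // row = indexed term
        m.putDelete(new Text("docs"), new Text(docId));  // colqual = doc id
        bw.addMutation(m);
      }
    } finally {
      bw.close();
    }
  }
}

In the serialized design the same delete seems to turn into a read-modify-write
of the term's single Value (deserialize the list, remove the doc id, write it
back), which is the part that worries me for a mixed workload.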

Is it fair to say that the D4M approach is a little better suited for
transactional apps and the wikisearch approach is better for read-optimized
index apps?


On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Some context here regarding the wikisearch:
>
> The point of the protocol buffers here (or any serialized structure in the
> Value) is to reduce the ingest pressure and increase query performance on
> the inverted index (or transpose table, if I follow the D4M phrasing).
>
> This works well because most languages (especially English) follow a
> Zipfian distribution: some terms appear very frequently while others occur
> very infrequently. For common terms, we don't want to bloat our index, nor
> spend time creating those index records (e.g. "the"). For uncommon terms,
> we still want direct access to these infrequent words (e.g.
> "supercalifragilisticexpialidocious").
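>
> To illustrate the cutoff idea only (this is not the actual wikisearch
> ingester; the threshold, class name, and value layout below are made up):
>
> import java.nio.charset.StandardCharsets;
> import java.util.List;
> import org.apache.accumulo.core.data.Value;
>
> public class CutoffSketch {
>   // Made-up threshold; the real cutoff would be tuned for the corpus.
>   static final int CARDINALITY_CUTOFF = 20;
>
>   // Uncommon terms keep explicit doc ids in the Value; common terms only
>   // record that the term occurs in this shard, plus a count.
>   static Value buildIndexValue(String shardId, List<String> docIds) {
>     String payload = (docIds.size() <= CARDINALITY_CUTOFF)
>         ? String.join(",", docIds)        // direct pointers to documents
>         : shardId + ":" + docIds.size();  // shard-level summary only
>     return new Value(payload.getBytes(StandardCharsets.UTF_8));
>   }
> }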
>
> The ingest effect is also rather interesting when dealing with Accumulo, as
> you're not just writing more data, but typically writing data to most (if
> not all) tservers. Even the tokenization of a single document is likely to
> create inserts to a majority of the tablets for your inverted index. When
> dealing with high ingest rates (live *or* bulk -- you still have to send
> data to these servers), minimizing the number of records becomes important,
> as creating them may be a bottleneck in your pipeline.
>
> The query implications are pretty straightforward: common terms don't
> bloat the index in size or slow down uncommon-term lookups, and those
> uncommon-term lookups remain specific to documents rather than a range
> (shard) of documents.
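>
> On the lookup side, just as a sketch (assuming an index table keyed by the
> term, as above; the table name and helper are placeholders), an uncommon
> term becomes a single exact-row scan:
>
> import java.util.Map;
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.Scanner;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
>
> public class TermLookup {
>   // Exact-row scan for one term; the Value holds the (small) doc list.
>   static void printDocsForTerm(Connector conn, String term) throws Exception {
>     Scanner s = conn.createScanner("index", Authorizations.EMPTY);
>     s.setRange(Range.exact(term));
>     for (Map.Entry<Key,Value> e : s) {
>       System.out.println(e.getKey().getColumnQualifier() + " -> " + e.getValue());
>     }
>   }
> }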
>
>
> On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
>
>> Sorry I mixed things up.  It was in the wikisearch example:
>>
>> http://accumulo.apache.org/example/wikisearch.html
>>
>> "If the cardinality is small enough, it will track the set of documents
>> by term directly."
>>
>>
>> On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL
>> <kepner@ll.mit.edu> wrote:
>>
>>     Hi Arshak,
>>        See interspersed below.
>>     Regards.  -Jeremy
>>
>>     On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <arshakn@gmail.com> wrote:
>>
>>>     Jeremy,
>>>
>>>     Thanks for the detailed explanation.  Just a couple of final
>>>     questions:
>>>
>>>     1.  What's your advice on the transpose table: is it better to
>>>     repeat the indexed term (one entry per matching row id), or to store
>>>     all matching row ids from tedge in a single row in tedgetranspose
>>>     (using protobuf, for example)?  What's the performance implication
>>>     of each approach?  In the paper you mentioned that if there are a few
>>>     values they should just be stored together.  Was there a cut-off
>>>     point in your testing?
>>>
>>
>>     Can you clarify?  I am not sure what you're asking.
>>
>>
>>>     2.  You mentioned that the degrees should be calculated beforehand
>>>     for high ingest rates.  Doesn't this change Accumulo from being a
>>>     true database to being more of an index?  If changes to the data
>>>     cause the degree table to get out of sync, it sounds like changes
>>>     have to be applied elsewhere first and Accumulo has to be reloaded
>>>     periodically.  Or perhaps letting the degree table get out of sync
>>>     is ok since it's just an assist...
>>>
>>
>>     My point was a very narrow comment on optimization in very high
>>     performance situations. I probably shouldn't have mentioned it.  If
>>     you ever have performance issues with your degree tables, that
>>     would be the time to discuss it.  You may never encounter this issue.
>>
>>>     Thanks,
>>>
>>>     Arshak
>>>
>>>
>>>     On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL
>>>     <kepner@ll.mit.edu> wrote:
>>>
>>>         Hi Arshak,
>>>           Here is how you might do it.  We implement everything with
>>>         batch writers and batch scanners.  Note: if you are doing high
>>>         ingest rates, the degree table can be tricky and usually
>>>         requires pre-summing prior to ingestion to reduce the pressure
>>>         on the accumulator inside of Accumulo (a sketch of what that
>>>         might look like follows the tables below).  Feel free to ask
>>>         further questions, as I would imagine there are details that
>>>         still wouldn't be clear, in particular why we do it this way.
>>>
>>>         Regards.  -Jeremy
>>>
>>>         Original data:
>>>
>>>         Machine,Pool,Load,ReadingTimestamp
>>>         neptune,west,5,1388191975000
>>>         neptune,west,9,1388191975010
>>>         pluto,east,13,1388191975090
>>>
>>>
>>>         Tedge table:
>>>         rowKey,columnQualifier,value
>>>
>>>         0005791918831-neptune,Machine|neptune,1
>>>         0005791918831-neptune,Pool|west,1
>>>         0005791918831-neptune,Load|5,1
>>>         0005791918831-neptune,ReadingTimestamp|1388191975000,1
>>>         0105791918831-neptune,Machine|neptune,1
>>>         0105791918831-neptune,Pool|west,1
>>>         0105791918831-neptune,Load|9,1
>>>         0105791918831-neptune,ReadingTimestamp|1388191975010,1
>>>         0905791918831-pluto,Machine|pluto,1
>>>         0905791918831-pluto,Pool|east,1
>>>         0905791918831-pluto,Load|13,1
>>>         0905791918831-pluto,ReadingTimestamp|1388191975090,1
>>>
>>>
>>>         TedgeTranspose table:
>>>         rowKey,columnQualifier,value
>>>
>>>         Machine|neptune,0005791918831-neptune,1
>>>         Pool|west,0005791918831-neptune,1
>>>         Load|5,0005791918831-neptune,1
>>>         ReadingTimestamp|1388191975000,0005791918831-neptune,1
>>>         Machine|neptune,0105791918831-neptune,1
>>>         Pool|west,0105791918831-neptune,1
>>>         Load|9,0105791918831-neptune,1
>>>         ReadingTimestamp|1388191975010,0105791918831-neptune,1
>>>         Machine|pluto,0905791918831-pluto,1
>>>         Pool|east,0905791918831-pluto,1
>>>         Load|13,0905791918831-pluto,1
>>>         ReadingTimestamp|1388191975090,0905791918831-pluto,1
>>>
>>>
>>>         TedgeDegree table:
>>>         rowKey,columnQualifier,value
>>>
>>>         Machine|neptune,Degree,2
>>>         Pool|west,Degree,2
>>>         Load|5,Degree,1
>>>         ReadingTimestamp|1388191975000,Degree,1
>>>         Load|9,Degree,1
>>>         ReadingTimestamp|1388191975010,Degree,1
>>>         Machine|pluto,Degree,1
>>>         Pool|east,Degree,1
>>>         Load|13,Degree,1
>>>         ReadingTimestamp|1388191975090,Degree,1
>>>
>>>
>>>         TedgeText table:
>>>         rowKey,columnQualifier,value
>>>
>>>         0005791918831-neptune,Text,< ... raw text of original log ...>
>>>         0105791918831-neptune,Text,< ... raw text of original log ...>
>>>         0905791918831-pluto,Text,< ... raw text of original log ...>
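>>>
>>>         As a rough illustration of the pre-summing I mentioned above
>>>         (table names from this example; the combiner setup and helper
>>>         method are assumptions, not something you have to copy):
>>>
>>>         import java.nio.charset.StandardCharsets;
>>>         import java.util.HashMap;
>>>         import java.util.Map;
>>>         import org.apache.accumulo.core.client.BatchWriter;
>>>         import org.apache.accumulo.core.client.BatchWriterConfig;
>>>         import org.apache.accumulo.core.client.Connector;
>>>         import org.apache.accumulo.core.data.Mutation;
>>>         import org.apache.accumulo.core.data.Value;
>>>         import org.apache.hadoop.io.Text;
>>>
>>>         public class DegreeIngestSketch {
>>>           // One-time setup (assumed): attach a SummingCombiner to
>>>           // TedgeDegree so partial sums from separate batches add up, e.g.
>>>           //   IteratorSetting is = new IteratorSetting(10, SummingCombiner.class);
>>>           //   SummingCombiner.setEncodingType(is, LongCombiner.Type.STRING);
>>>           //   Combiner.setCombineAllColumns(is, true);
>>>           //   conn.tableOperations().attachIterator("TedgeDegree", is);
>>>
>>>           // Pre-sum the degree contributions of one ingest batch in
>>>           // memory, then write a single mutation per column key
>>>           // (e.g. "Machine|neptune") instead of one per record.
>>>           static void writeDegrees(Connector conn, Iterable<String> colKeys)
>>>               throws Exception {
>>>             Map<String,Long> partial = new HashMap<>();
>>>             for (String k : colKeys) {
>>>               partial.merge(k, 1L, Long::sum);
>>>             }
>>>             BatchWriter bw = conn.createBatchWriter("TedgeDegree",
>>>                 new BatchWriterConfig());
>>>             try {
>>>               for (Map.Entry<String,Long> e : partial.entrySet()) {
>>>                 Mutation m = new Mutation(e.getKey());
>>>                 m.put(new Text(""), new Text("Degree"),
>>>                     new Value(e.getValue().toString()
>>>                         .getBytes(StandardCharsets.UTF_8)));
>>>                 bw.addMutation(m);
>>>               }
>>>             } finally {
>>>               bw.close();
>>>             }
>>>           }
>>>         }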
>>>
>>>         On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan
>>>         <arshakn@gmail.com> wrote:
>>>
>>>         > Jeremy,
>>>         >
>>>         > Wow, didn't expect to get help from the author :)
>>>         >
>>>         > How about something simple like this:
>>>         >
>>>         > Machine   Pool   Load   ReadingTimestamp
>>>         > neptune   west   5      1388191975000
>>>         > neptune   west   9      1388191975010
>>>         > pluto     east   13     1388191975090
>>>         >
>>>         > These are the areas I am unclear on:
>>>         >
>>>         > 1.  Should the transpose table be built as part of the
>>>         > ingest code or as an Accumulo combiner?
>>>         > 2.  What does the degree table do in this example?  The
>>>         > paper mentions it's useful for query optimization.  How?
>>>         > 3.  Does D4M accommodate "repurposing" the row_id as a
>>>         > partition key?  The wikisearch shows how the partition id is
>>>         > important for parallel scans of the index.  But since Accumulo
>>>         > is a row store, how can you do fast lookups by row if you've
>>>         > used the row_id as a partition key?
>>>         >
>>>         > Thank you,
>>>         >
>>>         > Arshak
>>>         >
>>>         >
>>>         >
>>>         >
>>>         >
>>>         >
>>>         > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner
>>>         > <kepner@ll.mit.edu> wrote:
>>>         > Hi Arshak,
>>>         >   Maybe you can send a few (~3) records of data that you are
>>>         >   familiar with and we can walk you through how the D4M schema
>>>         >   would be applied to those records.
>>>         >
>>>         > Regards.  -Jeremy
>>>         >
>>>         > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan
>>>         > wrote:
>>>         > >    Hello,
>>>         > >    I am trying to get my head around Accumulo schema
>>>         > >    designs.  I went through a lot of trouble to get the
>>>         > >    wikisearch example running, but since the data is in
>>>         > >    protobuf lists, it's not that illustrative (for a
>>>         > >    newbie).
>>>         > >    Would love to find another example that is a little
>>>         > >    simpler to understand.  In particular I am interested in
>>>         > >    java/scala code that mimics the D4M schema design (not a
>>>         > >    Matlab guy).
>>>         > >    Thanks,
>>>         > >    Arshak
>>>         >
>>>
>>>
>>>
>>
>>
