accumulo-user mailing list archives

From Michael Orr <michael.d....@gmail.com>
Subject Re: Optimizing Accumulo for read performance
Date Wed, 06 Nov 2013 18:57:37 GMT
Thanks for responding.


The RKEYs for nodes are N|<NodeID>, and we have CF:CQs for each edge. We
maintain the edge attributes as separate RKEYs using E|<EdgeID>.


I’m not sure what you mean by repeating the node id.


Mike


On Wed, Nov 6, 2013 at 9:58 AM, William Slacum <
wilhelm.von.cloud@accumulo.net> wrote:

> When you say schema, do you mean key schema? If so, why are you repeating
> the node id?
>
> Locality groups would help if you have larger swaths of data you wanted to
> group together and query discretely from other locality groups. For
> instance, I've seen key schemas where "in" and "out" edges are grouped
> together.
>
> At a system level, if you know some information about the distribution of
> the row values (in this case, it looks like node id and edge id), you can
> pre split the table by taking some samples out of that space. This would
> distribute the tablets around, making queries using the batch scanner
> faster by increasing the parallelism. This would also increase the number
> of input splits generated by the input format if you wanted to do batch
> processing on the entire graph.
>
> On Wed, Nov 6, 2013 at 9:19 AM, Michael Orr <michael.d.orr@gmail.com> wrote:
>
>> Hello,
>>
>> I’m working on an application that needs fast read performance. I’ve been
>> conducting some experiments starting with a single-node (pseudo-distributed)
>> cluster with the intent of scaling out. However, prior to doing so, I
>> wanted to get a good gauge for how fast a single tablet server can read.
>>
>> The application processes and stores graph data with the following schema:
>>
>> for nodes:
>> N|NodeID    ID:NodeID    EIN:EdgeID    EOUT:EdgeID    .. lots of other attributes
>>
>> there can be multiple EIN and EOUT CFs for each node
>>
>> for edges:
>> E|EdgeID    ID:NodeID    VIN:VertexID    EOUT:VertexID    .. lots of other attributes
>>
>>
>> Scans into the system can be for entire graph or a subset of nodes and
>> edges. We generally pull navigational information first, then other
>> attributes later if needed. I’ve spent some time looking into using
>> locality groups but was curious if there are recommendations on backend
>> properties that could be set to reduce read time, particularly if memory
>> and space were not a concern.
>>
>> Thanks for your help!
>>
>> Mike
>>
>
>

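The pre-splitting suggestion quoted above (sample the row space, then split the table on those samples) could be sketched roughly as follows. This is only an illustration under assumptions: `pick_split_points` is a hypothetical helper, the sample row keys are made up to follow the N|NodeID / E|EdgeID layout from the thread, and the resulting list would still have to be handed to Accumulo separately (for example through the shell's `addsplits` command or `TableOperations.addSplits`).

```python
def pick_split_points(sample_rows, num_tablets):
    """Pick up to num_tablets - 1 split rows at even quantiles of a sample.

    Hypothetical helper, not part of Accumulo. Rows are sorted as plain
    strings, which matches byte-wise row ordering for UTF-8 encoded keys.
    """
    rows = sorted(set(sample_rows))
    if num_tablets < 2 or not rows:
        return []
    splits = []
    for i in range(1, num_tablets):
        # Index of the i-th quantile boundary in the sorted sample.
        candidate = rows[i * len(rows) // num_tablets]
        # Skip duplicates, which can occur when the sample is small.
        if not splits or candidate != splits[-1]:
            splits.append(candidate)
    return splits


# Made-up sample of row keys in the N|<NodeID> / E|<EdgeID> layout.
sample = [f"N|{i:04d}" for i in range(8)] + [f"E|{i:04d}" for i in range(8)]
splits = pick_split_points(sample, 4)
print(splits)
```

A larger or more representative sample gives more even tablets; with a skewed sample, some tablets would still end up holding most of the rows.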