accumulo-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Fuchs <afu...@apache.org>
Subject Re: What is the Communication and Time Complexity for Bulk Inserts?
Date Wed, 24 Oct 2012 21:50:12 GMT
For the bulk load of one file, shouldn't it be roughly O(log(n) * log(P) *
p), where n is the size of the file, P is the total number of tablets
(proportional to tablet servers), and p is the number of tablets that get
assigned that file?

For the BatchWriter case, there's a client-side lookup/binning that takes
O(log(p)) per entry, so the latter would be O(n/p * (log(n/p) + log(p)))
for each of p partitions. So, O(n*log(n)) in aggregate. Yes/no?

Adam


On Wed, Oct 24, 2012 at 3:57 PM, Jeff Kubina <jeff.kubina@gmail.com> wrote:

> @eric, assuming the records are evenly distributed and network bandwidth
> is not an issue, shouldn't that be O(n/p)+O(p) and O(n/p * log (n/p))?
>
>
> On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <eric.newton@gmail.com>wrote:
>
>> Adding a sorted file to accumulo (bulk loading) is essentially
>> constant in the normal case.  It is O(n) + O(p) for the worst case
>> where the index must be read, and the file assigned to every tablet
>> server.  In this case, the (slow) RPCs will dominate over the (fast)
>> read of the index, except for very small clusters or very large
>> indexes.
>>
>> Inserting with the BatchWriter is eventually dominated by compactions,
>> which is a merge sort, or O(n log n).
>>
>> -Eric
>>
>> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <jeff.kubina@gmail.com>
>> wrote:
>> > BatchWriter, but I would be interested in the answer assuming a
>> > pre-sorted rfile.
>> >
>> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <josh.elser@gmail.com>
>> wrote:
>> >> Are you referring to "bulk inserts" as importing a pre-sorted rfile of
>> >> Key/Values or usinga BatchWriter?
>> >>
>> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
>> >>>
>> >>> I am deriving the time complexities for an algorithm I implemented in
>> >>> Hadoop using Accumulo and need to know the time complexity of bulk
>> >>> inserting m records evenly distributed across p nodes into an empty
>> >>> table with p tablet servers. Assuming B is the bandwidth of the
>> >>> network, would the communication complexity be O(m/B) and the
>> >>> computation complexity O(m/p * log(m/p))? If the table contained n
>> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
>>
>
>

Mime
View raw message