@eric, assuming the records are evenly distributed and network bandwidth is
not an issue, shouldn't that be O(n/p) + O(p) and O(n/p * log(n/p))?
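To make the two estimates concrete, here is a toy cost model (purely illustrative, not Accumulo code) comparing the serial worst case Eric describes, O(n) + O(p), with the parallel estimate O(n/p) + O(p) under the assumption that index reads are spread evenly across p tablet servers; the function names and unit costs are made up for the sketch.

```python
# Toy cost model, in abstract "work units" (hypothetical, for intuition only).
def serial_bulk_load_cost(n, p):
    # Worst case: the whole index of n entries is read once,
    # plus one RPC to assign the file to each of p tablet servers.
    return n + p

def parallel_bulk_load_cost(n, p):
    # Even distribution: each server reads only its n/p slice of the
    # index, still plus the O(p) assignment RPCs.
    return n / p + p

n, p = 1_000_000, 100
print(serial_bulk_load_cost(n, p))    # 1000100
print(parallel_bulk_load_cost(n, p))  # 10100.0
```

For large n and modest p the n/p term shrinks dramatically, which is why the O(p) RPC term can dominate on small clusters either way.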
On Wed, Oct 24, 2012 at 2:45 PM, Eric Newton <eric.newton@gmail.com> wrote:
> Adding a sorted file to accumulo (bulk loading) is essentially
> constant in the normal case. It is O(n) + O(p) for the worst case
> where the index must be read, and the file assigned to every tablet
> server. In this case, the (slow) RPCs will dominate over the (fast)
> read of the index, except for very small clusters or very large
> indexes.
>
> Inserting with the BatchWriter is eventually dominated by compactions,
> which is a merge sort, or O(n log n).
>
> Eric
>
> On Thu, Oct 18, 2012 at 11:37 AM, Jeff Kubina <jeff.kubina@gmail.com>
> wrote:
> > BatchWriter, but I would be interested in the answer assuming a
> > presorted rfile.
> >
> > On Thu, Oct 18, 2012 at 11:20 AM, Josh Elser <josh.elser@gmail.com>
> wrote:
> >> Are you referring to "bulk inserts" as importing a presorted rfile of
> >> Key/Values or using a BatchWriter?
> >>
> >> On 10/18/12 10:49 AM, Jeff Kubina wrote:
> >>>
> >>> I am deriving the time complexities for an algorithm I implemented in
> >>> Hadoop using Accumulo and need to know the time complexity of bulk
> >>> inserting m records evenly distributed across p nodes into an empty
> >>> table with p tablet servers. Assuming B is the bandwidth of the
> >>> network, would the communication complexity be O(m/B) and the
> >>> computation complexity O(m/p * log(m/p))? If the table contained n
> >>> records would the values be O(m/B) and O(m/p * log(m/p) + n/p)?
>
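As a sketch of why compaction behaves like a merge sort (illustrative only, not Accumulo internals): a compaction takes several sorted runs of key/value pairs and merges them into one sorted file, a k-way merge costing O(n log k) comparisons for n records in k runs, and O(n log n) overall as runs are merged repeatedly over the life of the table.

```python
import heapq

# Illustrative k-way merge of presorted runs, as a stand-in for what a
# compaction does with sorted files. heapq.merge keeps a heap of k run
# heads, so each of the n output records costs O(log k) comparisons.
def compact(sorted_runs):
    return list(heapq.merge(*sorted_runs))

runs = [[(1, 'a'), (4, 'd')], [(2, 'b'), (5, 'e')], [(3, 'c')]]
print(compact(runs))  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')]
```

This is also why bulk-loading a presorted rfile can skip that cost: the sorting work was already paid for before the file reached the tablet servers.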
