accumulo-user mailing list archives

From Patrick Lynch <>
Subject Re: Wikisearch Performance Question
Date Tue, 21 May 2013 18:35:36 GMT

I think your description is accurate, except that I split the single archive into a much greater
number of pieces than the number of different archives I ingested. Specifically, I set numGroups
to a higher number; I didn't split the archive by hand in HDFS. The archives are bzip2-ed,
not gzip-ed. Will decompressing still have the same benefit?

-----Original Message-----
From: Josh Elser <>
To: user <>
Sent: Tue, May 21, 2013 2:16 pm
Subject: Re: Wikisearch Performance Question

Let me see if I understand what you're asking: you took one mediawiki
archive and split it into n pieces, each 1/n the size of the original.
You then took n _different_ mediawiki archives and ingested those. You
tried to get the split single archive to ingest as fast as the many
different archives?

Are you using gzip'ed input files? Have you tried just decompressing the
gzip into plaintext? Hadoop will naturally split uncompressed text and
give you nice balancing.
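
To illustrate the point about splitting, here is a rough sketch of why an uncompressed text file fans out across more map tasks than a gzip file. It is a simplified model of the split-counting loop in Hadoop's stock FileInputFormat, not the wikisearch ingest code; the function name, the 1 GiB file size, and the 128 MiB split size are assumptions chosen for illustration.

```python
# Simplified model of how many input splits (and so map tasks) one file yields.
# Assumption: mirrors the general shape of Hadoop's FileInputFormat, which lets
# the last split run up to 10% over the target size instead of creating a
# tiny trailing split.
SPLIT_SLOP = 1.1


def estimate_splits(file_bytes: int, split_bytes: int, splittable: bool) -> int:
    """Estimate the number of input splits for a single non-empty file."""
    if file_bytes <= 0 or split_bytes <= 0:
        raise ValueError("file_bytes and split_bytes must be positive")
    if not splittable:
        # A gzip stream cannot be read from the middle, so the whole file
        # goes to one mapper regardless of its size.
        return 1
    splits = 0
    remaining = file_bytes
    while remaining / split_bytes > SPLIT_SLOP:
        splits += 1
        remaining -= split_bytes
    if remaining > 0:
        splits += 1  # final, possibly slightly oversized, split
    return splits


if __name__ == "__main__":
    MIB = 1024 * 1024
    archive = 1024 * MIB  # hypothetical 1 GiB archive
    print("plain text:", estimate_splits(archive, 128 * MIB, splittable=True))
    print("gzip:", estimate_splits(archive, 128 * MIB, splittable=False))
```

Whether a particular compressed format behaves like the gzip case depends on whether the input format in use treats it as splittable, so this only models the two cases mentioned above.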

I haven't looked at the ingest code in a long time. Not sure if it ever 
received much love.

On 5/21/13 1:30 PM, Patrick Lynch wrote:
> user@accumulo,
> I was working with the Wikipedia Accumulo ingest examples, and I was
> trying to get the ingest of a single archive file to be as fast as
> ingesting multiple archives through parallelization. I increased the
> number of ways the job split the single archive so that all the servers
> could work on ingesting at the same time. What I noticed, however, was
> that having all the servers work on ingesting the same file was still
> not nearly as fast as using multiple ingest files. I was wondering if I
> could have some insight into the design of the Wikipedia ingest that
> could explain this phenomenon.
> Thank you for your time,
> Patrick Lynch

