From: Jan Lehnardt
To: user@couchdb.apache.org
Subject: Re: data loading
Date: Wed, 4 Feb 2009 12:25:18 +0100
In-Reply-To: <00163646d64e341714046213e58e@google.com>

On 4 Feb 2009, at
09:51, rhettg@gmail.com wrote:

> So i've got it running now at about 30 megs a minute now, which I
> think is going to work fine.
> Should take about an hour per day of data.
>
> The python process and couchdb process seem to be using about 100%
> of a single CPU.

That could be the JSON conversion.

> In terms of getting as much data in as fast as I can, how should I
> go about parallelizing this process ?
> How well does couchdb (and Erlang, I suppose) make use of multiple
> CPUs in linux ?
>
> Is it better to:
> 1. Run multiple importers against the same db
> 2. Run multiple importers against different db's and merge
> (replicate) together on the same box
> 3. Run multiple importers on different db's on different machines
> and replicate them together ?

All depends on your data and hardware. All writes to a single db get
serialized. If you have a single writer that can fill all the bandwidth
of your single disk, that's all you need. But usually that is not the
case, and adding more writers can help.

Splitting writes over multiple databases only helps if you can generate
more writes than a single disk can handle and you have multiple disks.
Replication uses bulk inserts, so the final migration step is a
bottleneck again. If you need to sustain a higher write rate, you need
to keep your data in multiple databases and merge on read.

For simple data import, try 2-N writers into the same DB. Everything
else is way too complicated :)

Cheers
Jan
--

>
> I'm going to experiment with some of these setups (if they're even
> possible, i'm total newb here) but any
> insight from the experienced would be great.
>
> Thanks,
>
> Rhett
>
> On Feb 4, 2009 12:13am, Rhett Garber wrote:
>> Oh awesome. That's much better. Getting about 15 megs a minute now.
>>
>> Rhett
>>
>> On Wed, Feb 4, 2009 at 12:07 AM, Ulises <ulises.cervino@gmail.com>
>> wrote:
>> >> >> Loading in the couchdb, i've only got 30 megs in the last hour. That
>> >> >> 30 megs has turned into 389 megs in the couchdb data file. That
>> >> >> doesn't seem like enough disk IO to cause this sort of delay.....
>> >> >> where is the time going ? network ?
>> >> >
>> >> > Are you uploading one document at a time or using bulk updates? You do
>> >> > this using update([doc1, doc2,...]) in couchdb-python.
>> >> >
>> >> > HTH,
>> >> >
>> >> > U
>> >> >
>>
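
A rough sketch of the "2-N writers into the same DB" idea, combined with the bulk update([doc1, doc2,...]) call mentioned in the thread. The names bulk_load, batches, writers and batch_size are made up for illustration, and the defaults are guesses, not tuned values. The only thing assumed about db is that it has an update(list_of_docs) method, as a couchdb-python Database does. Threads are used for brevity; if the Python side is CPU-bound on JSON conversion, separate importer processes are probably needed instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_load(db, docs, writers=4, batch_size=1000):
    """Send `docs` to `db` in bulk batches from `writers` threads.

    `db` only needs an update(list_of_docs) method (assumption: a
    couchdb-python Database). Returns the number of documents submitted.
    """
    def write(batch):
        db.update(batch)  # one bulk request per batch
        return len(batch)

    with ThreadPoolExecutor(max_workers=writers) as pool:
        # Note: pool.map consumes the whole generator up front, so all
        # batches are held in memory; bound this for very large imports.
        return sum(pool.map(write, batches(docs, batch_size)))
```

With couchdb-python this might be called as bulk_load(couchdb.Server()["mydb"], docs, writers=2), where "mydb" is a placeholder database name. Starting with two writers and adding more while throughput still improves seems the simplest way to find where the single-db write serialization becomes the limit.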