Subject: Re: Tracking file throughput?
From: Jan Lehnardt
Date: Fri, 3 Jun 2011 16:03:30 +0200
To: user@couchdb.apache.org
Message-Id: <4A67FD8D-BB49-4EE2-A668-EB68EDD031AC@apache.org>

Hi,

On 3 Jun 2011, at 15:43, muji wrote:

> I'm still new to couchdb and nosql, so apologies if the answer to
> this is trivial.

No worries, we're all new at something :)

> I'm trying to track the throughput of a file sent via a POST request
> in a couchdb document.
>
> My initial implementation creates a document for the file before the
> POST is sent, and then I have an update handler that increments the
> "uploadbytes" for every chunk of data received from the client.

Could you make that a little less frequent and interpolate between the
data points? Instead of tracking bytes exactly at the chunk boundaries,
just update every 10 MB or so, and have the UI adjust accordingly?

> This *nearly* works, except that I get document update conflicts
> (which I think is down to me not being able to throttle back the
> upload while the db is updated), but the main problem is that for
> large files (~2.4GB) the number of document revisions is around
> 40-50,000. So I have a single document taking up between 0.7GB and
> 1GB. After compaction it reduces to ~380KB, which of course is much
> better, but this still seems excessive and poses problems with
> compacting a write-heavy database. I understand the trick for that is
> to replicate, compact, and replicate back to the source; please
> correct me if I'm wrong...

Hm, no, that won't do anything; regular compaction on its own is good
enough.

> So, I don't think this approach is viable, which makes me wonder
> whether setting the _revs_limit will help, although I understand that
> setting this per database still requires compaction and will only
> save space after compaction.

_revs_limit won't help; you will always need to compact to get rid of
the old revision data.
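For illustration (untested, and assuming a local CouchDB at
localhost:5984 with a database called "uploads"; adjust the names to
taste), lowering _revs_limit and then compacting would look roughly
like this in Python with the requests library:

    import requests

    BASE = "http://localhost:5984"   # assumed local CouchDB
    DB = "uploads"                   # placeholder database name

    # Lower the number of revisions kept per document. This alone does
    # not shrink the file on disk; compaction is what reclaims space.
    requests.put(f"{BASE}/{DB}/_revs_limit", data="10")

    # Kick off compaction; it runs in the background on the server.
    requests.post(f"{BASE}/{DB}/_compact",
                  headers={"Content-Type": "application/json"})

    # The database info document tells you whether compaction is still
    # running and what the current on-disk size is.
    info = requests.get(f"{BASE}/{DB}").json()
    print(info.get("compact_running"), info.get("disk_size"))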
> I was thinking that tracking the throughput as chunks in individual
> documents and then calculating the throughput with a map/reduce over
> all the chunks might be a better approach. Although I'm concerned
> that having lots of little documents for each data chunk will also
> take up large amounts of space...

Yeah, that wouldn't save any space here. That said, I wouldn't call
the numbers you quote "large amounts".

> Any advice and guidance on the best way to tackle this would be much
> appreciated.

I'd either set up continuous compaction (restart compaction right when
it is done) to keep the DB size at a minimum, or use an in-memory
store to keep track of the uploaded bytes; there's a rough sketch of
the compaction loop below. Ideally, though, CouchDB would give you an
endpoint to query that kind of data.

Cheers
Jan
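P.S. Untested, but roughly what I mean by "restart compaction right
when it is done", again in Python with requests and assuming a local
CouchDB at localhost:5984 with a database called "uploads" (both
placeholders):

    import time
    import requests

    BASE = "http://localhost:5984"   # assumed local CouchDB
    DB = "uploads"                   # placeholder database name

    def compact_continuously(pause=5):
        """Keep compaction running on a write-heavy database."""
        while True:
            info = requests.get(f"{BASE}/{DB}").json()
            if not info.get("compact_running"):
                # The previous run has finished (or none has started
                # yet), so kick off another one.
                requests.post(f"{BASE}/{DB}/_compact",
                              headers={"Content-Type": "application/json"})
            time.sleep(pause)  # polling interval; tune for your workload

    if __name__ == "__main__":
        compact_continuously()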