Subject: Re: Tracking file throughput?
From: muji
To: user@couchdb.apache.org
Date: Fri, 3 Jun 2011 16:00:21 +0100

Thanks again for your help, Jan. Sorry, I thought that continuous
compaction might be a feature I had overlooked. I have no problems
automating a compaction process; I always envisaged needing to do that...

I think that I will revert to running far fewer updates on the couchdb
document and caching the throughput in Redis, as disc space is more of a
priority than application complexity.

A few more (different) questions in the pipeline as I'm still learning
couch ;)
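For illustration, a rough sketch of the Redis approach described above,
in Python, using the redis and requests client libraries: count bytes in
Redis as each chunk arrives and only write the running total into the
CouchDB file document every ~10MB (the granularity Jan suggests below).
The database URL, document id and key names are placeholders, not taken
from the real application.

    import redis     # pip install redis
    import requests  # pip install requests

    COUCH_DB = "http://127.0.0.1:5984/uploads"  # placeholder database URL
    FLUSH_EVERY = 10 * 1024 * 1024              # flush roughly every 10MB

    r = redis.Redis(decode_responses=True)

    def on_chunk(doc_id, nbytes):
        # called once per chunk received from the client
        total = r.incrby("uploadbytes:%s" % doc_id, nbytes)
        flushed = int(r.get("flushed:%s" % doc_id) or 0)
        if total - flushed >= FLUSH_EVERY:
            flush(doc_id, total)

    def flush(doc_id, total):
        # one new document revision per ~10MB instead of one per chunk
        url = "%s/%s" % (COUCH_DB, doc_id)
        doc = requests.get(url).json()
        doc["uploadbytes"] = total
        if requests.put(url, json=doc).ok:
            r.set("flushed:%s" % doc_id, total)
        # a 409 conflict just means another writer got in first; the
        # next flush interval will catch the total up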
On Fri, Jun 3, 2011 at 3:37 PM, Jan Lehnardt wrote:
>
> On 3 Jun 2011, at 16:28, muji wrote:
>
>> Thanks very much for the help.
>>
>> I could of course reduce the number of times the update is done, but
>> the service plans to bill based on throughput, so this is quite
>> critical from a billing perspective.
>
> You can still bill on throughput, as you will know exactly how much
> data has been transferred in what amount of time, but reporting is
> going to be less granular, i.e. chunks of say 10MB and not 100KB or
> however big the chunks are.
>
>> A quick search for continuous compaction didn't yield anything, and I
>> don't see anything here:
>>
>> http://wiki.apache.org/couchdb/Compaction
>>
>> Could you point me in the right direction please?
>
> I made it up, and I explained how to do it. Pseudocode:
>
> while(`curl http://127.0.0.1:5984/db/_compact`);
>
>> Funny you mention caching before updating couch, that was my very
>> first implementation! I was updating Redis with the throughput and
>> then updating the file document once the upload completed. That
>> worked very well, but I wanted to remove Redis from the stack as the
>> application is already pretty complex.
>>
>> I'm guessing my best option is to revert back to that technique?
>
> It depends on what your goals are. The initial design you mentioned
> seems fine to me if you compact often. If you are optimising for disk
> space, Redis or memcached may be a good idea. If you are optimising
> for a small stack, not having Redis or memcached is a good idea.
>
>> As an aside, why would my document update handler be raising
>> conflicts? My understanding was that update handlers would not raise
>> conflicts - is that correct?
>
> That is not correct.
>
> Cheers
> Jan
> --
>
>> Thanks!
>>
>> On Fri, Jun 3, 2011 at 3:03 PM, Jan Lehnardt wrote:
>>> Hi,
>>>
>>> On 3 Jun 2011, at 15:43, muji wrote:
>>>> I'm still new to couchdb and nosql, so apologies if the answer to
>>>> this is trivial.
>>>
>>> No worries, we're all new at something :)
>>>
>>>> I'm trying to track the throughput of a file sent via a POST
>>>> request in a couchdb document.
>>>>
>>>> My initial implementation creates a document for the file before
>>>> the POST is sent, and then I have an update handler that increments
>>>> "uploadbytes" for every chunk of data received from the client.
>>>
>>> Could you make that a little less frequent and interpolate between
>>> the data points? Instead of tracking bytes exactly at the chunk
>>> boundaries, just update every 10 or so MB? And have the UI adjust
>>> accordingly?
>>>
>>>> This *nearly* works, except that I get document update conflicts
>>>> (which I think is due to me not being able to throttle back the
>>>> upload while the db is updated), but the main problem is that for
>>>> large files (~2.4GB) the number of document revisions is around
>>>> 40-50,000. So I have a single document taking up between 0.7GB and
>>>> 1GB. After compaction it reduces to ~380KB, which of course is much
>>>> better, but this still seems excessive and poses problems with
>>>> compacting a write-heavy database. I understand the trick for that
>>>> is to replicate, compact and replicate back to the source; please
>>>> correct me if I'm wrong...
>>>
>>> Hm, no, that won't do anything; just regular compaction is good
>>> enough.
>>>
>>>> So I don't think this approach is viable, which makes me wonder
>>>> whether setting _revs_limit will help, although I understand that
>>>> setting this per database still requires compaction and only saves
>>>> space after compaction.
>>>
>>> _revs_limit won't help; you will always need to compact to get rid
>>> of data.
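A note on the compaction pseudocode quoted above: /_compact is triggered
with a POST and returns immediately while compaction runs in the
background, so a bare loop around curl would just keep re-issuing the
request. A rough, runnable sketch of the "restart compaction as soon as
it finishes" idea (the database URL is a placeholder) might look like
this:

    import time

    import requests  # pip install requests

    DB = "http://127.0.0.1:5984/db"  # placeholder database URL

    while True:
        # kick off a compaction pass (the request returns straight away)
        requests.post(DB + "/_compact",
                      headers={"Content-Type": "application/json"})
        # poll the db info until this pass has finished, then go again
        while requests.get(DB).json().get("compact_running"):
            time.sleep(5)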
>>>
>>>> I was thinking that tracking the throughput as chunks in individual
>>>> documents and then calculating the throughput with a map/reduce on
>>>> all the chunks might be a better approach, although I'm concerned
>>>> that having lots of little documents for each data chunk will also
>>>> take up large amounts of space...
>>>
>>> Yeah, that wouldn't save any space here. That said, I wouldn't call
>>> the numbers you quote "large amounts".
>>>
>>>> Any advice and guidance on the best way to tackle this would be
>>>> much appreciated.
>>>
>>> I'd either set up continuous compaction (restart compaction right
>>> when it is done) to keep the DB size at a minimum, or use an
>>> in-memory store to keep track of the uploaded bytes.
>>>
>>> Ideally though, CouchDB would give you an endpoint to query that
>>> kind of data.
>>>
>>> Cheers
>>> Jan
>>> --
>>
>> --
>> muji.
>

--
muji.
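For readers new to CouchDB, the document update handler discussed in this
thread might look roughly like the sketch below: a JavaScript function in
a design document that adds a per-chunk byte count to "uploadbytes", here
installed and called from Python. The design document and handler names
are invented for illustration; note that the write still goes through
normal MVCC, so concurrent calls against the same document can come back
as 409 conflicts, which is the behaviour questioned above.

    import requests  # pip install requests

    DB = "http://127.0.0.1:5984/uploads"  # placeholder database URL

    DESIGN_DOC = {
        "_id": "_design/files",
        "updates": {
            "addbytes": """
            function (doc, req) {
              if (!doc) return [null, 'missing'];
              var delta = parseInt(req.query.bytes, 10) || 0;
              doc.uploadbytes = (doc.uploadbytes || 0) + delta;
              return [doc, JSON.stringify({ok: true,
                                           uploadbytes: doc.uploadbytes})];
            }
            """,
        },
    }

    def install():
        # first-time install; updating an existing design doc would need
        # its current _rev
        requests.put(DB + "/_design/files", json=DESIGN_DOC)

    def add_bytes(doc_id, nbytes):
        # one call per chunk means one new revision per chunk, which is
        # what drives the revision counts discussed in the thread
        return requests.put(
            "%s/_design/files/_update/addbytes/%s" % (DB, doc_id),
            params={"bytes": nbytes},
        )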