incubator-couchdb-user mailing list archives

From Dave Cottlehuber <d...@muse.net.nz>
Subject Re: Problems with CouchDB 1.2.0 views on large documents JSONs
Date Mon, 04 Jun 2012 19:48:33 GMT
On 4 June 2012 21:03, Francesco Furiani <fra.furiani@stud.uniroma3.it> wrote:
> Hi,
>
> i run a couchdb server (v1.2.0) over a mac (intel architecture, 8gb of ram,
> os x version 10.6.8) installed with brew.
>
> The server itself is used as a storage of big jsons (example:
> https://raw.github.com/cvdlab-bio/webpdb/develop/docs/jsons/2LGB-pretty-print.json
> ) for a tiny uni project.
>
> When we load more than 3 of these jsons, all the map functions (which we
> created to retrieve documents besides a simple get by id) stop working.
> A typical map is:
>
> function(doc){if(doc.TITLE.title.match('.*INSULIN.*') !== null) emit(doc.ID,
> doc);}
>
> but even a
>
> function(doc){emit(doc.ID, doc.ID)}
>
> ceases to work.
>
> while when there are just 3 or 2 jsons in the database they work just fine.
> I tried increasing the stack for couchjs (1gb now, going over 1gb doesn't
> work it seems), increasing limits for files (4096), increasing timeout for
> processes but in the end i don't get any results and only a (Error:
> os_process_error {exit_status,0}) from the db.
>
> Is the json we provide too big for couch? Do we need to redesign the map to
> remove parts of the json? Is this a known bug (i haven't found anything on
> the net)?
>
> Any clue that might help me?
>
> Thanks for the help,
> Francesco
>

Hi Francesco,

CouchDB stores JSON in a native Erlang format on disk. Retrieving it
(whether to process in a JS map/reduce view, or to send through to an
HTTP client) requires transforming it into JSON text format. For big
docs this can take a while, or can even break when the result is piped
into couchjs. A couple of other people have reported this type of issue
recently on the ML.

You could avoid this by using Erlang views**, or you could check whether
you see the same issue in 1.1.1, which has a different (slower) JSON
parsing tool.

Could you open a JIRA ticket for this issue please, seeing as you have
a nice sample doc to share?

Some general points:
Typically you can replace emit(doc.ID, doc) with emit(null) in your view.
You can always use ?include_docs=true in your query to return the full
documents.
The id of any doc emitted is available "for free", so you don't need
the duplication.
This will make your view smaller by orders of magnitude.
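
A minimal sketch of the points above, not a tested fix for the couchjs
crash itself. Assumptions: the emit() and rows definitions are only a
stand-in so the snippet runs outside couchjs, the sample docs are made
up, and keeping doc.ID as the key is a guess at what you want to look
up by (emit(null) would do if you only need the doc ids). The guard on
doc.TITLE is also my addition, so that a doc without that field does
not throw.

```javascript
// Stand-in for CouchDB's emit(), only so this sketch runs outside couchjs.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value === undefined ? null : value });
}

// The map function as it might go into the design document.
// It emits a small key and no value instead of the whole doc.
var map = function (doc) {
  // Guard: skip docs that have no TITLE.title string.
  if (doc.TITLE && typeof doc.TITLE.title === 'string' &&
      doc.TITLE.title.indexOf('INSULIN') !== -1) {
    emit(doc.ID, null);
  }
};

// Hypothetical sample docs, far smaller than the real ones.
var sampleDocs = [
  { _id: 'a', ID: '2LGB', TITLE: { title: 'STRUCTURE OF INSULIN ANALOG' } },
  { _id: 'b', ID: '1ABC', TITLE: { title: 'SOME OTHER PROTEIN' } },
  { _id: 'c', ID: '9XYZ' } // no TITLE at all
];
sampleDocs.forEach(map);
```

Querying such a view with ?include_docs=true appended to the view URL
then returns the full document next to each row, so nothing is lost by
leaving it out of the index.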

** Erlang views run inside the Erlang VM without a sandbox, so rm -rf
and worse are all possible. But it's likely faster, has fewer
limitations per the above issue, and comes with less documentation too.
YMMV, don't forget to wear a seatbelt, and never _ever_ run with
scissors.

A+
Dave
