incubator-couchdb-user mailing list archives

From David Mitchell <monch1...@gmail.com>
Subject Uploading CSV data to Couchapp
Date Mon, 04 Apr 2011 07:58:23 GMT
Hello all,

I'm just about to start on my first (wildly ambitious) Couchapp.  I've had
quite a bit of Erlang experience, but not for the past couple of years so
I'm a bit rusty.  I've had a tiny bit of experience with CouchDB via various
Python scripts, but that's all been treating CouchDB as a "black box"
database so I've currently got little knowledge of what it can do beyond
being a document datastore.

Initially, I'm trying to understand my options for uploading CSV files,
parsing out the content and storing them in CouchDB (one CouchDB record per
line of CSV content).  While this would be reasonably straightforward if I
were using e.g. Python as a batch load tool, I don't want to go outside
Javascript for this project if I can avoid it.  The CSV files range from 1k
to 30k records, each with 8-10 fields that are straightforward timestamps
and floating point numbers.

For an old-school Web app with distinct database and app server layers,
there's a straightforward option - upload the data to a file on the web
server, then process the data out of the file and load it into your
database.  Sure, there are variations on this approach such as saving data as a
database blob, but I'm looking for the best CouchApp-specific approach if
one exists.

Options I can see:
- upload the data & save it into a single "uploaded_csv" document in
CouchDB.  Within CouchDB, detect the presence of a new "uploaded_csv"
document, extract and process the content using Javascript and save it into
multiple "data" records, with appropriate indexing, then dispose of the
"uploaded_csv" document or mark it as "processed".  This seems reasonably
straightforward, but I'm not sure how to detect the presence of a new
"uploaded_csv" document (is there a cron equivalent in Couch?) and I'd have
to track the progress of processing each uploaded CSV file to detect when
it has been processed into "data" records
- upload the data & save it into a single "uploaded_csv" document in
CouchDB.  Have CouchDB running embedded in an Erlang app, and use Erlang to
read the "uploaded_csv" data, then send a series of e.g. HTTP PUTs to load
the data into multiple "data" records in CouchDB.  This just seems ugly to
me, but I'm pretty confident I could get it working fairly easily
- upload the data and process it directly into "data" records from a web
page served from CouchApp.  This seems like it could impact on scalability
due to having long-running connections between client and server, but at
least a user would know when their data has been uploaded and processed
successfully with trivial extra work on my part
- upload the data, convert it to JSON on the client using clientside
Javascript, then send it as a set of document uploads (i.e. one document per
CSV record) from the client to the Couch server.  This would let me parse
out any bogus CSV content without sending it to the server, but I'll have
users running browsers on mobile devices and I'm not sure I want to put that
processing load onto the client
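
To make the last option concrete, here is a minimal sketch of the client-side step, under several assumptions: the function names csvToDocs and toBatches are hypothetical, the "type" field and the field names passed in are placeholders, and the CSV is assumed to be plain comma-separated values with no quoted fields.  Values are left as strings rather than converted to timestamps or floats.  Each batch has the {"docs": [...]} shape that CouchDB's _bulk_docs endpoint accepts, so 30k records could go up in a handful of requests rather than one PUT per record.

```javascript
// Hypothetical helpers for illustration only; names and document layout
// are assumptions, not anything prescribed by CouchDB or CouchApp.

// Turn CSV text into one document per non-empty line.
// Assumes simple comma-separated values (no quoting, no embedded commas).
// Lines with the wrong number of fields are collected in `rejected`
// instead of being sent to the server.
function csvToDocs(csvText, fieldNames) {
  const docs = [];
  const rejected = [];
  csvText.split(/\r?\n/).forEach(function (line, index) {
    if (line.trim() === "") return; // skip blank lines
    const cells = line.split(",").map(function (cell) {
      return cell.trim();
    });
    if (cells.length !== fieldNames.length) {
      rejected.push({ line: index + 1, reason: "wrong field count" });
      return;
    }
    const doc = { type: "data" }; // placeholder document type
    fieldNames.forEach(function (name, i) {
      doc[name] = cells[i]; // values kept as strings
    });
    docs.push(doc);
  });
  return { docs: docs, rejected: rejected };
}

// Split the documents into request bodies of at most `batchSize` docs,
// each shaped as {"docs": [...]} for a POST to /<database>/_bulk_docs.
function toBatches(docs, batchSize) {
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push({ docs: docs.slice(i, i + batchSize) });
  }
  return batches;
}
```

Each element of toBatches(result.docs, 1000) would then be serialised with JSON.stringify and POSTed to the database's _bulk_docs URL; the batch size of 1000 is an arbitrary starting point, not a measured value.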

Are there any "recommended" approaches for this type of task?  I suspect
this question and others I'll ask have probably already been considered and
dealt with by various experts; if there's a "CouchApp cookbook" with
recommended solutions for these and other common situations, I'd appreciate
a pointer to it so I could start to answer my own questions.

Thanks in advance

Dave M.
