couchdb-user mailing list archives

From Freddy Bowen <>
Subject Re: Using CouchDB to represent the tokenized text of a book
Date Wed, 03 Nov 2010 14:24:04 GMT
CouchDB views have a feature called linked documents.

You could store each token as a doc.  Then store the order of tokens in
a separate doc.  To change the order of the tokens you'd update the "order"
doc.

Consider this position doc:
{ _id:"Genesis-1:1", type:"position",
  position:[{_id:"token987"}, {_id:"token123"}, {_id:"token456"}] }

And these token docs:
  { _id:"token123", type:"token", word:"the"},
  { _id:"token987", type:"token", word:"In"},
  { _id:"token456", type:"token", word:"beginning"}

Then a view like this:
function(doc) {
  if (doc.type=="position") {
    var token=doc.position;
    for (var i=0; i<token.length; i++) {
      // With include_docs=true, a value of the form {_id: ...} pulls in that doc.
      emit([doc._id, i], token[i]);
    }
  }
}

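Emits this when queried with include_docs=true:
{ id:"Genesis-1:1", key:["Genesis-1:1",0], value:{_id:"token987"}, doc:{_id:"token987",
type:"token", word:"In"}},
{ id:"Genesis-1:1", key:["Genesis-1:1",1], value:{_id:"token123"}, doc:{_id:"token123",
type:"token", word:"the"}},
{ id:"Genesis-1:1", key:["Genesis-1:1",2], value:{_id:"token456"}, doc:{_id:"token456",
type:"token", word:"beginning"}}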
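
Here is a rough sketch of the same idea that runs in plain Node, outside
CouchDB. The docs object, the queryWithLinkedDocs helper, and the {_id: ...}
shape of the position array are assumptions made for illustration; they are
not CouchDB API. The helper only imitates what include_docs=true does for
linked documents.

```javascript
// Hypothetical in-memory stand-in for the database; ids and words match the docs above.
const docs = {
  "Genesis-1:1": {
    _id: "Genesis-1:1",
    type: "position",
    // Assumed shape: one {_id: ...} reference per token, in reading order.
    position: [{ _id: "token987" }, { _id: "token123" }, { _id: "token456" }],
  },
  token123: { _id: "token123", type: "token", word: "the" },
  token987: { _id: "token987", type: "token", word: "In" },
  token456: { _id: "token456", type: "token", word: "beginning" },
};

// The map function from the view, with emit passed in so it can run outside CouchDB.
function map(doc, emit) {
  if (doc.type == "position") {
    const token = doc.position;
    for (let i = 0; i < token.length; i++) {
      emit([doc._id, i], token[i]);
    }
  }
}

// Rough imitation of include_docs=true: when the emitted value carries an _id,
// the document with that id is attached to the row as "doc".
function queryWithLinkedDocs(db) {
  const rows = [];
  for (const doc of Object.values(db)) {
    map(doc, (key, value) => {
      rows.push({ id: doc._id, key, value, doc: db[value._id] });
    });
  }
  // There is only one position doc here, so sorting on the index is enough.
  return rows.sort((a, b) => a.key[1] - b.key[1]);
}

const rows = queryWithLinkedDocs(docs);
const words = rows.map((row) => row.doc.word);
console.log(words.join(" ")); // prints "In the beginning"
```

Reordering the words then means rewriting only the position array; the token
docs themselves stay untouched.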

Maybe you can make an approach like this work for you?


On Wed, Nov 3, 2010 at 9:16 AM, Dirkjan Ochtman <> wrote:

> On Wed, Nov 3, 2010 at 14:04, Weston Ruter <> wrote:
> > That is a good idea, but the problem with Bible translations in particular
> > is the issue of overlapping hierarchies: like chapter and verse don't always
> > fall along same divisions as section and paragraph. So the data model I've
> > been moving toward is standoff markup, where there is a set of tokens
> > (words, punctuation) for the entire book and then a set of structures
> > (paragraphs, verses, etc) that refer to the start token and end token, so
> > when getting a structure it needs to retrieve all tokens from start to end.
> > The use of standoff markup and overlapping hierarchies makes your idea of
> > using sorting buckets not feasible, I don't think. Thanks for the idea
> > though!
> Not sure I agree. My "buckets" are somewhat arbitrary and don't
> actually have to be mapped to any real structure. The trick is just
> that by prefixing with a bucket index, you don't have to update all
> tokens anymore, you only have to update tokens inside the bucket (or
> the next bucket if you happened to be moving a token to the next
> bucket). Your standoff thing (I'm not really used to that term, so no
> clue if I'm using it correctly) would still work, only you now
> reference tokens by bucket and token index, not just token index.
> Cheers,
> Dirkjan
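
For comparison, the bucket idea quoted above might look roughly like this.
It is a minimal sketch, assuming tokens are keyed by [bucket, index] and held
in a plain array; the sample tokens, the insertToken and tokensIn helpers, and
the verseStart structure are all hypothetical names, not anything from the
thread or from CouchDB.

```javascript
// Hypothetical token list: each token is addressed by [bucket, index]
// instead of a single global index.
const tokens = [
  { key: [0, 0], word: "In" },
  { key: [0, 1], word: "the" },
  { key: [0, 2], word: "beginning" },
  { key: [1, 0], word: "God" },
  { key: [1, 1], word: "created" },
];

// Compare two [bucket, index] keys, bucket first.
function compareKeys(a, b) {
  return a[0] - b[0] || a[1] - b[1];
}

// Insert a word; only tokens in the same bucket at or after the index are renumbered.
// Returns how many existing tokens had to be updated.
function insertToken(list, bucket, index, word) {
  let touched = 0;
  for (const t of list) {
    if (t.key[0] === bucket && t.key[1] >= index) {
      t.key = [bucket, t.key[1] + 1];
      touched++;
    }
  }
  list.push({ key: [bucket, index], word });
  list.sort((a, b) => compareKeys(a.key, b.key));
  return touched;
}

// A standoff structure points at its start and end tokens by [bucket, index].
function tokensIn(list, structure) {
  return list
    .filter(
      (t) =>
        compareKeys(t.key, structure.start) >= 0 &&
        compareKeys(t.key, structure.end) <= 0
    )
    .map((t) => t.word);
}

const verseStart = { type: "phrase", start: [0, 0], end: [0, 2] };
const touched = insertToken(tokens, 1, 0, "when");
console.log(touched); // prints 2: only the bucket 1 tokens were renumbered
console.log(tokensIn(tokens, verseStart).join(" ")); // prints "In the beginning"
```

Structures whose start or end falls inside the edited bucket would presumably
need their references adjusted as well; the sketch leaves that out.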
