incubator-couchdb-user mailing list archives

From Kevin Coombes <kevin.r.coom...@gmail.com>
Subject overlap query
Date Fri, 30 Jul 2010 21:45:08 GMT
Hi,

I'm trying to figure out the best way to implement a query for 
"overlapping segments".

The specific use case involves (biological) genomic data, which is 
naturally represented by a triple of the form [Chromosome, Start, End].  
As a concrete example, the index [1, 123456, 135789] represents the 
segment on chromosome 1 that extends from base position 123456 through 
(and including) base position 135789.  The segments/documents in CouchDB 
came from analyzing a set of cell line DNA data to determine segments 
where the copy number changes.

A typical query against this database (from a biologist's point of view) 
would be to ask what happens to these cell lines in the region of a 
specific gene.  I can easily convert gene names to their positions in 
the human genome, so this translates to a query asking for all segments 
that overlap with the region that defines the gene.  For example, I 
might want to find all segments that overlap [1, 130000, 140000].  The 
example above should be returned as part of the results of this query.

The pseudocode for the query I have in mind is something like
    if (doc.Chromosome === query.Chromosome) {
       // both ends are inclusive, so touching segments count as overlapping
       if (doc.Start <= query.End && doc.End >= query.Start) {
          // show me this document
       }
    }
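
The overlap test above can also be written as a small standalone JavaScript function. This is only a sketch: the name `overlaps` and the shape of the `query` object are assumptions made for illustration, not anything provided by CouchDB.

```javascript
// Hypothetical helper: true when a segment document overlaps the query region.
// Both ends are inclusive, matching the [Chromosome, Start, End] convention.
function overlaps(doc, query) {
  return doc.Chromosome === query.Chromosome &&
         doc.Start <= query.End &&
         doc.End >= query.Start;
}

// The example segment and the example gene region from this message.
const segment = { Chromosome: 1, Start: 123456, End: 135789 };
const gene = { Chromosome: 1, Start: 130000, End: 140000 };
console.log(overlaps(segment, gene)); // true
```
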
The actual view at present is much simpler, basically consisting of
    if (doc.Start) {
       // otherRelevantStuff is a placeholder for the emitted value
       emit([doc.Chromosome, doc.Start, doc.End], otherRelevantStuff);
    }
with the idea being that the query parameters should be able to find the 
desired segments.

The problem I have is that I cannot see a reasonable way to use the 
startkey and endkey parameters to identify these kinds of overlaps.  Am 
I missing something, or is there a way within the CouchDB API to do what 
I want?
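
To make the difficulty concrete, here is a minimal sketch in plain JavaScript that simulates a scan over keys of the form [Chromosome, Start, End]. Nothing in it calls CouchDB; the sample rows and the function name `findOverlaps` are made up for illustration. A contiguous range over keys sorted this way can bound Chromosome and Start, but every segment that starts before the end of the gene falls inside that range regardless of where it ends, so the condition on End has to be applied as a separate filter.

```javascript
// Plain-JavaScript simulation of a key-range scan; hypothetical sample data.
// Rows are listed in the order a view keyed on [Chromosome, Start, End] would sort them.
const rows = [
  { key: [1, 100000, 120000] },  // ends before the query region
  { key: [1, 123456, 135789] },  // overlaps
  { key: [1, 136000, 150000] },  // overlaps
  { key: [1, 140001, 160000] },  // starts after the query region
  { key: [2, 123456, 135789] },  // wrong chromosome
];

function findOverlaps(rows, chrom, qStart, qEnd) {
  // Step 1: the part a key range can express: same chromosome, Start <= qEnd.
  const inRange = rows.filter(r => r.key[0] === chrom && r.key[1] <= qEnd);
  // Step 2: the part the range cannot express: End >= qStart.
  return inRange.filter(r => r.key[2] >= qStart);
}

const hits = findOverlaps(rows, 1, 130000, 140000);
console.log(hits.map(r => r.key)); // the two overlapping keys
```

With about 300,000 documents, step 1 alone could return a large share of a chromosome's segments before step 2 discards most of them, which is why this does not look like a reasonable answer by itself.
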

(One might note that the database arising from 175 cell lines contains 
about 300,000 documents, and that I expect the results of most queries 
to contain only about 175 rows (one per cell line).  This may constrain 
the kinds of tricks one can expect to do with additional views or with 
emitting more stuff.)

Thanks,
    Kevin
