incubator-couchdb-user mailing list archives

From Tom Sante <tom.sa...@gmail.com>
Subject Re: couchdb for genome data
Date Thu, 04 Mar 2010 09:46:38 GMT
There is going to be some partitioning of the data, and using a faster
view server might help too. The only issue I have left is that I can't
use too many views, because storing the generated views for that many
documents will take lots of disk space. Then again, disks are cheap
and easy to add, an acceptable trade-off for fast queries. And if I
use a key like doc.type + doc.experiment + doc.genome_position, then
that could also limit the need for more than one view.
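
As a rough sketch of the composite-key idea above, a single map function could emit an array key, so that one view answers queries at several levels (all probes, all probes of one experiment, a position range within one experiment). This is only an illustration under assumptions: genome_position is not a field of the sample document in this thread, the sample values are made up, and the emit() defined here is a local stand-in for the one the CouchDB view server provides.

```javascript
// Rows collected by the stand-in emit() below. Inside CouchDB, emit()
// is provided by the view server and must not be defined by the view.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map function as it might appear in a design document.
// genome_position is a hypothetical field, not in the sample document.
function mapProbe(doc) {
  if (doc.type === 'probe') {
    emit([doc.type, doc.experiment_id, doc.genome_position], doc.raw_value);
  }
}

// Made-up documents to exercise the map function locally.
const docs = [
  { type: 'probe', probe_id: 1, experiment_id: 42, genome_position: 1500, raw_value: 0.43524 },
  { type: 'probe', probe_id: 2, experiment_id: 43, genome_position: 900, raw_value: 0.12 },
  { type: 'experiment', experiment_id: 42 }
];
docs.forEach(mapProbe);

// Key range covering every probe of experiment 42. An empty object
// sorts after numbers and strings in CouchDB's view collation, so it
// works as an upper bound for the last key component.
const query =
  'startkey=' + encodeURIComponent(JSON.stringify(['probe', 42])) +
  '&endkey=' + encodeURIComponent(JSON.stringify(['probe', 42, {}]));
```

With an array key like this, the same index serves the per-experiment queries, at the cost of storing the three key components for every probe.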

Tom

On 4 Mar 2010, at 10:33, km <srikrishnamohan@gmail.com> wrote:

> Hi,
>
> No, it would be fast.
> All the documents are indexed according to the views in the database.
> Temporary views have to search each and every document in the
> database, but permanent (saved) views only have to do that the first
> time. That first time, CouchDB goes through all the docs and indexes
> them according to the view.
> Once indexed, accessing the same view will instantly retrieve results.
> (This first-time indexing would take a while if your database has
> billions of docs; you could probably also partition them into
> different databases according to category.)
>
> It also updates the view indexes automatically when documents are
> added or removed, without any change to the views.
>
> It's like having static views with dynamic data.
>
> Krishna
> ~~~ 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~`
>
> On Thu, Mar 4, 2010 at 6:08 PM, Tom Sante <tom.sante@gmail.com> wrote:
>
>> Indeed, using "type" could be the alternative to MongoDB collections.
>> But my question is: if I have billions of documents in the DB, would
>> this make view generation very slow and take up a lot of disk space,
>> just to be able to search for all probes with a certain experiment_id?
>> Like I said, the data is structured in experiments, so almost all
>> queries and changes to the data will be within an experiment, with no
>> need to act on the huge number of probes from the other experiments.
>> Thanks,
>> Tom
>>
>> On 4 Mar 2010, at 08:35, km <srikrishnamohan@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> You could have an additional key in the document identifying it as a
>>> probe, e.g. "type" (key) with the value "probe", like this:
>>>
>>> {
>>>     "type" : "probe",
>>>     "probe_id" : 1234567890,
>>>     "experiment_id" : 1234567890,
>>>     "raw_value" : 0.43524,
>>>     "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
>>> }
>>>
>>> So all your probe documents would contain a key called "type" set to
>>> "probe", and you can pick out just these documents by this key.
>>> Now when you design a view to search probe documents alone, you could
>>> use a simple filter statement like this:
>>> if (doc.type == 'probe') { do something ... }
>>> This will only index probe-type documents.
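
To make the filter statement above a bit more concrete, here is a minimal sketch of how it might sit inside a saved (permanent) view. The design document name "probes", the view name "by_experiment" and the database name "genome" are made up for this example; emit() is supplied by CouchDB's JavaScript view server and is never called in this snippet.

```javascript
// Map function with the type filter. It indexes probe documents by
// experiment_id and skips every other kind of document.
const mapByExperiment = function (doc) {
  if (doc.type === 'probe') {
    emit(doc.experiment_id, null);
  }
};

// Design document holding the saved view. CouchDB stores the map
// function as a string of JavaScript source.
const designDoc = {
  _id: '_design/probes',
  language: 'javascript',
  views: {
    by_experiment: { map: mapByExperiment.toString() }
  }
};

// Path for fetching all probes of one experiment from a database
// hypothetically named "genome".
const path =
  '/genome/_design/probes/_view/by_experiment?key=1234567890&include_docs=true';
```

Emitting null as the value and asking for include_docs=true keeps the index small, which matters when disk space for views is a concern.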
>>>
>>> NOTE: "type" is not a reserved key. It is a user-defined key just like
>>> any other, so you can use any other name for it!
>>>
>>> You might have other types of documents, for which the value of the
>>> type key will differ accordingly.
>>> Here there is no need to explicitly define a collection as in MongoDB.
>>> All JSON documents can be stored in a single database.
>>>
>>> HTH,
>>> Krishna
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> On Thu, Mar 4, 2010 at 7:21 AM, Tom Sante <tom.sante@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> The data is now stored in a MySQL table with about a billion (1000
>>>> million) rows.
>>>> These rows are the data of a genetic test (arrayCGH) and are built up
>>>> like this:
>>>>
>>>> Every experiment (a few thousand of them in total) contains
>>>> measurements of about 180000 genetic probes. This raw data will be
>>>> analyzed and the values run through different algorithms, so every
>>>> probe needs to store more than 1 value after the analysis is done.
>>>> The values of the different analyses are now stored in columns in
>>>> that table, making it a pain if we have to add an analysis that is
>>>> not yet part of the existing columns. This is why a schema-free,
>>>> document-based DB is probably a better fit.
>>>> The initial idea was to give each probe a separate document, and when
>>>> the original value is transformed into another value, to store this
>>>> in the same document.
>>>>
>>>> {
>>>>     "probe_id" : 1234567890,
>>>>     "experiment_id" : 1234567890,
>>>>     "raw_value" : 0.43524,
>>>>     "analysis": { "cbs" : 0.436, "CBS+GLAD" : 0.4356 }
>>>> }
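
A small sketch of what "store this in the same document" could look like: adding the result of a new analysis is just one more key under "analysis", with no schema change. addAnalysis is a hypothetical helper, not part of any CouchDB API, and "new_algorithm" and 0.44 are placeholder values.

```javascript
// Sample probe document from the message above.
const probe = {
  probe_id: 1234567890,
  experiment_id: 1234567890,
  raw_value: 0.43524,
  analysis: { cbs: 0.436, 'CBS+GLAD': 0.4356 }
};

// Return a copy of the document with one more analysis result.
// The original document object is left unchanged.
function addAnalysis(doc, name, value) {
  return { ...doc, analysis: { ...doc.analysis, [name]: value } };
}

// Placeholder analysis name and value, for illustration only.
const updated = addAnalysis(probe, 'new_algorithm', 0.44);
```

Saving the changed document back to CouchDB would also need the document's current _rev, which is left out here.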
>>>>
>>>> Once added to the database, almost all changes to the data will be
>>>> contained within an experiment.
>>>>
>>>> MongoDB has something like collections, which would be an appropriate
>>>> abstraction (~ experiment). But in CouchDB I would have to add all
>>>> these probe documents to 1 big database without collections. So if I
>>>> only make changes to probes within an experiment, this would
>>>> influence the views of all the other billions of documents in the db.
>>>> Because of the large number of documents, it would be good to know
>>>> beforehand what the performance implications of this are.
>>>>
>>>> Any suggestions are welcome.
>>>>
>>>> Tom
>>>>
>>>>
