couchdb-dev mailing list archives

From Norman Barker <norman.bar...@gmail.com>
Subject Re: multiview on github
Date Tue, 21 Sep 2010 15:56:21 GMT
Paul,

fantastic, thanks for the feedback, and you aren't b*tching; this is
exactly what I wanted. Comments inline.

On Tue, Sep 21, 2010 at 9:37 AM, Paul Davis <paul.joseph.davis@gmail.com> wrote:
> Norman,
>
> Sorry it's taken me so long to review this code. In its current form I
> would have to -1 adding the current implementation to trunk for a
> couple of reasons. I'm roughly +0 on the general outline of the
> algorithm for future inclusion, but I'll discuss that below.
>
> The biggest issue that jumps out is that it's unbounded in its use of
> memory. If I'm reading this code correctly, each view/spatial/fti
> query grabs its entire list of document ids and creates a record that
> stores this list. Then you create a ring of processes that copies
> these lists, possibly multiple times, and in the worst way: the larger
> the list, the more times it's copied. Then inside the ring the queries
> are being re-run for each test of whether an id is present, which
> confuses me because they could be using the lists of ids that were
> calculated during the calls to multiview:query_view_count/3. Granted,
> I could be reading this wrong, but it's a bit hard to follow in
> places. Also, at least for fti and views, you don't actually need to
> enumerate the entire result to get a row count, as they can both
> report a count efficiently. I'm not sure about spatial, but even if it
> can't yet, I would imagine it could be implemented.
>
<ncb>
Only with external is the entire list held in memory; elsewhere the
code streams the results, with at most one id in memory at any one
time.

1) How do you get a row count with a view for a startkey and endkey?
That would solve one of my problems.

2) How do you test for document id inclusion in the results of a view?
</ncb>

> And now for a list of nits about mechanics:
>
> The source code for this patch is completely unlike anything else in
> CouchDB. There are lots of differences that add up to make this alone
> a reason to prevent it from entering trunk:
>
> The file headers in source files should be removed and replaced with
> ASF license headers.
> Source files must be less than eighty columns wide.
> You've accidentally committed local_dev.ini and etc/init/couchdb.
<ncb>
Consider it done; I will fix this in the next commit.
</ncb>
> I'd like to see more tests in the futon tests.
<ncb>
ok
</ncb>
> If this ends up in trunk, it will not be able to depend on the spatial
> and fti handlers existing if they're not also in trunk. This might be
> solvable with an abstraction that can be dynamically added if they're
> present.
<ncb>
The fti and spatial code is only called if the query asks for it; I
will look into this.
</ncb>
> AFAICT, error reporting doesn't seem to exist, and it looks like
> there's a lot of new surface area for generating errors.
> The supervisor/gen_server pattern that's going on here doesn't appear
> to have a reason. As in, I can't see a reason the gen_server even
> needs to exist. Just make the multiview:query call from the HTTP
> process.
<ncb>
ok. It is really unclear in CouchDB when to use supervisors and
gen_servers. I wrote multiview as a gen_server since I thought of it as
similar to an EJB: an encapsulated unit of work that I could delegate
tasks to without hogging the HTTP process.

That said, if couch_query_ring uses gen_server delegates as you
recommend below, then that will achieve the same goal.
</ncb>

> In Erlang, the term Node generally refers to a remote VM. Using the
> variable Node in your query ring code confused me greatly until I
> realized it was just pids.
<ncb>
ok, I will change this
</ncb>
> You should generally avoid raw message passing in Erlang. Using a
> gen_server for each of the different ring members depending on query
> type would be more appropriate.
<ncb>
ok, I will mod this as well.
</ncb>
> If you are using gen_servers, you should fill out more of the
> callbacks to do meaningful things, ie, logging and/or dying on
> unexpected messages. Silently ignoring that sort of thing can lead to
> very hard to track bugs.
<ncb>
ok
</ncb>
> Module names should be prefixed with couchdb_ if they're going into
> trunk as part of couchdb.
> I'm not sure I like the generous use of pmap and friends. I understand
> that it'd ideally reduce latency, but at the cost of reducing a node's
> ability to handle concurrency. Not sure on the best solution to this
> though.
> In the couple places that have the big case statements for handling
> each type of view query, I'd transform those into functions to make
> things easier to follow.
<ncb>
Thanks, I am all for code clarity, and I appreciate the feedback.
</ncb>
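On breaking up the big case statements: a dispatch table of small,
named handler functions is one way to do it. A language-neutral sketch
in Python (the handler names here are made up for illustration, not
CouchDB APIs):

```python
# Sketch: replace one large case statement over query types with a
# dispatch table of small handler functions, one per query type.
# Handler names and signatures are hypothetical.

def handle_view(params):
    return ("view", params)

def handle_spatial(params):
    return ("spatial", params)

def handle_fti(params):
    return ("fti", params)

HANDLERS = {
    "view": handle_view,
    "spatial": handle_spatial,
    "fti": handle_fti,
}

def run_query(qtype, params):
    # Looking up the handler keeps the control flow flat; an unknown
    # query type fails loudly instead of falling through silently.
    try:
        handler = HANDLERS[qtype]
    except KeyError:
        raise ValueError("unknown query type: %s" % qtype)
    return handler(params)
```

In Erlang the same shape falls out naturally as one function with a
clause per query type.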
> Support for view parameters is limited to startkey and endkey. At the
> very least, start_docid and end_docid should be supported. The other
> various parameters affecting collation should also probably be
> supported. limit and count would be nice. I'm sure there are probably
> others too, but there are also ones that probably don't need to be
> included.
> How should reduces be handled, if at all? I don't see them being
> handled now, but I can assure you that people will want some sort of
> support if this goes into trunk.
> Passing the view groups between processes does not seem like a good
> idea. I'd have to look back at the view_group code to double check
> that though.
>
> Now I'll stop bitching and tell you that there is actually some hope,
> and I'm intrigued by where this could go.
>
<ncb>
thanks for the code feedback, this is what I wanted.
</ncb>
> The current algorithm structure you have is pretty interesting. I
> think with a couple of improvements it would go a long way.
>
> If I were to write this, I would start by cleaning up the row counts
> code to give a quicker response without iterating each query. Once you
> have the row counts, for every query except the largest, iterate over
> the output to generate a Bloom filter of the ids contained in that
> view query. Then to send data to the client you just iterate over the
> largest query, checking that each id is in all of the Bloom filters.
>
<ncb>
I want to clean up the row counts code; I don't want to iterate each
query, but I am not aware of a method call to get the row count between
two keys in a view.
</ncb>
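For reference, the reason a reduce-annotated btree can answer this
without iterating is that interior nodes cache the reduction (e.g. a
row count) of their subtree, so a range count costs O(log n). Over a
plain sorted key list the same idea collapses to two binary searches; a
minimal sketch (illustration only, not CouchDB's actual API):

```python
import bisect

def range_count(sorted_keys, startkey, endkey):
    """Count keys in [startkey, endkey] without visiting each one.

    A CouchDB view with a _count-style reduce can do the equivalent
    in the btree itself, because interior nodes cache the reduction
    of their subtree.
    """
    lo = bisect.bisect_left(sorted_keys, startkey)
    hi = bisect.bisect_right(sorted_keys, endkey)
    return hi - lo
```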
> For a NIF version of Bloom filters, check here:
> http://github.com/basho/ebloom There's also a blog post by Jonathan
> Ellis from the Cassandra group that gives some pretty good details on
> Bloom filters: http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
> There's also a Wikipedia page.
>
<ncb>
this will be an interesting task, I will get on this.
</ncb>
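Paul's scheme above could be sketched roughly like this (a toy Python
illustration; a real implementation would use ebloom or similar rather
than this filter):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions into a bit array.
    False positives are possible; false negatives are not."""

    def __init__(self, nbits=1024, nhashes=3):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, item):
        for i in range(self.nhashes):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def intersect(queries):
    """queries: lists of doc ids, one per view query.
    Build a filter for every query except the largest, then stream
    the largest query and keep only ids that pass every filter."""
    queries = sorted(queries, key=len)
    largest = queries[-1]
    filters = []
    for q in queries[:-1]:
        f = BloomFilter()
        for doc_id in q:
            f.add(doc_id)
        filters.append(f)
    return [doc_id for doc_id in largest
            if all(f.might_contain(doc_id) for f in filters)]
```

Note that Bloom filters admit false positives, so with this scheme a
stray id could occasionally slip through; exact results would need a
verification pass against the actual indexes.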
> This layout also opens up future optimizations that would make things
> even quicker. For instance, Bloom filters could be pre-computed using
> a built-in reduce function. Adding a special call for
> "row_count_and_filter" would then make it super quick to get the
> information you need before iterating just one view, which could be
> the smallest view at that point for an extra win.
>
> Using a bloom filter check for each query also gives the ability to
> more easily do complex boolean checks among the various view queries.
>
>
> As to whether this should go into trunk: I was talking with Volker
> the other day and was a bit surprised when he mentioned that he wasn't
> sure he wanted the spatial indexer in trunk. He was working on making
> the indexer an external plugin for CouchDB without requiring a
> complete fork. I think if we put in the effort to make plugins like
> this easy to bundle and install, then there might not be much reason
> to include more things in trunk. And if we do go the plugin route, I'd
> still want these and couchdb-lucene in a contrib section so that
> people feel safer using them, they get tested more, and they can
> eventually go into trunk as built-in features.
>
<ncb>
Plugins would be good, but honestly it isn't hard to change local.ini.
With the multiview I would rather see focus on external
http_db_handlers such as FTI, and on getting them to stream their
results rather than having to write a complete result on one stdio
line.

I would like this in trunk mainly because I want to hack on trunk, and
to do that I need to be a committer :-) Plugins work fine.
</ncb>
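The streaming point is the difference between buffering the full
result set and emitting rows as they are produced. A generator-style
sketch (Python for illustration; the real external protocol is
line-oriented JSON over stdio):

```python
import json

def stream_rows(rows, write):
    """Emit a JSON row set incrementally instead of building one
    giant response string in memory. `write` is any callable that
    accepts a chunk of output (e.g. a socket or stdout write)."""
    write('{"rows":[')
    first = True
    for row in rows:
        if not first:
            write(",")
        write(json.dumps(row))  # one row in memory at a time
        first = False
    write("]}")
```

With this shape, memory stays bounded by one row regardless of how
large the result set is.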
> And I'm going to stop typing now.
>
<ncb>
Many thanks for the constructive feedback.
</ncb>

> Paul Davis
>
> On Fri, Jul 30, 2010 at 5:39 PM, Norman Barker <norman.barker@gmail.com> wrote:
>> Hi,
>>
>> a very initial version of the multiview is at
>> http://github.com/normanb/couchdb-multiview for discussion.
>>
>> The views are intersected by using a ring of processes where each
>> node in the ring represents a view, as follows:
>>
>> % Send an id from the start list to the next node in the ring; if
>> % the id is in that node's results, the node sends it on to the
>> % next ring node, and so on.
>> % If the id makes it all the way round the ring and back to the
>> % start node, then it has intersected all queries and should be
>> % included. The nodes in the ring should be sorted by size from
>> % small to large for this to be effective.
>> %
>> % In addition, the initial id list is sent round in parallel.
>>
>> this is implemented in the couch_query_ring module.
>>
>> I have a couple of questions
>>
>> 1) in the module multiview, is there a quicker way to find the counts
>> from startkey to endkey rather than iterating?
>> 2) In the module couch_query_ring is there a quicker way to test for
>> inclusion rather than iterating?
>> 3) Finally, if I hit this concurrently I get an exception,
>>
>> [error] [<0.201.0>] Uncaught error in HTTP request: {exit,
>>                                 {noproc,
>>                                  {gen_server,call,
>>
>> (so ignore my previous email, I am able to trap the msg)
>>
>> I am going to look into (3), but let me know if you have seen this
>> before.
>>
>> I am developing on Windows but also test on Linux. I will work on
>> getting a Linux makefile, but the Makefile.win should be a start.
>>
>> Any help and comments appreciated.
>>
>> Norman
>>
>
