lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2564) Integrating grouping module into Solr 4.0
Date Sat, 04 Jun 2011 10:16:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044253#comment-13044253
] 

Michael McCandless commented on SOLR-2564:
------------------------------------------

bq. I haven't been following this... 

Thank you for the review Yonik!

bq. I took a quick look at the patch, and at first blush it's hard to tell what changes are
cleanup and what changes are cut-over.

It's definitely a big refactoring.

bq. in the QueryComponent, why the change to set the GET_SCORES flag based on the sort(s)?

I assume this is because if you sort or groupSort by a Sort that
contains SortField.SCORE, you need a scorer.  Not sure why Solr trunk
gets away with not doing this, though... Martijn do you know?  Oh, and
likely because the CachingCollector needs to know up front if scores
are needed.

bq. I'm not a fan of this new style for matching request parameters to enums... solr does
a lot of lookups on a typical request, and a switch to this style everywhere could definitely
have an impact (the whole upper-casing the request param so we can match it to the enum name).

The .upper() calls (twice per request, only if the request has
group=true) are negligible here (true, other situations might be
different, though we should fix them to lookup/check once).

And, this approach gives us strong checking of the enum fields.
Contrast that with what's on trunk now:

{noformat}
  String format = params.get(GroupParams.GROUP_FORMAT);
  Grouping.Format defaultFormat = "simple".equals(format) ? Grouping.Format.Simple : Grouping.Format.Grouped;

{noformat}

If you have a typo in "simple", you'll unexpectedly get
Grouping.Format.Grouped.

I think we should strongly check all of our request params?

{quote}
"Accuracy" seems a bit mis-named? It seems to imply an accuracy trade-off, but both methods
are 100% accurate here, they just do different things to serve different usecases. At least
the name doesn't seem to have made it's way into the external API though.

The parameter "group.totalCount" I would expect to return the total count of something, not
control the pre/post faceting thing? (or are the comments just wrong?) If it's to return the
number of groups, then perhaps the name should be "group.groupCount" as totalCount is unit-less.
{quote}

I agree.  This setting just controls what the "total hit count" should
count -- unique groups vs docs.

How about TotalCount.GROUPS and TotalCount.DOCS?

bq. What does "group.docSet" do? The comments don't quite make sense to me, but the param
suggests it's sort of like group.totalCount?

I think we should remove this from this patch (see my comment above --
it applies to LUCENE-3097).

bq. in the interest of reducing the number of parameters, we could dump group.cache and have
a single group.cacheMB parameter that uses 0 as no cache, -1 as maximum needed (solr uses
-1 in this manner in other places too), and other values as literal number of MB (which I'd
discourage people from using personally).

+1, that makes sense.

bq. I'm not sure we should default group.cache to true... there's a downside to the memory
usage, and it's fragile: things may be working just fine, and the user may add a few more
documents to the index and then the limit is hit and it just stops working (but still consumes
much memory and extra log warnings per request).

Enabling caching can make a huge improvement in QPS, especially for
queries that are costly to execute but don't match too many docs.

Maybe instead of a fixed MB we could make it a percentage of the
maxDoc?  This would make it less fragile..  So you could say you're
willing to cache up to 50% of the total docs in the index.

bq. FYI: there's a nocommit in there misspelled as "No commit"

Martijn can you fix this one...?  And in general try to spell it as
"nocommit" :) This is what our build catches.  (And, fear not, you're
not the only person to have trouble spelling nocommit!!).


> Integrating grouping module into Solr 4.0
> -----------------------------------------
>
>                 Key: SOLR-2564
>                 URL: https://issues.apache.org/jira/browse/SOLR-2564
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Martijn van Groningen
>            Assignee: Martijn van Groningen
>             Fix For: 4.0
>
>         Attachments: LUCENE-2564.patch, SOLR-2564.patch, SOLR-2564.patch
>
>
> Since work on grouping module is going well. I think it is time to wire this up in Solr.
> Besides the current grouping features Solr provides, Solr will then also support second
pass caching and total count based on groups.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message