lucene-solr-dev mailing list archives

From "Stu Hood (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-303) Federated Search over HTTP
Date Tue, 18 Sep 2007 22:25:45 GMT

    [ https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528593
] 

Stu Hood commented on SOLR-303:
-------------------------------

I've been working further with the most recent version of the patch and have run into a few
more issues. Since I'm sure you have been revising the patch on your own, I don't want
you to have to dig through my changes as a diff. Instead, I'll point them out here
for your next revision.

We have a few fields that are indexed as strings and contain characters like '@' and ':'.
There are still a few places related to the 'df' parameter where these need to be escaped or worked
around, but here is what I've changed so far:
* During the iteration over the document's uniqFields in SecondQPhaseComponent.createSecondPhaseParams
** Surrounded the value in "quotes"
* During the iteration over strTerms in MultiSearchRequestHandler.buildQuery
** Modified the split on '@' to only split on the last '@' in the string.
** Modified the split on ':' to split into a maximum of 2 pieces.
* During the iteration over extractedTerms in GlobalCollectionStatComponent.calcuateGlobalCollectionStat
** Modified the split on ':' to split into a maximum of 2 pieces.
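
For reference, here is a minimal Java sketch of the three string-handling changes listed above. It is not code from the patch. The class and method names are hypothetical, and the term layout used in the comments (field, then ':', then value, then '@', then a suffix) is only an assumption for illustration. Escaping embedded quotes and backslashes goes one step beyond the plain quoting described above and assumes Lucene query parser escaping rules.

```java
final class TermSplitSketch {

    // Split on the LAST '@' only, so '@' characters inside the value survive.
    // Returns a single-element array when the string contains no '@'.
    static String[] splitOnLastAt(String s) {
        int i = s.lastIndexOf('@');
        if (i < 0) {
            return new String[] { s };
        }
        return new String[] { s.substring(0, i), s.substring(i + 1) };
    }

    // Split on the FIRST ':' into at most 2 pieces, so ':' characters
    // inside the value stay part of the second piece.
    static String[] splitFieldAndValue(String s) {
        return s.split(":", 2);
    }

    // Surround a value in double quotes. Embedded backslashes and quotes
    // are escaped first (an assumption added here, not part of the change
    // described in the comment above).
    static String quote(String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }
}
```

With these helpers, a string such as `email:stu@example.com@3` would first be split into `email:stu@example.com` and `3`, and the first piece would then be split into `email` and `stu@example.com`.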


I also ran into some problems in other areas:
* XMLResponseParser.parse(url, params) fails to parse a response if it is indented using the
'indent=on' parameter, which gets passed through to the subqueries
** Stripped out 'indent' during the iteration over the params (but there is probably a better
solution to this issue)
* SecondQPhaseComponent.createSecondPhaseParams passes the 'start' parameter through to the
subqueries, which leads to a null pointer exception when we are querying for specific unique ids.
** Stripped out 'start' during the iteration over the params
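
A rough Java sketch of the parameter filtering described in the two workarounds above. The class name, method name, and the use of a plain Map are hypothetical stand-ins, not the patch's actual types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class SubQueryParamsSketch {

    // Parameters that should not be forwarded to the subqueries:
    // 'indent' breaks the response parsing, and 'start' breaks lookups
    // of specific unique ids in the second phase.
    static final Set<String> NOT_FORWARDED =
            new HashSet<String>(Arrays.asList("indent", "start"));

    // Copy the incoming request parameters, skipping the ones above.
    // The input map is left unmodified.
    static Map<String, String[]> forSubQuery(Map<String, String[]> params) {
        Map<String, String[]> out = new LinkedHashMap<String, String[]>();
        for (Map.Entry<String, String[]> e : params.entrySet()) {
            if (!NOT_FORWARDED.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

Keeping the excluded names in one set makes it easy to add other client-facing parameters later if they turn out to cause similar problems.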


I'll keep looking for the last few 'df' issues. Thanks a lot for the patch!

> Federated Search over HTTP
> --------------------------
>
>                 Key: SOLR-303
>                 URL: https://issues.apache.org/jira/browse/SOLR-303
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Sharad Agarwal
>            Priority: Minor
>         Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> The "Index view consistency between multiple requests" requirement is relaxed in this implementation.
> This implements the query side of federated search. The update side is not yet done.
> Tries to achieve:-
> ------------------------
> - The client applications are totally agnostic to federated search. The federated search
> and merging of results happen entirely behind the scenes in Solr, in the request handler. The response
> format remains the same after merging of results.
> The response from each individual shard is deserialized into a SolrQueryResponse object. The
> collection of SolrQueryResponse objects is merged to produce a single SolrQueryResponse object.
> This makes it possible to use the response writers as they are, or with minimal change.
> - Efficient query processing, with highlighting and fields generated only for
> merged documents. The query is executed in 2 phases. The first phase gets the doc unique keys
> with sort criteria. The second phase brings in all requested fields and highlighting information.
> This saves a lot of CPU when there is a good number of shards and highlighting info is requested.
> It should be easy to customize the query execution. For example, the user can specify that the
> query be executed in just 1 phase. (For some queries, when highlighting info is not required and
> the number of fields requested is small, this can be more efficient.)
> - Ability to easily override the default federated capability with appropriate plugins
> and request parameters. As federated search is performed by the RequestHandler itself, multiple
> request handlers can easily be pre-configured with different federated search settings in
> solrconfig.xml.
> - Global weight calculation is done by querying the terms' doc frequencies from all shards.
> - Federated search works over HTTP transport, so an individual shard's VIP can be queried.
> Load balancing and failover are taken care of by the VIP as usual.
> - Sub-searcher response parsing is a plugin interface. Different implementations could
> be written based on JSON, XML SAX, etc. The current one is based on XML DOM.
> HOW:
> -------
> A new RequestHandler called MultiSearchRequestHandler performs the federated search on multiple
> sub-searchers (referred to as "shards" going forward). It extends RequestHandlerBase. The handleRequestBody
> method in RequestHandlerBase has been divided into query-building and execute methods. This
> has been done in order to calculate global numDocs and docFreqs, and to execute the query efficiently
> on multiple shards.
> All the "search" request handlers are expected to extend the MultiSearchRequestHandler class
> in order to enable federated capability for the handler. StandardRequestHandler and DisMaxRequestHandler
> have been changed to extend this class.
>  
> The federated search kicks in if "shards" is present in the request parameters. Otherwise
> the search is performed as usual on the local index. E.g. shards=local,host1:port1,host2:port2
> will search the local index and 2 remote indexes. The search responses from all 3 shards
> are merged and served back to the client.
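
To make the "shards" parameter concrete, here is a small Java sketch of how such a value might be split into shard entries. It is an illustration only. The class and method names are hypothetical, and the actual parsing in the patch may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class ShardsParamSketch {

    // Split a value like "local,host1:port1,host2:port2" into entries.
    // A null or empty value yields an empty list, meaning "search the
    // local index as usual".
    static List<String> parseShards(String shardsParam) {
        List<String> shards = new ArrayList<String>();
        if (shardsParam == null) {
            return shards;
        }
        for (String entry : shardsParam.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                shards.add(trimmed);
            }
        }
        return shards;
    }

    // Federated search applies only when at least one shard is listed.
    static boolean isFederated(String shardsParam) {
        return !parseShards(shardsParam).isEmpty();
    }
}
```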
> The search request processing on the set of shards is performed as follows:
> STEP 1: The query is built and terms are extracted. Global numDocs and docFreqs are calculated
> by querying all the shards and adding up the numDocs and docFreqs from each shard.
> STEP 2: (FirstQueryPhase) All shards are queried. Global numDocs and docFreqs are passed
> as request parameters. Not all document fields are requested: only the document uniqFields and
> sort fields are requested. MoreLikeThis and highlighting information are NOT requested.
> STEP 3: Responses from FirstQueryPhase are merged based on the "sort", "start" and "rows"
> params. Merged doc uniqFields and sort fields are collected. Other information, like facet and
> debug data, is also merged.
> STEP 4: (SecondQueryPhase) Merged doc uniqFields and sort fields are grouped by
> shard. All shards in the grouping are queried for the merged doc uniqFields (from FirstQueryPhase),
> highlighting and moreLikeThis info.
> STEP 5: Responses from all shards from SecondQueryPhase are merged.
> STEP 6: Document fields, highlighting and moreLikeThis info from SecondQueryPhase are
> merged into the FirstQueryPhase response.
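
The bookkeeping in STEP 1, STEP 3 and STEP 4 can be sketched in Java roughly as follows. This is a simplified illustration, not the patch's code. The Hit class and all method names are hypothetical, sorting is by descending score only (matching the default sort mentioned in the TODO list), and the HTTP requests themselves are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoPhaseMergeSketch {

    // Hypothetical first-phase result: unique key and score, plus the
    // shard that returned it.
    static final class Hit {
        final String shard;
        final String id;
        final float score;

        Hit(String shard, String id, float score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    // STEP 1: add up the per-shard document frequencies for each term.
    static Map<String, Long> sumDocFreqs(List<Map<String, Long>> perShard) {
        Map<String, Long> global = new HashMap<String, Long>();
        for (Map<String, Long> shard : perShard) {
            for (Map.Entry<String, Long> e : shard.entrySet()) {
                global.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return global;
    }

    // STEP 3: merge first-phase hits by descending score, then apply
    // the "start" and "rows" window to the merged list.
    static List<Hit> mergePage(List<List<Hit>> perShardHits, int start, int rows) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> hits : perShardHits) {
            all.addAll(hits);
        }
        all.sort((a, b) -> Float.compare(b.score, a.score));
        int from = Math.min(start, all.size());
        int to = Math.min(from + rows, all.size());
        return new ArrayList<Hit>(all.subList(from, to));
    }

    // STEP 4: group the merged unique keys by shard, so each shard is
    // asked only for the documents it owns.
    static Map<String, List<String>> idsByShard(List<Hit> page) {
        Map<String, List<String>> grouped = new LinkedHashMap<String, List<String>>();
        for (Hit h : page) {
            grouped.computeIfAbsent(h.shard, k -> new ArrayList<String>()).add(h.id);
        }
        return grouped;
    }
}
```

One design point worth noting: for the merged window to be correct, each shard generally has to return its top start + rows hits in the first phase, since any one shard could own the entire requested page.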
> TODO:
> -Support sort field other than default score
> -Support ResponseDocs in writers other than XMLWriter
> -Http connection timeouts
> OPEN ISSUES:
> -Merging of facets by "top n terms of field f" 
> Scope for Performance optimization:-
> -Search shards in parallel threads
> -Http connection Keep-Alive ?
> -Cache global numDocs and docFreqs
> -Cache Query objects in handlers ??
> I would appreciate feedback on my approach. I understand that there may be a lot of things
> I have overlooked.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

