lucene-solr-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: More on topic of Meta-search/Federated Search with Solr
Date Tue, 27 Aug 2013 06:03:14 GMT
Dan,

If you're bound to federated search, then I would say that you need to work on the
service guarantees of each of the nodes and, maybe, create strategies to cope with bad nodes.

paul
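
One possible strategy for coping with bad nodes, sketched here as an illustration rather than anything proposed in the thread: query all sources in parallel, give the whole fan-out a deadline, and skip any source that fails or is too slow. The function name, the `sources` mapping, and the `timeout_s` parameter are all hypothetical.

```python
import concurrent.futures
import time


def federated_search(sources, query, timeout_s=2.0):
    """Query every source in parallel; skip sources that fail or miss the deadline.

    `sources` maps a source name to a callable that takes the query and
    returns a list of hits. Both are placeholders for real remote clients.
    """
    results, skipped = {}, []
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    # Wait at most timeout_s for the whole fan-out, not per source.
    done, not_done = concurrent.futures.wait(futures, timeout=timeout_s)
    for fut in done:
        name = futures[fut]
        try:
            results[name] = fut.result()
        except Exception:
            skipped.append(name)  # the source answered with an error
    for fut in not_done:
        skipped.append(futures[fut])  # the source was too slow
    # Do not block on stragglers; threads already running finish in the background.
    pool.shutdown(wait=False, cancel_futures=True)
    return results, sorted(skipped)
```

With this shape the response time is bounded by `timeout_s` instead of by the slowest node, at the price of incomplete results whenever a source is skipped.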


On 26 Aug 2013, at 22:57, Dan Davis wrote:

> First answer:
> 
> My employer is a library and does not have the license to harvest everything
> indexed by a "web-scale discovery service" such as PRIMO or Summon. If
> our design automatically relays searches entered by users, and then
> periodically purges results, I think it is reasonable from a licensing
> perspective.
> 
> Second answer:
> 
> What if you wanted your Apache Solr powered search to include all results
> from Google Scholar for any query? Do you think you could easily or
> cheaply configure a ZooKeeper cluster large enough to harvest and index all
> of Google Scholar? Would that violate robot rules? Is it even possible
> to do this from an API perspective? Wouldn't Google notice?
> 
> Third answer:
> 
> On Gartner's 2013 Enterprise Search Magic Quadrant, LucidWorks and the
> other Enterprise Search firm based on Apache Solr were dinged for the lack
> of Federated Search. I do not have the hubris to think I can fix that, and
> it is not really my role to try, but something that works without
> harvesting and local indexing is obviously desirable to Enterprise Search
> users.
> 
> 
> 
> On Mon, Aug 26, 2013 at 4:46 PM, Paul Libbrecht <paul@hoplahup.net> wrote:
> 
>> 
>> Why not simply create a meta search engine that indexes everything from each
>> of the nodes?
>> (I think this is called harvesting.)
>> 
>> I believe that this is the way to avoid all sorts of performance bottlenecks.
>> As far as I could analyze, the performance of a federated search is the
>> performance of the slowest node, which can turn out to be quite bad if you
>> have no guarantees from the remote sources.
>> 
>> Or are the "remote cores" below actually things that you manage on your
>> side? If so, guarantees are easy to manage.
>> 
>> Paul
>> 
>> 
>> On 26 Aug 2013, at 22:38, Dan Davis wrote:
>> 
>>> I have now come to the task of estimating man-days to add "Blended Search
>>> Results" to Apache Solr. The argument has been made that this is not
>>> desirable (see Jonathan Rochkind's blog entries on Bento search with
>>> Blacklight). But the estimate remains. No estimate is worth much
>>> without a design. So I have come to the difficulty of estimating this
>>> without having in-depth knowledge of the Apache Solr core. Here is my
>>> design, likely imperfect, as it stands.
>>> 
>>>  - Configure a core specific to each search source (local or remote).
>>>  - On cores that index remote content, implement a periodic delete query
>>>  that deletes documents whose timestamp is too old.
>>>  - Implement a custom requestHandler for the "remote" cores that goes out
>>>  and queries the remote source. For each result in the top N
>>>  (configurable), it computes an id that is stable (e.g. it is based on the
>>>  remote resource URL, DOI, or hash of data returned). It uses that id to
>>>  look up the document in the Lucene index. If the data is not there, it
>>>  updates the Lucene core and sets a flag that a commit is required. Once it
>>>  is done, it commits if needed.
>>>  - Configure a core that uses a custom SearchComponent to call the
>>>  requestHandler that goes and gets new documents and commits them. Since
>>>  the cores for remote content are different cores, they can restart their
>>>  searcher at this point if any commit is needed. The custom
>>>  SearchComponent will wait for the commit and reload to be completed. Then,
>>>  the search continues, using the other cores as "shards".
>>>  - Auto-warming on this will ensure that the most recently requested data
>>>  is present.
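
The caching steps in the list above can be sketched roughly as follows. This is a minimal illustration and not Solr plugin code: a plain dict stands in for the remote-content core, and the `timestamp` field name, the default `7DAYS` retention, and all function names are assumptions. The purge body uses Solr's JSON delete-by-query form with date math.

```python
import hashlib
import json


def stable_id(url=None, doi=None, payload=b""):
    """Derive a repeatable id: prefer the DOI, then the URL, then a hash of the data."""
    if doi:
        key = "doi:" + doi.strip().lower()  # DOIs are case-insensitive
    elif url:
        key = "url:" + url.strip()
    else:
        key = "data:" + hashlib.sha256(payload).hexdigest()
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def purge_command(max_age="7DAYS", field="timestamp"):
    """Build a JSON delete-by-query body for documents older than max_age."""
    return json.dumps({"delete": {"query": f"{field}:[* TO NOW-{max_age}]"}})


def cache_results(index, hits):
    """Add hits missing from `index` (a dict standing in for the core).

    Returns True when something was added, i.e. when a commit would be needed.
    """
    commit_needed = False
    for hit in hits:
        doc_id = stable_id(hit.get("url"), hit.get("doi"), hit.get("raw", b""))
        if doc_id not in index:
            index[doc_id] = hit
            commit_needed = True
    return commit_needed
```

Running `cache_results` twice with the same hits returns `True` the first time and `False` the second, which is the point of the stable id: a repeated query does not force another commit.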
>>> 
>>> It will, of course, be very slow a good part of the time.
>>> 
>>> Erick and others, I need to know whether this design has legs and what other
>>> alternatives I might consider.
>>> 
>>> 
>>> 
>>> On Sun, Aug 18, 2013 at 3:14 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>>> 
>>>> The lack of global TF/IDF has been answered in the past,
>>>> in the sharded case, by "usually you have similar enough
>>>> stats that it doesn't matter". This pre-supposes a fairly
>>>> evenly distributed set of documents.
>>>> 
>>>> But if you're talking about federated search across different
>>>> types of documents, then what would you "rescore" with?
>>>> How would you even consider scoring docs that are somewhat/
>>>> totally different? Think magazine articles and metadata associated
>>>> with pictures.
>>>> 
>>>> What I've usually found is that one can use grouping to show
>>>> the top N of a variety of results. Or show tabs with different
>>>> types. Or have the app intelligently combine the different types
>>>> of documents in a way that "makes sense". But I don't know
>>>> how you'd just get "the right thing" to happen with some kind
>>>> of scoring magic.
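
The "top N of a variety of results" idea can be illustrated on the application side with a few lines. This is only a sketch with hypothetical field names (`type`, `score`); within Solr itself, the result grouping parameters (`group=true`, `group.field`, `group.limit`) serve a similar purpose.

```python
from collections import defaultdict


def top_n_per_type(hits, n=3):
    """Group hits by their "type" field and keep the n best-scoring hits of each type.

    Scores are only compared within a type, never across types.
    """
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["type"]].append(hit)
    return {
        doc_type: sorted(group, key=lambda h: h["score"], reverse=True)[:n]
        for doc_type, group in groups.items()
    }
```

Because each group is ranked on its own, the question of how to score an article against a picture never arises; the application only decides how to lay the groups out.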
>>>> 
>>>> Best
>>>> Erick
>>>> 
>>>> 
>>>> On Fri, Aug 16, 2013 at 4:07 PM, Dan Davis <dansmood@gmail.com> wrote:
>>>> 
>>>>> I've thought about it, and I have no time to really do a meta-search during
>>>>> evaluation. What I need to do is to create a single core that contains
>>>>> both of my data sets, and then describe the architecture that would be
>>>>> required to do blended results, with liberal estimates.
>>>>> 
>>>>> From the perspective of evaluation, I need to understand whether any of the
>>>>> solutions to better ranking in the absence of global IDF have been
>>>>> explored. I suspect that one could retrieve a much larger than N set of
>>>>> results from a set of shards, re-score in some way that doesn't require
>>>>> IDF, e.g. storing both result sets in the same priority queue and *re-scoring*
>>>>> before *re-ranking*.
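
One way to picture re-scoring before re-ranking without IDF is sketched below. The re-scoring step here is per-source min-max normalization, which is an assumption chosen for illustration and not a method named in the thread; the function names and the `(doc, score)` hit shape are hypothetical as well.

```python
import heapq


def normalize(hits):
    """Rescale one source's scores to [0, 1] so different score ranges become comparable."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # A source whose hits all share one score gets 1.0 for each of them.
    return [(doc, (score - lo) / span if span else 1.0) for doc, score in hits]


def blend(result_sets, n=10):
    """Re-score every source's hits, pool them, and re-rank with a single heap."""
    pooled = []
    for source, hits in result_sets.items():
        for doc, score in normalize(hits):
            pooled.append((score, source, doc))
    top = heapq.nlargest(n, pooled)  # ties fall back to source name, then doc id
    return [(doc, source, round(score, 3)) for score, source, doc in top]
```

Min-max normalization always places each source's best hit at 1.0, so it assumes every source has at least one good answer. That weakness is part of why blended ranking is hard to get right.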
>>>>> 
>>>>> The other way to do this would be to have a custom SearchHandler that works
>>>>> differently: it performs the query, retrieves all results deemed relevant by
>>>>> another engine, adds them to the Lucene index, and then performs the query
>>>>> again in the standard way. This would be quite slow, but perhaps useful
>>>>> as a way to evaluate my method.
>>>>> 
>>>>> I still welcome any suggestions on how such a SearchHandler could be
>>>>> implemented.
>>>>> 
>>>> 
>>>> 
>> 
>> 

