lucene-solr-user mailing list archives

From Mike Klaas <mike.kl...@gmail.com>
Subject Re: solr, snippets and stored field in nutch...
Date Thu, 11 Oct 2007 23:42:29 GMT
On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:

> Hi Mike,
>
> Thanks for your reply :)
>
> I am not an expert in either! But I understand that Nutch stores the
> contents, albeit in a separate data structure (which they call a
> segment, as discussed in the thread). What I meant was that this seems
> like a much more efficient way of presenting summaries or snippets (for
> apps that need only these, of course) than using a stored field, which
> is the only option in Solr. A stored field not only results in a huge
> index but also reduces retrieval speed because of that increase in size
> (this is admittedly a guess; I would like to know if it is not the
> case). Also, for queries requesting only ids/urls, the segments would
> never be touched, even for the first n results...

It doesn't slow down querying, but it does slow down document
retrieval (if you are never going to request the summaries for those
documents).  That is the case I was referring to below.

One option that has been kicked around is to have Solr support
dividing the stored fields into multiple Lucene indices.  This would
accomplish the same result as running two Solr servers for the
purpose, but would be quite complicated to implement.
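
For what it's worth, here is a rough sketch of that split in plain Python (not Solr or Lucene code). The `SplitStore` class and its method names are made up for illustration, and sorting by id stands in for real relevance ranking. The only point is that the id lookup never touches the contents, and the contents are read once per document that actually gets a snippet.

```python
from dataclasses import dataclass, field


@dataclass
class SplitStore:
    """Toy model: a searchable index kept apart from the document contents."""
    index: dict = field(default_factory=dict)     # term -> set of doc ids
    contents: dict = field(default_factory=dict)  # doc id -> full text
    content_reads: int = 0                        # counts content-store lookups

    def add(self, doc_id, text):
        # Store the full text in one place and the terms in another.
        self.contents[doc_id] = text
        for term in set(text.lower().split()):
            self.index.setdefault(term, set()).add(doc_id)

    def search_ids(self, term):
        # Touches only the index; no document contents are read.
        # Sorting by id is a stand-in for relevance ranking.
        return sorted(self.index.get(term.lower(), set()))

    def snippet(self, doc_id, term, width=30):
        # Touches the content store, once per summarized document.
        self.content_reads += 1
        text = self.contents[doc_id]
        pos = text.lower().find(term.lower())
        start = max(0, pos - width)
        return text[start:pos + len(term) + width]

    def search_with_snippets(self, term, top_n=10):
        # Contents are fetched only for the top_n ids, not for every match.
        ids = self.search_ids(term)[:top_n]
        return [(doc_id, self.snippet(doc_id, term)) for doc_id in ids]
```

If 50 documents match and you ask for `top_n=10`, `content_reads` ends up at 10; a caller that only wants ids via `search_ids` leaves it at 0.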

I could be wrong, though.  Feel free to give it a shot!

-Mike


> Cheers.
> Ravish
>
> On 10/12/07, Mike Klaas <mike.klaas@gmail.com> wrote:
>> First, it should be noted that I am not an expert in Nutch's
>> architecture.  I do think I understand what is being said there,
>> however.
>>
>> Nutch is a distributed web search engine, and uses Lucene as an
>> indexing component.  It is free to use external data structures to
>> store data, and can store the index on a different machine than the
>> one where the contents are stored.  They can be updated independently.
>>
>> One reason why this is more efficient is that in a distributed
>> architecture, more documents are retrieved over the system than are
>> eventually summarized and output.  It makes no sense to shovel around
>> the contents of all these documents if summaries are only being
>> returned for the top 10 over the whole system.
>>
>> But Nutch is still storing the contents _somewhere_.  They haven't
>> found a magical technique that makes this need disappear.
>>
>> So, does an external store make sense for Solr? Well, unlike Nutch,
>> Solr is a solitary unit.  If you ask for 10 docs returned, with
>> summaries, all of their contents are going to have to be retrieved.
>> There aren't any advantages to storing the contents in a separate
>> data structure (which will be the same size).
>>
>> Now, if you are using Solr in a large-scale distributed federated
>> way, then you can replicate Nutch's strategy by storing the index in
>> one Solr index, and the contents in another.  This could also yield
>> benefits in a single-machine context if your code accesses many more
>> documents than it wants summarized.
>>
>> Keep in mind also that Solr has facilities to help you manage the
>> size of the content store.  Are you stripping your contents to their
>> bare minima (removing HTML, etc)?  Are you using a compressed text
>> field (highly recommended for this kind of data)?
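
Both suggestions can be sketched in a few lines of Python. This uses the standard library's `html.parser` and `zlib` as stand-ins; it is not how Solr implements compressed fields, and `strip_html`, `compress_text` and `decompress_text` are hypothetical helper names chosen for the example.

```python
import zlib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores tags.

    Note: the bodies of <script> and <style> elements are also reported
    as data, so a real cleaner would need to filter those out.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    # Reduce a page to its text, collapsing runs of whitespace.
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def compress_text(text):
    # Compress the stripped text before storing it.
    return zlib.compress(text.encode("utf-8"), 9)


def decompress_text(blob):
    # Recover the original text when a summary is needed.
    return zlib.decompress(blob).decode("utf-8")


page = "<html><body><h1>Solr</h1><p>Stored fields hold the text.</p></body></html>"
text = strip_html(page)
blob = compress_text(text * 100)
print(len((text * 100).encode("utf-8")), "bytes raw,", len(blob), "bytes compressed")
```

How much this saves depends entirely on the data; repetitive text like the example above compresses far better than real pages will.
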
>>
>> Believe me, if I found that there was a way of providing summaries
>> without storing doc contents, I would pee my pants with happiness and
>> it would be in Solr faster than you can say "diaper".
>>
>> cheers,
>> -Mike
>>
>> On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:
>>
>>> Hey guys,
>>>
>>> Check out this thread I opened on the Nutch mailing list.  It looks
>>> like Solr can benefit from reusing Nutch's "segment"-based storage
>>> strategy for efficiency in returning snippets, summaries, etc.
>>> without using Lucene stored fields?
>>>
>>> Was this considered before?
>>>
>>> Ravish
>>>
>>> ---------- Forwarded message ----------
>>> From: Dennis Kubes <kubes@apache.org>
>>> Date: Oct 11, 2007 11:27 PM
>>> Subject: Re: snippets and stored field in nutch...
>>> To: nutch-user@lucene.apache.org
>>>
>>>
>>> The reason it is stored in the segments instead of the index is to
>>> allow summarizers to be run on the content of hits to produce the
>>> summaries that appear in the search results.  Summarizers are
>>> pluggable, and the actual content used to produce the summary can
>>> change.  Summaries can also be changed without re-fetching or
>>> re-indexing.  If a summary were stored in the index, re-indexing
>>> would have to occur to make changes.
>>>
>>> Also, the way the search process works, Nutch returns hits (basically
>>> document ids).  These hits are then sorted and deduped, and the best
>>> x (usually 10) are returned.  Only for these 10 best hits are hit
>>> details (fields in the index) and summaries retrieved.  So there is
>>> something to be said about the amount of data being pushed over the
>>> network.
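
The flow described above (collect hits, sort, dedupe, keep the best few, then fetch summaries only for those) can be sketched roughly as follows. This is an illustration in Python, not Nutch code: the hit tuple layout, deduping by URL, the 60-character "summary", and the names `merge_hits` and `fetch_summaries` are all assumptions made for the example.

```python
import heapq


def merge_hits(shard_hits, top_n=10):
    """Merge per-shard (score, doc_id, url) hits and keep the best top_n.

    Deduping by URL is an assumption for this sketch; when two hits share
    a URL, the higher-scoring one is kept.
    """
    best_by_url = {}
    for hits in shard_hits:
        for score, doc_id, url in hits:
            current = best_by_url.get(url)
            if current is None or score > current[0]:
                best_by_url[url] = (score, doc_id, url)
    return heapq.nlargest(top_n, best_by_url.values(), key=lambda h: h[0])


def fetch_summaries(hits, segment_store):
    # Only the surviving hits trigger a content lookup; the "summary"
    # here is just the first 60 characters of the stored text.
    return {doc_id: segment_store[doc_id][:60] for _, doc_id, _ in hits}
```

Only small hit tuples travel between the shards and the merger; document text is read for the final `top_n` hits and nothing else.
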
>>>
>>> Dennis Kubes
>>>
>>> Ravish Bhagdev wrote:
>>>> Ah, I see, I didn't know that. Thanks!
>>>>
>>>> Interesting that Nutch stores it in a different structure (segments)
>>>> and doesn't reuse Lucene's strategy of storing it within the index.
>>>> Any particular reason why?  Is there any other use of the "segments"
>>>> data structure besides returning snippets?
>>>>
>>>> Cheers,
>>>> Ravish
>>>>
>>>> On 10/11/07, John H. Lee <jlee@archive.org> wrote:
>>>>> Hi Ravish.
>>>>>
>>>>> You are correct that Nutch does not store document content in the
>>>>> Lucene index. The content *is* stored in the Nutch segment, which
>>>>> is where snippets come from.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> -J
>>>>>
>>>>>
>>>>> On Oct 11, 2007, at 12:08 PM, Ravish Bhagdev wrote:
>>>>>
>>>>>> Hey All,
>>>>>>
>>>>>> Am I right in believing that in Lucene/Nutch, to be able to return
>>>>>> content or a snippet for a search query, the field to be returned
>>>>>> has to be stored?
>>>>>>
>>>>>> AFAIK, by default, Nutch does not store the document field, am I
>>>>>> right?  If so, how does it manage to return snippets?  Wouldn't
>>>>>> the index be quite huge if Nutch were storing the document field
>>>>>> by default?
>>>>>>
>>>>>> I would appreciate any help/comments, as I'm a bit lost with this.
>>>>>>
>>>>>> Ravi
>>>>>
>>
>>

