lucene-solr-user mailing list archives

From philmccarthy <philmccar...@gmail.com>
Subject Re: Indexing the same data in many records
Date Thu, 15 Jan 2009 10:32:26 GMT

Hi,

Adding the same document many times is actually the scenario I wanted to
test--indexing hits from Apache webserver logs with the source of the
referring page.

My expectation would be that the majority of hits on a given day would
originate from a small number of referrers, so each of these referring pages
would be indexed multiple times. I really wanted to check that this would
scale better than indexing the same number of different documents--your
explanation regarding term distribution explains why this is the case.
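
A minimal sketch of that point, assuming a toy in-memory inverted index
rather than Lucene's actual on-disk format. The function names and the
sample text are made up for illustration. It compares indexing one page
1000 times with indexing 1000 pages that share no vocabulary: the postings
grow the same way in both cases, but the term dictionary only grows in the
second.

```python
from collections import defaultdict


def build_index(docs):
    """Map each distinct term to the list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index


def index_stats(index):
    """Return (number of distinct terms, total number of postings)."""
    return len(index), sum(len(postings) for postings in index.values())


# Hypothetical referring page, standing in for the ~55KB test document.
page = "the quick brown fox jumps over the lazy dog"

# Case 1: the same page indexed 1000 times (only the document ID differs).
same_docs = [page] * 1000

# Case 2: 1000 pages with no shared vocabulary, built by suffixing each
# word with the document number. This is a deliberately extreme assumption.
distinct_docs = [
    " ".join(f"{word}{i}" for word in page.split()) for i in range(1000)
]

same_terms, same_postings = index_stats(build_index(same_docs))
diff_terms, diff_postings = index_stats(build_index(distinct_docs))

print(same_terms, same_postings)  # 8 8000
print(diff_terms, diff_postings)  # 8000 8000
```

Real pages fall somewhere between these two extremes, so this only shows
the direction of the effect, not its size.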

Many thanks,
Phil


Otis Gospodnetic wrote:
> 
> Phil,
> 
> Note that adding the same document multiple times and looking at the index
> size is not a very good approach.  You are adding a fixed number of
> distinct terms over and over.  In a real-life scenario you will have a much
> greater term distribution, and that will affect index size.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: philmccarthy <philmccarthy@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, January 14, 2009 7:36:38 PM
>> Subject: Re: Indexing the same data in many records
>> 
>> 
>> Thanks Otis. I tweaked the Solr example app a little and then uploaded a
>> ~55KB document to it a couple of thousand times (changing the ID each
>> time). The solr/data directory was 72MB on disc after adding the document
>> 2000 times, so it seems that the index is growing by approximately 36KB
>> for each document. That seems reasonable.
>> 
>> I guess I need to do some research into expected data volumes now, and
>> limits on Lucene index size.
>> 
>> Cheers,
>> Phil
>> 
>> 
>> Otis Gospodnetic wrote:
>> > 
>> > Phil,
>> > 
>> > From what you described so far, I don't see any red flags.  I would pay
>> > attention to reading those timestamps (covered on the Wiki and ML
>> > archives), that's all.
>> > 
>> > 
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> > 
>> > 
>> > 
>> > ----- Original Message ----
>> >> From: philmccarthy 
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Tuesday, January 13, 2009 8:49:33 PM
>> >> Subject: Indexing the same data in many records
>> >> 
>> >> 
>> >> Hi,
>> >> 
>> >> I'd like to use Solr to index some webserver logs, in order to allow
>> >> easy ad-hoc querying and analysis. Each Solr Document will represent a
>> >> single request to the webserver, with fields for time, request URL,
>> >> referring URL, etc.
>> >> 
>> >> I'm also planning to fetch the page source of each referring URL, and
>> >> add that as an indexed field in the Solr document. The aim is to allow
>> >> queries like "find hits to /xyz.html where the referring page contains
>> >> the word 'foobar'".
>> >> 
>> >> Since hundreds or even thousands of hits may all come from the same
>> >> referring page, would this approach be horribly inefficient? (Note the
>> >> page source won't be stored in each Document, just indexed). Am I going
>> >> to dramatically increase the index size if I do this?
>> >> 
>> >> If so, is there a more elegant way to do what I want?
>> >> 
>> >> Many thanks,
>> >> Phil
>> >> 
>> >> 
>> >> 
>> >> -- 
>> >> View this message in context: http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21448465.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> > 
>> > 
>> > 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21468706.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21475019.html
Sent from the Solr - User mailing list archive at Nabble.com.

