From: philmccarthy
To: solr-user@lucene.apache.org
Subject: Re: Indexing the same data in many records
Date: Thu, 15 Jan 2009 02:32:26 -0800 (PST)
Message-ID: <21475019.post@talk.nabble.com>
In-Reply-To: <596322.15120.qm@web50310.mail.re2.yahoo.com>
References: <21448465.post@talk.nabble.com> <446894.22198.qm@web50307.mail.re2.yahoo.com>
 <21468706.post@talk.nabble.com> <596322.15120.qm@web50310.mail.re2.yahoo.com>

Hi,

Adding the same document many times is actually the scenario I wanted to
test--indexing hits from Apache webserver logs with the source of the
referring page. My expectation would be that the majority of hits on a given
day would originate from a small number of referrers, so each of these
referring pages would be indexed multiple times.

I really wanted to check that this would scale better than indexing the same
number of different documents--your explanation regarding term distribution
explains why this is the case.

Many thanks,
Phil


Otis Gospodnetic wrote:
>
> Phil,
>
> Note that adding the same document multiple times and looking at the index
> size is not a very good approach. You are adding a fixed number of
> distinct terms over and over. In a real-life scenario you will have a much
> greater term distribution, and that will affect index size.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>> From: philmccarthy
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, January 14, 2009 7:36:38 PM
>> Subject: Re: Indexing the same data in many records
>>
>>
>> Thanks Otis. I tweaked the Solr example app a little and then uploaded a
>> ~55KB document to it a couple of thousand times (changing the ID each
>> time). The solr/data directory was 72MB on disc after adding the document
>> 2000 times, so it seems that the index is growing by approximately 36KB
>> for each document. That seems reasonable.
>>
>> I guess I need to do some research into expected data volumes now, and
>> limits on Lucene index size.
>>
>> Cheers,
>> Phil
>>
>>
>> Otis Gospodnetic wrote:
>> >
>> > Phil,
>> >
>> > From what you described so far, I don't see any red flags. I would pay
>> > attention to reading those timestamps (covered on the Wiki and ML
>> > archives), that's all.
>> >
>> >
>> > Otis
>> > --
>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: philmccarthy
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Tuesday, January 13, 2009 8:49:33 PM
>> >> Subject: Indexing the same data in many records
>> >>
>> >>
>> >> Hi,
>> >>
>> >> I'd like to use Solr to index some webserver logs, in order to allow
>> >> easy ad-hoc querying and analysis. Each Solr Document will represent a
>> >> single request to the webserver, with fields for time, request URL,
>> >> referring URL etc.
>> >>
>> >> I'm also planning to fetch the page source of each referring URL, and
>> >> add that as an indexed field in the Solr document. The aim is to allow
>> >> queries like "find hits to /xyz.html where the referring page contains
>> >> the word 'foobar'".
>> >>
>> >> Since hundreds or even thousands of hits may all come from the same
>> >> referring page, would this approach be horribly inefficient? (Note the
>> >> page source won't be stored in each Document, just indexed). Am I going
>> >> to dramatically increase the index size if I do this?
>> >>
>> >> If so, is there a more elegant way to do what I want?
>> >>
>> >> Many thanks,
>> >> Phil
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21448465.html
>> >> Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21468706.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

--
View this message in context: http://www.nabble.com/Indexing-the-same-data-in-many-records-tp21448465p21475019.html
Sent from the Solr - User mailing list archive at Nabble.com.
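
For anyone reading the archive later, the term-distribution point in this thread can be illustrated with a toy inverted index. This is only a rough sketch, not how Lucene stores data: build_index, index_stats, and the sample page text are hypothetical names and values made up for illustration, and the model ignores positions, compression, norms, and stored fields. It shows that re-indexing one page under many IDs grows the postings lists while the set of distinct terms stays fixed, whereas the same number of documents with different vocabulary grows both. The last line redoes the per-document arithmetic from the thread, assuming 1MB = 1000KB.

```python
from collections import defaultdict


def build_index(docs):
    """Toy inverted index: term -> list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id, text in docs:
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return index


def index_stats(index):
    """Return (distinct_terms, total_postings) for a toy index."""
    return len(index), sum(len(postings) for postings in index.values())


# Hypothetical referrer page text, indexed 2000 times under different IDs.
page = "foobar widgets on sale this week at the foobar shop"
same = [(i, page) for i in range(2000)]

# 2000 documents of the same length, each using its own made-up vocabulary.
different = [(i, " ".join(f"w{i}_{j}" for j in range(10))) for i in range(2000)]

same_terms, same_postings = index_stats(build_index(same))
diff_terms, diff_postings = index_stats(build_index(different))

print("same page x2000:     ", same_terms, "terms,", same_postings, "postings")
print("2000 different pages:", diff_terms, "terms,", diff_postings, "postings")

# Average growth per document from the figures quoted in the thread
# (72MB for 2000 copies, taking 1MB as 1000KB).
kb_per_doc = 72 * 1000 / 2000
print("approx KB per document:", kb_per_doc)
```

In the repeated-page case the term count stays at the size of that one page's vocabulary however many copies are added, which is why a test with a single repeated document is likely to understate index growth for varied real-world pages.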