Date: Mon, 17 Oct 2011 16:34:10 +0200
Subject: Re: Suggestions or best practices for indexing the logs
From: Alex Shneyderman
To: general@lucene.apache.org, Otis Gospodnetic
In-Reply-To: <1318860778.18434.YahooMailNeo@web130108.mail.mud.yahoo.com>

Otis,

Not sure I understand. Could you elaborate? Note that the content is not
stored in the index itself, hence my confusion about your suggestion.

Thanks,
Alex.

On Mon, Oct 17, 2011 at 4:12 PM, Otis Gospodnetic wrote:
> Alex,
>
> You could try compressing the content field - that might help a bit.
>
> Otis
> ----
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>>________________________________
>>From: Alex Shneyderman
>>To: general@lucene.apache.org
>>Sent: Thursday, October 13, 2011 7:21 PM
>>Subject: Suggestions or best practices for indexing the logs
>>
>>Hello, everybody!
>>
>>I am trying to introduce faster searches to our application that sifts
>>through the logs, and Lucene seems to be the tool to use here. The one
>>peculiarity of the problem is that there are few files and they
>>contain many log statements. I avoid storing the text in the index
>>itself. Given all this, I set up indexing as follows:
>>
>>I iterate over a log file, and for each statement in the log file I
>>index the statement's content.
>>
>>Here is the Java code that adds the fields:
>>
>>    // Start offset of the statement in the log file: stored, not indexed.
>>    NumericField startOffset = new NumericField("so", Field.Store.YES, false);
>>    startOffset.setLongValue(statement.getStartOffset());
>>    doc.add(startOffset);
>>
>>    // End offset of the statement: stored, not indexed.
>>    NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
>>    endOffset.setLongValue(statement.getEndOffset());
>>    doc.add(endOffset);
>>
>>    // Statement timestamp in milliseconds: stored and indexed.
>>    NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
>>    timestampField.setLongValue(statement.getStatementTime().getTime());
>>    doc.add(timestampField);
>>
>>    // File tag: stored only. Content: analyzed for search, not stored.
>>    doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
>>    doc.add(new Field("ct", statement.getContent(),
>>            Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO));
>>
>>I am getting the following results (index size vs. log files) with this scheme:
>>
>>The size of the logs is 385MB.
>>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>>385     /var/tmp/logs
>>
>>
>>The size of the index is 143MB.
>>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>>143     /var/tmp/index
>>
>>Is this a normal ratio? 143MB / 385MB seems like a bit too
>>much (I would expect something like 1/5 - 1/7 for the index). Is there
>>anything I can do to move this to the desired ratio? Of course, what
>>would help is the word histogram, and here is the top of the output of
>>the word histogram script that I ran on the logs:
>>
>>Total number of words: 26935271
>>Number of different words: 551981
>>The most common words are:
>>as      3395203
>>10      797708
>>13      797662
>>2011    795595
>>at      787365
>>timer   746790
>>...
>>
>>Could anyone suggest a better way to organize the index for my logs? And
>>by better I mean more compact. Or is this as good as it gets? I tried
>>to optimize and got a 2MB improvement (the index went from 145MB to
>>143MB).
>>
>>Could anyone point to an article that deals with indexing of logs? Any
>>help, suggestions and pointers are greatly appreciated.
>>
>>Thanks for any and all help and cheers,
>>Alex.
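
For anyone who wants to produce a word histogram like the one quoted above for their own logs, here is a minimal sketch in plain Java. It is not the script from the original post (that script was not shown). The class name WordHistogram, its methods, and the sample log lines are made up for illustration, and the tokenization (lowercase, split on anything that is not a letter or digit) is an assumption that will not match Lucene's analyzers exactly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not the script used in the original post. */
final class WordHistogram {

    /** Counts words; a "word" is assumed to be a lowercase run of letters or digits. */
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {          // split may yield a leading empty token
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Total number of words seen, i.e. the sum of all counts. */
    static long total(Map<String, Integer> counts) {
        long sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /** Up to n entries, most frequent first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Made-up sample lines; in practice, read the lines from the log files.
        List<String> sample = Arrays.asList(
                "2011-10-13 timer fired at 10",
                "2011-10-13 timer stopped at 13");
        Map<String, Integer> counts = count(sample);
        System.out.println("Total number of words: " + total(counts));
        System.out.println("Number of different words: " + counts.size());
        System.out.println("The most common words are:");
        for (Map.Entry<String, Integer> e : top(counts, 3)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The counts are kept in a HashMap and sorted only at the end, which keeps memory proportional to the number of distinct words rather than the total number of words.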