Mailing-List: contact general-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: general@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of a.shneyderman@gmail.com
 designates 209.85.212.48 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <1318860778.18434.YahooMailNeo@web130108.mail.mud.yahoo.com>
References: 
 <CA+smeP+irHRFTdf2Ehb9QihyAMii8UW9Tm92TC-O75Wq2Y7VVw@mail.gmail.com>
	<1318860778.18434.YahooMailNeo@web130108.mail.mud.yahoo.com>
Date: Mon, 17 Oct 2011 16:34:10 +0200
Message-ID: 
 <CA+smePL_REpZwee1_4Oef1UcLZhUahT0C5kUPTWmZV2x0RLUJw@mail.gmail.com>
Subject: Re: Suggestions or best practices for indexing the logs
From: Alex Shneyderman <a.shneyderman@gmail.com>
To: general@lucene.apache.org, Otis Gospodnetic <otis_gospodnetic@yahoo.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

Otis,

Not sure I understand. Could you elaborate?

Note that the content is not stored in the index itself, hence my
confusion about your suggestion.

Thanks,
Alex.

On Mon, Oct 17, 2011 at 4:12 PM, Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:
> Alex,
>
> You could try compressing the content field - that might help a bit.
>
> Otis
> ----
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>>________________________________
>>From: Alex Shneyderman <a.shneyderman@gmail.com>
>>To: general@lucene.apache.org
>>Sent: Thursday, October 13, 2011 7:21 PM
>>Subject: Suggestions or best practices for indexing the logs
>>
>>Hello, everybody!
>>
>>I am trying to introduce faster searches to our application that sifts
>>through logs, and Lucene seems to be the tool to use here. The one
>>peculiarity of the problem is that there are few files, each containing
>>many log statements. I avoid storing the text in the index itself.
>>Given all this, I set up indexing as follows:
>>
>>I iterate over a log file and, for each statement in the file, index
>>the statement's content.
>>
>>Here is the Java code that adds the fields:
>>
>>    NumericField startOffset = new NumericField("so",
>>            Field.Store.YES, false);
>>    startOffset.setLongValue(statement.getStartOffset());
>>    doc.add(startOffset);
>>
>>    NumericField endOffset = new NumericField("eo",
>>            Field.Store.YES, false);
>>    endOffset.setLongValue(statement.getEndOffset());
>>    doc.add(endOffset);
>>
>>    NumericField timestampField = new NumericField("ts",
>>            Field.Store.YES, true);
>>    timestampField.setLongValue(statement.getStatementTime().getTime());
>>    doc.add(timestampField);
>>
>>    doc.add(new Field("fn", fileTagName, Field.Store.YES,
>>            Field.Index.NO));
>>    doc.add(new Field("ct", statement.getContent(),
>>            Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO));
>>
>>I am getting the following results (index size vs. log files) with this scheme:
>>
>>The size of the logs is 385MB.
>>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>>385     /var/tmp/logs
>>
>>
>>The size of the index is 143MB.
>>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>>143     /var/tmp/index
>>
>>Is this a normal ratio? 143 MB / 385 MB seems a bit too much (I would
>>expect something like 1/5 to 1/7 for the index). Is there anything I
>>can do to move this toward the desired ratio? Of course, what would
>>help is a word histogram, so here is the top of the output of the
>>word histogram script that I ran on the logs:
>>
>>Total number of words: 26935271
>>Number of different words: 551981
>>The most common words are:
>>as      3395203
>>10      797708
>>13      797662
>>2011    795595
>>at      787365
>>timer   746790
>>...
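
The histogram script itself was not posted. A minimal sketch of how such a count could be produced in Java (8 or later) follows. The class name WordHistogram, its methods, and the tokenization rule (lower-cased runs of letters and digits) are assumptions made for illustration, so its totals would not necessarily match the figures quoted above, nor the terms a Lucene analyzer would emit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Lucene and not the script used above.
class WordHistogram {

    // Counts how often each token occurs across the given lines.
    // Assumption: a token is a maximal run of letters and digits, lower-cased.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) { // split() can yield a leading empty string
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the n most frequent tokens, most frequent first.
    // Ties are broken alphabetically so the order is deterministic.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```

Feeding the log lines to count() and printing top(counts, 6) would give a list in the same shape as the one above. The size of the returned map corresponds to the "number of different words" figure, and the sum of its values to the total.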
>>
>>Could anyone suggest a better way to organize the index for my logs? By
>>better I mean more compact. Or is this as good as it gets? I tried
>>to optimize and got a 2 MB improvement (the index went from 145 MB to
>>143 MB).
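
To make the sizes in question concrete: 143 MB against 385 MB is a ratio of about 0.37, while the hoped-for 1/5 and 1/7 would correspond to an index of 77 MB and 55 MB respectively. The small sketch below only reproduces that arithmetic. The class and method names are hypothetical, and the inputs are the du figures quoted above.

```java
// Hypothetical helper for the arithmetic only; nothing here touches Lucene.
class IndexRatio {

    // Index size expressed as a fraction of the raw log size.
    static double ratio(double indexMb, double logsMb) {
        return indexMb / logsMb;
    }

    // Index size in MB implied by a target ratio of 1/denominator.
    static double targetMb(double logsMb, int denominator) {
        return logsMb / denominator;
    }

    public static void main(String[] args) {
        double logsMb = 385;  // du -ms /var/tmp/logs
        double indexMb = 143; // du -ms /var/tmp/index
        System.out.printf("current ratio: %.2f%n", ratio(indexMb, logsMb));
        System.out.printf("1/5 target: %.0f MB%n", targetMb(logsMb, 5));
        System.out.printf("1/7 target: %.0f MB%n", targetMb(logsMb, 7));
    }
}
```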
>>
>>Could anyone point me to an article that deals with indexing logs? Any
>>help, suggestions, and pointers are greatly appreciated.
>>
>>Thanks for any and all help and cheers,
>>Alex.
>>
>>
>>