From Alex Shneyderman <>
Subject Re: Suggestions or best practices for indexing the logs
Date Mon, 17 Oct 2011 14:34:10 GMT

Not sure I understand. Could you elaborate?

Note that the content is not stored in the index itself (the field is
indexed with Field.Store.NO), hence my confusion about your suggestion.
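
For illustration: with only the offsets and the file tag stored, the text of a
matching statement has to come from the original log file rather than from the
index. Below is a minimal sketch of that lookup. It assumes the "so" and "eo"
values are byte offsets (end exclusive) into a UTF-8 encoded log file. The
class and method names are hypothetical and not taken from the application.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches one log statement from the original file
// using the start/end offsets that were stored in the index.
class StatementReader {

    // Reads the bytes in [startOffset, endOffset) and decodes them as UTF-8.
    // Assumption: offsets are byte offsets, not character offsets.
    static String readStatement(String path, long startOffset, long endOffset)
            throws IOException {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid offsets");
        }
        byte[] buf = new byte[(int) (endOffset - startOffset)];
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(startOffset);   // jump straight to the statement
            raf.readFully(buf);      // throws EOFException if the file is too short
        }
        return new String(buf, StandardCharsets.UTF_8);
    }
}
```

With this layout the index holds only postings plus three small stored values
per statement, so compressing a stored content field would have nothing to act on.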


On Mon, Oct 17, 2011 at 4:12 PM, Otis Gospodnetic
<> wrote:
> Alex,
> You could try compressing the content field - that might help a bit.
> Otis
>>From: Alex Shneyderman <>
>>Sent: Thursday, October 13, 2011 7:21 PM
>>Subject: Suggestions or best practices for indexing the logs
>>Hello, everybody!
>>I am trying to introduce faster searches to our application that sifts
>>through the logs, and Lucene seems to be the tool to use here. The one
>>peculiarity of the problem is that there are few files and each
>>contains many log statements. I avoid storing the text in the index
>>itself. Given all this, I set up indexing as follows:
>>I iterate over a log file and, for each statement in the log file, I
>>index the statement's content.
>>Here is the Java code that adds the fields:
>>            NumericField startOffset = new NumericField("so", Field.Store.YES, false);
>>            startOffset.setLongValue( statement.getStartOffset() );
>>            doc.add(startOffset);
>>            NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
>>            endOffset.setLongValue( statement.getEndOffset() );
>>            doc.add(endOffset);
>>            NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
>>            timestampField.setLongValue(statement.getStatementTime().getTime());
>>            doc.add(timestampField);
>>            doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO ));
>>            doc.add(new Field("ct", statement.getContent(), Field.Store.NO,
>>                    Field.Index.ANALYZED, Field.TermVector.NO));
>>I am getting the following results (index size vs. log size) with this scheme:
>>The size of the logs is 385MB.
>>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>>385     /var/tmp/logs
>>The size of the index is 143MB.
>>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>>143     /var/tmp/index
>>Is this a normal ratio (143 MB / 385 MB)? It seems a bit too
>>much (I would expect something like 1/5 - 1/7 for the index). Is there
>>anything I can do to move this to the desired ratio? Of course, what
>>would help is the word histogram, and here is the top of the output of
>>the word histogram script that I ran on the logs:
>>Total number of words: 26935271
>>Number of different words: 551981
>>The most common words are:
>>as      3395203
>>10      797708
>>13      797662
>>2011    795595
>>at      787365
>>timer   746790
>>Could anyone suggest a better way to organize the index for my logs? And
>>by better I mean more compact. Or is this as good as it gets? I tried
>>to optimize and got a 2 MB improvement (index went from 145 MB to
>>Could anyone point me to an article that deals with indexing of logs? Any
>>help, suggestions and pointers are greatly appreciated.
>>Thanks for any and all help and cheers,
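
For reference, a word histogram like the one quoted above could be produced
along the following lines. This is only a sketch: the original script is not
shown in the thread, so the tokenization (lower-cased runs of letters or
digits) is an assumption and will not necessarily match either that script or
the Lucene analyzer used for the "ct" field. The class name is hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts how often each token occurs in a set of log lines.
class WordHistogram {

    // Tokens are maximal runs of letters or digits, lower-cased.
    // Everything else (spaces, punctuation, separators) is treated as a delimiter.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] tokens = line.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
            for (String token : tokens) {
                if (token.isEmpty()) {
                    continue; // split() yields an empty token when the line starts with a delimiter
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

The size of the returned map corresponds to the "number of different words"
figure, and the sum of its values to the "total number of words".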

