lucene-general mailing list archives

From Alex Shneyderman <a.shneyder...@gmail.com>
Subject Re: Suggestions or best practices for indexing the logs
Date Mon, 17 Oct 2011 14:34:10 GMT
Otis,

I'm not sure I understand - could you elaborate?

Note that the content is not stored in the index itself, hence my
confusion about your suggestion.

Thanks,
Alex.

On Mon, Oct 17, 2011 at 4:12 PM, Otis Gospodnetic
<otis_gospodnetic@yahoo.com> wrote:
> Alex,
>
> You could try compressing the content field - that might help a bit.
>
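[Editor's note: presumably the idea is to compress the stored value before adding it to the document; Lucene 3.x ships CompressionTools for exactly this, which is essentially a thin wrapper over java.util.zip. A stdlib-only sketch of that round trip follows - the class and method names are illustrative, not Lucene's:]

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative sketch: deflate a field value before storing it,
// inflate it again when the stored document is retrieved.
public class StoredFieldCompression {

    public static byte[] compress(String value) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(value.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static String decompress(byte[] data) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toString("UTF-8");
    }
}
```

[Note that this only shrinks *stored* field values, not the inverted index itself - which is the source of the confusion in this thread, since the content field here is analyzed but not stored.]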
> Otis
> ----
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>>________________________________
>>From: Alex Shneyderman <a.shneyderman@gmail.com>
>>To: general@lucene.apache.org
>>Sent: Thursday, October 13, 2011 7:21 PM
>>Subject: Suggestions or best practices for indexing the logs
>>
>>Hello, everybody!
>>
>>I am trying to introduce faster searches to our application, which sifts
>>through logs, and Lucene seems to be the right tool for the job. One
>>peculiarity of the problem is that there are only a few files, but each
>>contains many log statements. I avoid storing the text in the index
>>itself. Given all this, I set up indexing as follows:
>>
>>I iterate over each log file and, for each statement in it, index the
>>statement's content.
>>
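[Editor's note: the message doesn't show how the start/end offsets are obtained, so the following stdlib-only sketch is an assumption - it treats each line as one log statement and records its byte offsets, the kind of values fed into the "so"/"eo" fields below:]

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical offset scanner: one statement per line, byte offsets
// recorded per statement (real log statements may span multiple lines).
public class OffsetScanner {

    public static final class Span {
        public final long start, end;
        public Span(long start, long end) { this.start = start; this.end = end; }
    }

    public static List<Span> scan(byte[] log) {
        List<Span> spans = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < log.length; i++) {
            if (log[i] == '\n') {
                spans.add(new Span(start, i)); // end offset excludes the newline
                start = i + 1;
            }
        }
        if (start < log.length) {
            spans.add(new Span(start, log.length)); // trailing line without newline
        }
        return spans;
    }
}
```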
>>Here is the java code that does field additions:
>>
>>    NumericField startOffset = new NumericField("so", Field.Store.YES, false);
>>    startOffset.setLongValue(statement.getStartOffset());
>>    doc.add(startOffset);
>>
>>    NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
>>    endOffset.setLongValue(statement.getEndOffset());
>>    doc.add(endOffset);
>>
>>    NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
>>    timestampField.setLongValue(statement.getStatementTime().getTime());
>>    doc.add(timestampField);
>>
>>    doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
>>    doc.add(new Field("ct", statement.getContent(), Field.Store.NO,
>>            Field.Index.ANALYZED, Field.TermVector.NO));
>>
>>I am getting the following results (index size vs. log size) with this scheme:
>>
>>The size of the logs is 385MB.
>>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>>385     /var/tmp/logs
>>
>>
>>The size of the index is 143MB.
>>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>>143     /var/tmp/index
>>
>>Is 143MB / 385MB (about 1/2.7) a normal ratio? It seems a bit high - I
>>would expect something like 1/5 to 1/7 for the index. Is there anything
>>I can do to move it toward that ratio? A word histogram would of course
>>help here, so below is the top of the output of the histogram script
>>that I ran over the logs:
>>
>>Total number of words: 26935271
>>Number of different words: 551981
>>The most common words are:
>>as      3395203
>>10      797708
>>13      797662
>>2011    795595
>>at      787365
>>timer   746790
>>...
>>
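[Editor's note: the histogram script itself isn't shown, so the tokenization rule in this stdlib-only sketch - lowercasing and splitting on non-word characters - is an assumption about what it does:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the histogram script's logic: tokenize, count,
// then sort descending by frequency to get the most common words.
public class WordHistogram {

    public static Map<String, Long> count(String text) {
        Map<String, Long> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(n, entries.size()));
    }
}
```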
>>Could anyone suggest a better (that is, more compact) way to organize
>>the index for my logs, or is this as good as it gets? I tried
>>optimizing the index and gained only 2MB (it went from 145MB to 143MB).
>>
>>Could anyone also point me to an article that deals with indexing logs?
>>Any help, suggestions, or pointers are greatly appreciated.
>>
>>Thanks for any and all help and cheers,
>>Alex.
>>
