lucene-general mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Suggestions or best practices for indexing the logs
Date Mon, 17 Oct 2011 14:12:58 GMT
Alex,

You could try compressing the content field - that might help a bit.
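
As a rough illustration of why that helps: repetitive log text deflates very well. Here's a minimal stand-alone sketch using plain java.util.zip (which is roughly what Lucene 3.x's CompressionTools wraps); the class name and sample string are made up for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressDemo {

    // Compress a UTF-8 string with DEFLATE, similar to what
    // CompressionTools.compressString does under the hood.
    static byte[] compress(String text) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inverse operation: inflate the bytes back into the original string.
    static String decompress(byte[] data) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // A repetitive log fragment, typical of timestamped log statements.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("2011-10-13 19:21:05 INFO timer fired as expected\n");
        }
        String content = sb.toString();
        byte[] compressed = compress(content);
        System.out.println(content.length() + " chars -> " + compressed.length + " bytes");
        System.out.println("round-trip ok: " + decompress(compressed).equals(content));
    }
}
```

The trade-off is CPU at index and retrieval time in exchange for smaller stored data; it only pays off for fields that are actually stored.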

Otis
----

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


>________________________________
>From: Alex Shneyderman <a.shneyderman@gmail.com>
>To: general@lucene.apache.org
>Sent: Thursday, October 13, 2011 7:21 PM
>Subject: Suggestions or best practices for indexing the logs
>
>Hello, everybody!
>
>I am trying to introduce faster searches to our application that sifts
>through the logs, and Lucene seems to be the tool to use here. One
>peculiarity of the problem is that there are only a few files, but they
>contain many log statements. I avoid storing the text in the index
>itself. Given all this, I set up indexing as follows:
>
>I iterate over a log file and, for each statement in it, index the
>statement's content.
>
>Here is the java code that does field additions:
>
>            // Start/end byte offsets of the statement in the log file:
>            // stored for retrieval, not indexed (third argument is "index").
>            NumericField startOffset = new NumericField("so", Field.Store.YES, false);
>            startOffset.setLongValue(statement.getStartOffset());
>            doc.add(startOffset);
>
>            NumericField endOffset = new NumericField("eo", Field.Store.YES, false);
>            endOffset.setLongValue(statement.getEndOffset());
>            doc.add(endOffset);
>
>            // Timestamp is both stored and indexed, so it can be range-queried.
>            NumericField timestampField = new NumericField("ts", Field.Store.YES, true);
>            timestampField.setLongValue(statement.getStatementTime().getTime());
>            doc.add(timestampField);
>
>            // File tag: stored only. Content: analyzed and indexed, but
>            // not stored, with term vectors disabled.
>            doc.add(new Field("fn", fileTagName, Field.Store.YES, Field.Index.NO));
>            doc.add(new Field("ct", statement.getContent(), Field.Store.NO,
>                    Field.Index.ANALYZED, Field.TermVector.NO));
>
>I am getting the following results (index size vs. log file size) with this scheme:
>
>The size of the logs is 385MB.
>(00:13:08) /var/tmp/logs > du -ms /var/tmp/logs
>385     /var/tmp/logs
>
>
>The size of the index is 143MB.
>(00:41:26) /var/tmp/index > du -ms /var/tmp/index
>143     /var/tmp/index
>
>Is this a normal ratio, 143 MB / 385 MB? It seems a bit too high (I
>would expect something like 1/5 - 1/7 for the index). Is there
>anything I can do to move it toward the desired ratio? Of course, the
>word histogram would help here, so this is the top of the output of
>the word-histogram script that I ran on the logs:
>
>Total number of words: 26935271
>Number of different words: 551981
>The most common words are:
>as      3395203
>10      797708
>13      797662
>2011    795595
>at      787365
>timer   746790
>...
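
[For reference, a histogram like the one above can be produced by a short stand-alone program along these lines, in Java for consistency with the rest of the thread; the default directory path and the tokenization on non-word characters are assumptions, not necessarily what Alex's script did:]

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordHistogram {

    // Count lower-cased word frequencies over a sequence of lines,
    // splitting on runs of non-word characters.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            for (String w : line.toLowerCase().split("\\W+")) {
                if (w.length() == 0) continue;
                Integer c = counts.get(w);
                counts.put(w, c == null ? 1 : c + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Directory of log files; /var/tmp/logs is just an assumed default.
        File dir = new File(args.length > 0 ? args[0] : "/var/tmp/logs");
        List<String> lines = new ArrayList<String>();
        for (File f : dir.listFiles()) {
            if (!f.isFile()) continue;
            BufferedReader in = new BufferedReader(new FileReader(f));
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
            in.close();
        }
        Map<String, Integer> counts = countWords(lines);
        long total = 0;
        for (int c : counts.values()) total += c;
        System.out.println("Total number of words: " + total);
        System.out.println("Number of different words: " + counts.size());
        System.out.println("The most common words are:");
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        for (int i = 0; i < Math.min(10, entries.size()); i++) {
            System.out.println(entries.get(i).getKey() + "\t" + entries.get(i).getValue());
        }
    }
}
```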
>
>Could anyone suggest a better way to organize the index for my logs? By
>better I mean more compact - or is this as good as it gets? I tried
>optimizing and got a 2 MB improvement (the index went from 145 MB to
>143 MB).
>
>Could anyone point to an article that deals with indexing of logs? Any
>help, suggestions and pointers are greatly appreciated.
>
>Thanks for any and all help and cheers,
>Alex.
>
>
>