lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Indexing speed
Date Mon, 31 Jan 2005 05:13:09 GMT
I believe most of the time is being spent in the Analyzer.  It should
be easy to empirically test this claim by using Field.Keyword instead
of Field.Text (Field.Keyword fields are not analyzed).  If that turns
out to be correct, then you could experiment with writing a custom,
optimized Analyzer.
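
As a rough illustration of that kind of measurement, here is a minimal
stdlib-only sketch. It is not Lucene code: tokenize() is a hypothetical
stand-in that only imitates a letter-based, lowercasing analyzer, and the
class and method names are made up for this example. It times the same
loop with and without the per-line analysis step, which is the comparison
that swapping Field.Text for Field.Keyword would give you inside Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Stand-in experiment: how much per-line time goes to tokenizing?
 * Assumption: tokenize() merely approximates a simple letter analyzer.
 */
class AnalyzerCostSketch {

    /** Splits on non-letter characters and lowercases each token. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Runs the per-line work over all lines and returns elapsed nanoseconds. */
    static long time(String[] lines, boolean analyze) {
        long sink = 0; // accumulated so the JIT cannot discard the work
        long start = System.nanoTime();
        for (String line : lines) {
            sink += analyze ? tokenize(line).size() : line.length();
        }
        long elapsed = System.nanoTime() - start;
        if (sink < 0) {
            System.out.println(sink); // never true; only keeps sink live
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, shaped like the log lines quoted below.
        String sample = "[2004-12-21 00:00:00,935 INFO ][ommand.ProcessFaxCmd] "
                + "Finished processing [MsgCount=0]";
        String[] lines = new String[100_000];
        Arrays.fill(lines, sample);
        System.out.println("tokens:        " + tokenize(sample));
        System.out.println("analyzed ns:   " + time(lines, true));
        System.out.println("unanalyzed ns: " + time(lines, false));
    }
}
```

The absolute numbers mean little (no JIT warm-up, no real index); only
the ratio between the two runs is of interest, and a profiler on the real
test-bed would be more trustworthy.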

Otis

--- Paul Smith <psmith@aconex.com> wrote:

> This relates to a previous post of mine regarding Context of 'lines' of
> text (log4j events in my case):
>
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg11869.html
> 
> I'm going through the process of writing quick and dirty 
> test-case/test-bed classes to validate whether my ideas are going to 
> work or not. 
> 
> For my first test, I thought I would write a quick indexer that indexed
> a traditional log file by lines, with each line being a Document, so
> that I could then search for matching lines and then do a context
> search.  Yes, this is exactly what 'grep' does and does very well, but I
> thought if one was doing a lot of analysis of a log file (typical when
> mentally analysing log files) it might be best to index it once, and
> then search quickly many times.
> 
> Turns out that even using JUST a RAMDirectory (which surprised me),
> writing a Document for every line of text takes significantly longer
> than I had hoped.  I played around with the mergeFactor settings etc.,
> but nothing really made much difference to the indexing speed, other
> than NOT adding the Document to the index....
> I have tried this out on my Mac laptop, as well as a test Linux server,
> with no noticeable difference.  (In both scenarios the log file being
> read and the new index are on the same physical drive, which I know is
> not the _best_ setup, but still).
> 
> This could well be my own stupidity, so here's what I'm doing.
> 
> Statistics on the Log File
> =================
> 
> The log file is 28meg, consisting of 409566 lines, of the form:
> 
> [2004-12-21 00:00:00,935 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished processing [mail box=stagingfax][MsgCount=0]
> [2004-12-21 00:00:00,986 INFO ][ommand.ProcessFaxCmd][http-80-Processor9][192.168.0.220][] Finished processing [mail box=aconexnz9000][MsgCount=0]
> [2004-12-21 00:00:01,126 INFO ][             monitor][http-80-Processor9][192.168.0.220][] Controller duration: 212ms url=/Fax, fowardDuration=-1, total=212
> [2004-12-21 00:00:03,668 ERROR][essFaxDeliveryAction][Thread-157][][] Could not connect to mail server! [host=test.aconex.com][username=outboundstagingfax][password=d3vf@x]
> javax.mail.AuthenticationFailedException: Login failed: authentication failure
>         at com.sun.mail.imap.IMAPStore.protocolConnect(IMAPStore.java:330)
>         at javax.mail.Service.connect(Service.java:233)
>         at javax.mail.Service.connect(Service.java:134)
>         at com.aconex.fax.action.ProcessFaxDeliveryAction.perform(ProcessFaxDeliveryAction.java:68)
>         at com.aconex.scheduler.automatedTasks.FaxOutDeliveryMessageProcessorAT.run(FaxOutDeliveryMessageProcessorAT.java:62)
> 
> 
> ==================
> Source code for test-bed:
> ==================
> 
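> import java.io.BufferedReader;
> import java.io.File;
> import java.io.FileReader;
> import java.text.NumberFormat;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.SimpleAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.store.RAMDirectory;
>
> public class TestBed1 {
>
>     public static void main(String[] args) throws Exception {
>
>         if (args.length < 1) throw new IllegalArgumentException("not enough args");
>         String filename = args[0];
>
>         File file = new File(filename);
>         Analyzer a = new SimpleAnalyzer();
>
>         // Only used by the disk-based writer, currently commented out.
>         String indexLoc = "/tmp/testbed1/";
>
>         //IndexWriter writer = new IndexWriter(indexLoc, a, true);
>
>         // In-memory index, to take disk writes out of the picture.
>         RAMDirectory ramDir = new RAMDirectory();
>         IndexWriter ramWriter = new IndexWriter(ramDir, a, true);
>
>         long length = file.length();
>
>         BufferedReader fileReader = new BufferedReader(new FileReader(file));
>
>         String line = "";
>         // Approximate progress: counts characters rather than bytes and
>         // ignores line terminators, so it will not quite reach 100%.
>         double processed = 0;
>         NumberFormat nf = NumberFormat.getPercentInstance();
>         nf.setMaximumFractionDigits(0);
>
>         String percent = "";
>         String lastPercent = " ";
>         long lines = 0;
>         while ((line = fileReader.readLine()) != null) {
>             // One Document per log line; indexed and tokenized, not stored.
>             Document doc = new Document();
>             doc.add(Field.UnStored("Line", line));
>             ramWriter.addDocument(doc);
>             processed += line.length();
>             lines++;
>             // Print only when the whole-number percentage changes.
>             percent = nf.format(processed / length);
>             if (!percent.equals(lastPercent)) {
>                 lastPercent = percent;
>                 System.out.println(percent + "(lines=" + lines + ")");
>             }
>         }
>         // Release the reader and flush the in-memory index.
>         fileReader.close();
>         ramWriter.close();
>         //writer.optimize();
>         //writer.close();
>     }
> }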
> 
> =======
> 
> I did other simple tests, timing exactly how long it takes Java to
> just read the lines of the file, and that is mega quick in comparison.
> It's actually the "ramWriter.addDocument(doc)" line which seems to have
> the biggest amount of work to do, and probably for good reason.  I had
> originally tried to use Field.Text(...) to keep the line with the index
> for Context later on, but even Field.UnStored doesn't really make that
> much difference from a stopwatch point of view (Field.Text creates a
> bigger index, of course).
> 
> I might set up a profiler and work through where it's taking the
> time, but you guys probably already know the answer.
>
> I'm going to need much higher throughput for my utility to be useful.
> Maybe that's just not achievable.
> 
> Thoughts?
> 
> cheers,
> 
> Paul Smith
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 
> 



