lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Paul Sondag" <jsond...@uiuc.edu>
Subject Re: Indexing correctly?
Date Sun, 12 Aug 2007 03:44:49 GMT
It takes roughly 6 hours for me to index a Gig of data.  The benchmarks take
quite a bit less if I'm reading it correctly.  I'll try out the
StringBuffer/Builder and let you know.  Thanks for the quick response and if
you have any more suggestions please let me know.

--JP

On 8/11/07, karl wettin <karl.wettin@gmail.com> wrote:
>
> How much slower than anticipated is it?
>
> I would start by using a StringBuffer/Builder rather than appending
> (immutable) strings to each other.
>
>
> 11 aug 2007 kl. 19.05 skrev John Paul Sondag:
>
> > Hi,
> >
> > I was hoping that maybe you guys could see if I'm somehow indexing
> > inefficiently.  I'm putting relevant parts of my code below.  I've
> > looked at
> > the "benchmarks" page on Lucene and my indexing time is taking a
> > substantial
> > amount of time more than what I see posted.  I'm not sure when I
> > should call
> > flush() ( I saw that I should be doing that on the
> > ImproveIndexingSpeed
> > page).  I'd really appreciate any advice.
> >
> > Here's my code:
> >
> >       File directory = new File( "/mounts/falcon5/disks/0/tcheng3/
> > Dataset");
> >       File[] theFiles = directory.listFiles();
> >
> >           //go through each file inside the directory and index it
> >           for(int curFile = 0; curFile < theFiles.length; curFile++)
> >           {
> >               File fin=theFiles[curFile];
> >
> >               //open up the file
> >               FileInputStream inf = new FileInputStream(fin);
> >               InputStreamReader isr = new InputStreamReader(inf,
> > "US-ASCII");
> >               BufferedReader in = new BufferedReader(isr);
> >               String text="";
> >               String docid="";
> >
> >             while (true) {
> >
> >             //read in the file one line at a time, and act accordingly
> >                 String line = in.readLine();
> >                 if (line == null) { break;}
> >
> >                  if (line.startsWith("<DOC>") ) {
> >                     //get docID
> >                     line = in.readLine();
> >                     String tempStr = line.substring(8,line.length());
> >                     int pos = tempStr.indexOf(' ');
> >                     docid = tempStr.substring(0,pos);
> >                     }else if (line.startsWith("</DOC>")) {
> >
> >                     Document doc = new Document();
> >
> >                       doc.add(new Field("contents",text,
> > Field.Store.NO,
> > Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS ));
> >                     doc.add(new Field("DocID",docid, Field.Store.YES,
> > Field.Index.NO));
> >                       writer.addDocument(doc);
> >                     text="";
> >                 } else {
> >                     text = text + "\n" + line;
> >                 }
> >             }
> >
> >         }
> >
> >
> >           int numIndexed = writer.docCount();
> >
> >           writer.optimize();
> >           writer.close();
> >
> >
> > Thanks,
> >
> > --JP
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message