lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Wikipedia Index
Date Tue, 19 Jun 2012 20:29:39 GMT
Likely the bottleneck is pulling content from the database?  Maybe
test just that and see how long it takes?
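
A minimal sketch of that test: drain the source with no Lucene calls and time only the fetch. `SourceTimer`, `drain`, and `Result` are hypothetical names, not part of Lucene or of the code quoted below, and the list in `main` is a stand-in for the real page iterator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: times how long it takes just to pull items (and their
// text) out of a source, with no indexing work at all.
class SourceTimer {

    /** Item count, total text length, and elapsed time of one drain. */
    static final class Result {
        final long items;
        final long chars;
        final long elapsedNanos;

        Result(long items, long chars, long elapsedNanos) {
            this.items = items;
            this.chars = chars;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /**
     * Iterates the whole source and extracts the text of every item, so any
     * lazy loading or parsing done by the source is included in the timing.
     */
    static <T> Result drain(Iterable<T> source, Function<? super T, String> textOf) {
        long start = System.nanoTime();
        long items = 0;
        long chars = 0;
        for (T item : source) {
            String text = textOf.apply(item);
            items++;
            chars += text.length();
        }
        return new Result(items, chars, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data; a real run would pass the database-backed iterable.
        List<String> fakePages = Arrays.asList("first page", "second page", "third");
        Result r = drain(fakePages, s -> s);
        System.out.printf("%d items, %d chars, %.1f ms%n",
                r.items, r.chars, r.elapsedNanos / 1e6);
    }
}
```

With the quoted code, the source would be `wiki.getPages()` and the extractor something like `p -> p.getPlainText()`. If this loop alone takes most of the 24 hours, the database is the bottleneck, not Lucene.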

24 hours is way too long to index all of Wikipedia.  For example, we
index Wikipedia every night for our trunk/4.0 performance tests, here:

    http://people.apache.org/~mikemccand/lucenebench/indexing.html

The export is a bit old now (01/15/2011) but it takes just under 6
minutes to fully index it.  This is on a fairly beefy machine (24
cores)... and trunk/4.0 has substantial concurrency improvements over
3.x.
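
One way to use several cores is to feed documents from multiple threads, since a single `IndexWriter` can be shared across threads. The following is a sketch under assumptions, not the benchmark's actual code: `ParallelFeeder`, `feed`, and `queueCapacity` are hypothetical names, and the `sink` stands in for whatever builds a `Document` and calls `writer.addDocument(doc)`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical helper: reads items on the calling thread and hands each one
// to a fixed pool of worker threads that run the sink.
class ParallelFeeder {

    /** Feeds every item to the sink using the given number of threads; returns items processed. */
    static <T> long feed(Iterable<T> source, Consumer<? super T> sink,
                         int threads, int queueCapacity) {
        // Bounded queue plus CallerRunsPolicy: when the workers fall behind,
        // the reading thread processes an item itself instead of buffering
        // millions of pages in memory.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicLong done = new AtomicLong();
        for (T item : source) {
            pool.execute(() -> {
                sink.accept(item);
                done.incrementAndGet();
            });
        }
        pool.shutdown();
        try {
            // Wait for all queued work to finish before reporting the count.
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        // Stand-in data and sink; a real sink would add a document to the index.
        List<String> fakePages = Arrays.asList("alpha", "beta", "gamma", "delta");
        AtomicLong totalChars = new AtomicLong();
        long n = feed(fakePages, s -> totalChars.addAndGet(s.length()), 4, 16);
        System.out.println("fed " + n + " items, " + totalChars.get() + " chars");
    }
}
```

This only helps if the source can supply pages faster than one thread can index them; if the database fetch is the slow part, extra indexing threads will sit idle.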

You can also try the ideas here:

    http://wiki.apache.org/lucene-java/ImproveIndexingSpeed

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jun 19, 2012 at 12:27 PM, Elshaimaa Ali
<elshaimaa.ali@hotmail.com> wrote:
>
> Hi everybody
> I'm using Lucene 3.6 to index Wikipedia documents, which is over 3 million articles. The
> data is in a MySQL database and it has taken more than 24 hours so far. Do you know any tips
> that can speed up the indexing process?
> Here is my code:
> public static void main(String[] args) {
>     String indexPath = INDEXPATH;
>     IndexWriter writer = null;
>     DatabaseConfiguration dbConfig = new DatabaseConfiguration();
>     dbConfig.setHost(host);
>     dbConfig.setDatabase(data);
>     dbConfig.setUser(user);
>     dbConfig.setPassword(password);
>     dbConfig.setLanguage(Language.english);
>
>     try {
>         Directory dir = FSDirectory.open(new File(indexPath));
>         Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
>         IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31, analyzer);
>         iwc.setOpenMode(OpenMode.CREATE);
>         writer = new IndexWriter(dir, iwc);
>     } catch (IOException e) {
>         System.out.println(" caught a " + e.getClass() +
>             "\n with message: " + e.getMessage());
>     }
>
>     try {
>         Wikipedia wiki = new Wikipedia(dbConfig);
>         Iterable<Page> wikipages = wiki.getPages(); // get wikipedia articles from the database
>         Iterator iter = wikipages.iterator();
>         while (iter.hasNext()) {
>             Page p = (Page) iter.next();
>             System.out.println(p.getTitle().getPlainTitle());
>             Document doc = new Document();
>             Field contentField = new Field("contents", p.getPlainText(), Field.Store.NO,
>                 Field.Index.ANALYZED);
>             Field titleField = new Field("title",
>                 p.getTitle().getPlainTitle(), Field.Store.YES, Field.Index.NOT_ANALYZED);
>             doc.add(contentField); // wiki page text
>             doc.add(titleField); // wiki page title
>             writer.addDocument(doc);
>         }
>     } catch (Exception e) {
>         e.printStackTrace();
>     }
> }
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

