lucene-java-user mailing list archives

From Jason Wee <peich...@gmail.com>
Subject Re: background merge hit exception
Date Thu, 10 Apr 2014 05:10:02 GMT
Hi Jose,

Thank you for the very informative response.

I have commented out the lines of code that call forceMerge(50) and
commit() while indexing is in progress, and also increased the RAM buffer size:

iwc.setRAMBufferSizeMB(512.0);

Only after indexing is done do I forceMerge and commit, this time with a
larger maximum segment count, that is 50:

if (writer != null && forceMerge) {
    writer.forceMerge(50);
    writer.commit();
}

With these changes, the exceptions reported initially are no longer
happening.
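
For reference, a rough sketch of the flow after the change (writer setup
included; "dir" stands for our CassandraDirectory instance, and error
handling is omitted):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    iwc.setRAMBufferSizeMB(512.0);   // larger buffer, fewer tiny flushes
    IndexWriter writer = new IndexWriter(dir, iwc);

    // ... add all documents; no forceMerge()/commit() inside the loop ...

    if (forceMerge) {
        writer.forceMerge(50);       // merge down to at most 50 segments
        writer.commit();
    }
    writer.close();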

Thank you again.

Jason


On Tue, Apr 8, 2014 at 8:50 PM, Jose Carlos Canova <
jose.carlos.canova@gmail.com> wrote:

> Hi Jason,
>
> No, the stack trace clearly shows that the error occurred during the
> merge into a single index segment (the forceMerge parameter defines the
> desired number of segments at the end).
>
> While indexing a document, Lucene may decide to create a new segment
> from the information extracted from the document you created to index.
> The Lucene documentation
> <http://lucene.apache.org/core/3_0_3/fileformats.html> describes each
> file extension and its usage by the program.
>
> forceMerge is optional:
>
> You can also skip forceMerge and leave all segments as is; retrieval of
> results will work in the same manner, maybe a little more slowly because
> the IndexReader will be opened over several index segments. In other
> words, the forceMerge that minimizes the number of index segments can be
> avoided without harming the search results.
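>
> To make that concrete, a minimal sketch of searching a multi-segment
> index with no prior forceMerge ("dir" and "query" are placeholders for
> your Directory and Query):
>
>     DirectoryReader reader = DirectoryReader.open(dir); // sees every segment
>     IndexSearcher searcher = new IndexSearcher(reader);
>     TopDocs hits = searcher.search(query, 10);          // same API either way
>     reader.close();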
>
> Regarding how to index files,
>
> I did something different to index files found in a directory structure.
> I used the FileVisitor
> <http://docs.oracle.com/javase/7/docs/api/java/nio/file/FileVisitor.html>
> to accumulate which files would be targeted for indexing, which means:
> first scan the files, then after the scan extract their content using
> Tika <http://tika.apache.org/> (one choice among several), and finally
> index them.
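>
> A rough sketch of that scan step, assuming a hypothetical root directory
> /data/docs (imports from java.nio.file; exception handling omitted):
>
>     final List<Path> targets = new ArrayList<Path>();
>     Files.walkFileTree(Paths.get("/data/docs"), new SimpleFileVisitor<Path>() {
>         @Override
>         public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
>             targets.add(file);           // collect now, extract and index later
>             return FileVisitResult.CONTINUE;
>         }
>     });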
>
> With this you can avoid some memory issues and separate the "scan
> process (locate the files)" from the content-extraction process (Tika or
> another file-reading routine) and from the "index process (Lucene)",
> because all of them are memory consuming (for example, large PDF files
> or big string segments).
>
> The disadvantage is that the process is a little slower (if all tasks
> run on the same JVM you are obliged to coordinate all the threads), but
> the advantage is that it permits you to divide the work into subtasks
> and distribute them using a cache or a message queue like ActiveMQ
> <http://activemq.apache.org/>. Subtasks on a message queue can also be
> distributed among different processes (JVMs) and machines. In practice
> it takes a little more time, since you have to write some blocks of code
> to manage all of those subtasks; a single-JVM sketch is below.
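>
> Within a single JVM, the same split can be sketched with a plain
> BlockingQueue standing in for the message broker (hypothetical names;
> the producer side is the FileVisitor above calling toExtract.put(file)
> instead of adding to a list):
>
>     final BlockingQueue<Path> toExtract = new LinkedBlockingQueue<Path>();
>
>     // consumer: takes a path, extracts its text with Tika, hands it to Lucene
>     new Thread(new Runnable() {
>         public void run() {
>             try {
>                 while (true) {
>                     Path p = toExtract.take();  // blocks until work arrives
>                     String text = new Tika().parseToString(p.toFile());
>                     // ... build a Document from "text", then writer.addDocument(doc) ...
>                 }
>             } catch (Exception e) {
>                 // log and stop this worker
>             }
>         }
>     }).start();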
>
>
>
> Regards,
>
>
>
>
> On Tue, Apr 8, 2014 at 4:02 AM, Jason Wee <peichieh@gmail.com> wrote:
>
> > Hello Jose,
> >
> > Thank you for your response; I took a closer look. Below are my
> > responses:
> >
> >
> > > Seems that you want to force a max number of segments to 1,
> >
> >       // you're done adding documents to it):
> >       //
> >       writer.forceMerge(1);
> >
> >       writer.close();
> >
> > Yes, the line of code is uncommented because we want to understand how
> > it works when indexing big data sets. Should this be a concern?
> >
> >
> > > On a previous thread someone answered that the number of segments will
> > > affect the index size, and is not related to index integrity (i.e. the
> > > size of the index may vary according to the number of segments).
> >
> > Okay, I am not sure what the above actually means, but I would guess
> > that perhaps the code we added causes this exception?
> >
> >     if (file.isDirectory()) {
> >         String[] files = file.list();
> >         // an IO error could occur
> >         if (files != null) {
> >             for (int i = 0; i < files.length; i++) {
> >                 indexDocs(writer, new File(file, files[i]), forceMerge);
> >                 if (forceMerge && writer.hasPendingMerges()) {
> >                     if (i % 1000 == 0 && i != 0) {
> >                         logger.trace("forcing merge now.");
> >                         try {
> >                             writer.forceMerge(50);
> >                             writer.commit();
> >                         } catch (OutOfMemoryError e) {
> >                             logger.error("out of memory during merging ", e);
> >                             throw new OutOfMemoryError(e.toString());
> >                         }
> >                     }
> >                 }
> >             }
> >         }
> >
> >     } else {
> >         FileInputStream fis;
> >
> >
> > > Should be...
> >
> > > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> > > IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
> >
> > Yes, we were and still are referencing LUCENE_46 in our analyzer.
> >
> >
> > /Jason
> >
> >
> >
> > On Sat, Apr 5, 2014 at 9:01 PM, Jose Carlos Canova <
> > jose.carlos.canova@gmail.com> wrote:
> >
> > > Seems that you want to force a max number of segments to 1,
> > > On a previous thread someone answered that the number of segments will
> > > affect the index size, and is not related to index integrity (i.e. the
> > > size of the index may vary according to the number of segments).
> > >
> > > In version 4.6 there is a small issue in the sample:
> > >
> > > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
> > > IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
> > >
> > >
> > > Should be...
> > >
> > >
> > > Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
> > > IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
> > >
> > >
> > > With this, the line related to the codec will probably change too.
> > >
> > >
> > >
> > > On Fri, Apr 4, 2014 at 3:52 AM, Jason Wee <peichieh@gmail.com> wrote:
> > >
> > > > Hello again,
> > > >
> > > > A little background on our experiment: we are storing Lucene
> > > > (version 4.6.0) indexes on top of Cassandra. We are using the demo
> > > > IndexFiles.java from Lucene with a minor modification such that the
> > > > directory used is a reference to the CassandraDirectory.
> > > >
> > > > With a large data set (that is, indexing more than 50000 files),
> > > > after indexing is done and forceMerge(1) is set, we get the
> > > > following exception:
> > > >
> > > >
> > > > BufferedIndexInput readBytes [ERROR] bufferStart = '0' bufferPosition = '1024' len = '9252' after = '10276'
> > > > BufferedIndexInput readBytes [ERROR] length = '8192'
> > > > caught a class java.io.IOException
> > > > with message: background merge hit exception: _1(4.6):c10250 _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5 [maxNumSegments=1]
> > > > java.io.IOException: background merge hit exception: _1(4.6):c10250 _0(4.6):c10355 _2(4.6):c10297 _3(4.6):c10217 _4(4.6):c8882 into _5 [maxNumSegments=1]
> > > >         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1755)
> > > >         at org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1691)
> > > >         at org.apache.lucene.store.IndexFiles.main(IndexFiles.java:159)
> > > > Caused by: java.io.IOException: read past EOF: CassandraSimpleFSIndexInput(_1.nvd in path="_1.cfs" slice=5557885:5566077)
> > > >         at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:186)
> > > >         at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:125)
> > > >         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.loadNumeric(Lucene42DocValuesProducer.java:230)
> > > >         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesProducer.getNumeric(Lucene42DocValuesProducer.java:186)
> > > >         at org.apache.lucene.index.SegmentCoreReaders.getNormValues(SegmentCoreReaders.java:159)
> > > >         at org.apache.lucene.index.SegmentReader.getNormValues(SegmentReader.java:516)
> > > >         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:232)
> > > >         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:127)
> > > >         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4057)
> > > >         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
> > > >         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
> > > >         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
> > > >
> > > >
> > > > We do not know what is wrong, as our understanding of Lucene is
> > > > limited. Can someone explain what is happening, or what the possible
> > > > source of the error might be?
> > > >
> > > > Thank you and any advice is appreciated.
> > > >
> > > > /Jason
> > > >
> > >
> >
>
