lucene-java-user mailing list archives

From Tom Burton-West <tburt...@umich.edu>
Subject Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?
Date Fri, 02 Aug 2013 18:12:12 GMT
Thanks Robert,

Looks like it switches between seekCeil and seekExact:

"main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekCeil(CompressingTermVectorsReader.java:846)
        at org.apache.lucene.index.TermsEnum.seekCeil(TermsEnum.java:89)
        at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1110)
        at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)



"main" prio=10 tid=0x000000000e79a000 nid=0x5fe5 runnable [0x00002b32de0cc000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVTermsEnum.seekExact(CompressingTermVectorsReader.java:857)
        at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1103)
        at org.apache.lucene.index.CheckIndex.testTermVectors(CheckIndex.java:1503)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:613)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1854)

I don't think highlighting is too slow (at least for our small indexes),
but I will take a look at the PostingsHighlighter.


Tom

>
>
> Hi Tom: with this large term vector file it's not really 343GB but, as far
> as CheckIndex is concerned, it's treated as 1000 343MB indexes (maybe more,
> they are compressed also), because each document's term vector is like a
> little inverted index for the document. Each one is on your large full-text
> field, so it has its own term dictionary and "postings" (all those
> positions/offsets from your doc) to verify. It's probably the case that term
> vectors with huge numbers of unique terms aren't particularly optimized for
> your use case either: for example, the seekCeil() operation looks like a
> linear scan to me, and CheckIndex tests term seeking if the TermsEnum
> supports ord (which it does). You could probably use jstack to confirm some
> of this. Was highlighting with vectors horribly slow? :)
>
> It's off-topic, but maybe something like PostingsHighlighter would be a
> better fit for you, as it wouldn't duplicate the terms or positions, just
> encode some offsets into the .pay file.
>
> Anyway, in my opinion, we should think about a JIRA issue such that if you
> pass the -verbose flag to CheckIndex it prints some status information
> about its progress. We could also think about trying to improve seekCeil
> for term vector term dictionaries...
>
>
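
To make the point about seekCeil() concrete, here is a minimal sketch of why a linear-scan seek gets expensive when a checker seeks to many terms in a document with many unique terms. This is not Lucene's code: the class name, method names, and term count are hypothetical, and seeking to every term is a simplifying assumption about what a verifier might do. It only compares the number of term comparisons for a linear scan against a binary search over the same sorted term list.

```java
// Illustration only: hypothetical names, not the Lucene implementation.
class SeekCeilSketch {
    // Running count of term comparisons, reset before each experiment.
    static long comparisons = 0;

    /** Linear scan: index of the first term >= target, or terms.length if none. */
    static int linearSeekCeil(String[] sortedTerms, String target) {
        for (int i = 0; i < sortedTerms.length; i++) {
            comparisons++;
            if (sortedTerms[i].compareTo(target) >= 0) {
                return i;
            }
        }
        return sortedTerms.length;
    }

    /** Binary search: same result as the linear scan, with O(log n) comparisons. */
    static int binarySeekCeil(String[] sortedTerms, String target) {
        int lo = 0;
        int hi = sortedTerms.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            comparisons++;
            if (sortedTerms[mid].compareTo(target) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int n = 5_000; // assumed number of unique terms in one document's vector
        String[] terms = new String[n];
        for (int i = 0; i < n; i++) {
            // Zero padding keeps the generated terms in sorted order.
            terms[i] = String.format("term%08d", i);
        }

        comparisons = 0;
        for (String t : terms) {
            linearSeekCeil(terms, t);
        }
        long linear = comparisons;

        comparisons = 0;
        for (String t : terms) {
            binarySeekCeil(terms, t);
        }
        long binary = comparisons;

        System.out.println("linear scan comparisons:   " + linear);
        System.out.println("binary search comparisons: " + binary);
    }
}
```

With a linear scan, seeking to each of n terms once costs n(n+1)/2 comparisons in total, so the work per document grows quadratically with the number of unique terms, and that cost is paid again for every document's term vector.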
