lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Lucene 4.3.1 CheckIndex limitation 100 trillion tokens?
Date Tue, 30 Jul 2013 15:22:29 GMT
Can you get a stack trace so we can see where the thread is stuck?
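
For a process that is already running, the usual route is the JDK's jstack tool (jstack <pid>) or sending SIGQUIT (kill -QUIT <pid>), which prints every thread's stack without stopping the JVM. If you wrap CheckIndex in your own code instead, a rough in-process equivalent might look like the sketch below. The class and method names (ThreadDumpSketch, dumpAllThreads) are made up for illustration; only Thread.getAllStackTraces() is a standard JDK call.

```java
import java.util.Map;

public class ThreadDumpSketch {

    /** Formats the stack of every live thread in this JVM, one frame per line. */
    public static String dumpAllThreads() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            // Header line: thread name and its current state (RUNNABLE, WAITING, ...)
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical usage: call this from a watchdog thread or a signal hook.
        System.out.print(dumpAllThreads());
    }
}
```

Taking two or three dumps a few seconds apart helps tell a thread that is truly stuck from one that is just slow.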

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jul 30, 2013 at 11:08 AM, Tom Burton-West <tburtonw@umich.edu> wrote:
> Thanks Mike,
>
> Billion not Trillion Doh!
>
> Wasn't thinking it through when I titled the e-mail.... The total number of
> tokens shouldn't be unusual compared to our other indexes since whether we
> index pages or whole docs, the number of tokens shouldn't change
> significantly. The main difference between this and our other indexes is
> the number of documents. Our regular indexes have maybe 800,000 docs
> whereas these have about 82 million.
>
> I'm not sure what is going on but I'm guessing that the CheckIndex program
> has been caught in some GC loop for the last few days.  I didn't start it
> up with any GC logging or hooks to attach jconsole.  I'm going to kill it
> and maybe try again and give it more memory and maybe turn on GC logging.
>
> Tom
>
>
> On Tue, Jul 30, 2013 at 8:41 AM, Michael McCandless <
> lucene@mikemccandless.com> wrote:
>
>> I think that's ~ 110 billion, not trillion, tokens :)
>>
>> Are you certain you don't have any term vectors?
>>
>> Even if your index has no term vectors, CheckIndex goes through all
>> docIDs trying to load them, but that ought to be very fast, and then
>> you should see "test: doc values..." after that.
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Jul 29, 2013 at 4:30 PM, Tom Burton-West <tburtonw@umich.edu>
>> wrote:
>> > We have very large indexes, almost a terabyte for a single index, and
>> > normally it takes overnight to run a CheckIndex. I started a CheckIndex
>> > on Friday and today (Monday) it seems to be stuck testing vectors
>> > although we haven't got vectors turned on. (See below)
>> > The output file was last written Jul 27 02:28.
>> > Note that in this 750 GB segment we have about 83 million docs with
>> > about 2.4 billion unique terms and about 110 trillion tokens.
>> >
>> > Have we hit a new CheckIndex limit?
>> >
>> >
>> > Tom
>> >
>> > -----------------------
>> >
>> >
>> > Opening index @ /htsolr/lss-dev/solrs/4.2/3/core/data/index
>> >
>> > Segments file=segments_e numSegments=2 version=4.2.1 format= userData={commitTimeMSec=1374712392103}
>> >   1 of 2: name=_bch docCount=82946896
>> >     codec=Lucene42
>> >     compound=false
>> >     numFiles=12
>> >     size (MB)=752,005.689
>> >     diagnostics = {timestamp=1374657630506, os=Linux, os.version=2.6.18-348.12.1.el5, mergeFactor=16, source=merge, lucene.version=4.2.1 1461071 - mark - 2013-03-26 08:23:34, os.arch=amd64, mergeMaxNumSegments=2, java.version=1.6.0_16, java.vendor=Sun Microsystems Inc.}
>> >     no deletions
>> >     test: open reader.........OK
>> >     test: fields..............OK [12 fields]
>> >     test: field norms.........OK [3 fields]
>> >     test: terms, freq, prox...OK [2442919802 terms; 73922320413 terms/docs pairs; 109976572432 tokens]
>> >     test: stored fields.......OK [960417844 total field count; avg 11.579 fields per doc]
>> >     test: term vectors........
>> > ~
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
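
To make the billion-versus-trillion correction concrete, here is a small arithmetic sketch over the counts in the CheckIndex output quoted above. The class and method names are made up for illustration; the numbers are copied from the output. It only checks magnitudes and per-document averages and says nothing about why the term vectors test stalled.

```java
public class TokenCountSketch {
    // Figures copied from the CheckIndex output for segment _bch.
    static final long DOCS = 82946896L;
    static final long TERMS = 2442919802L;
    static final long TERM_DOC_PAIRS = 73922320413L;
    static final long TOKENS = 109976572432L;
    static final long STORED_FIELDS = 960417844L;

    /** Expresses a raw count in billions (1e9). */
    static double inBillions(long n) {
        return n / 1e9;
    }

    /** Average of a count per document in the segment. */
    static double perDoc(long n) {
        return (double) n / DOCS;
    }

    public static void main(String[] args) {
        System.out.printf("tokens: %.1f billion%n", inBillions(TOKENS));
        System.out.printf("term/doc pairs: %.1f billion%n", inBillions(TERM_DOC_PAIRS));
        System.out.printf("unique terms: %.1f billion%n", inBillions(TERMS));
        System.out.printf("tokens per doc: %.1f%n", perDoc(TOKENS));
        System.out.printf("stored fields per doc: %.3f%n", perDoc(STORED_FIELDS));
    }
}
```

The stored-fields average reproduces the 11.579 that CheckIndex itself reports, which is a quick check that the figures were copied correctly.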
