lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: docid is just a signed int32
Date Fri, 19 Aug 2016 16:43:37 GMT
Hi,

The Lucene-internal DocId is not a unique identifier; it is not even stable!
It is just a temporary number that identifies a document within an index segment / shard,
and it is only valid for the lifetime of an IndexReader.
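
As a rough illustration of why such a number is only meaningful relative to one reader, here is a simplified sketch in plain Java. It is not Lucene's actual code: the class DocIdSketch and its methods are hypothetical, and it assumes every segment is non-empty. Only the limit value, Integer.MAX_VALUE - 128, matches IndexWriter.MAX_DOCS.

```java
import java.util.Arrays;

/** Simplified illustration only; hypothetical names, not Lucene's implementation. */
final class DocIdSketch {
    // Same value as Lucene's IndexWriter.MAX_DOCS.
    static final int MAX_DOCS = Integer.MAX_VALUE - 128;

    /** Computes the first reader-wide docid (the "doc base") of each segment. */
    static int[] docBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        long total = 0; // a long, so the running sum cannot overflow silently
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = (int) total; // safe: total was checked against MAX_DOCS
            total += segmentMaxDocs[i];
            if (total > MAX_DOCS) {
                throw new IllegalArgumentException("Too many documents: " + total);
            }
        }
        return bases;
    }

    /**
     * Maps a reader-wide docid to {segment index, segment-local docid}.
     * Assumes 0 <= globalDocId < total number of documents.
     */
    static int[] resolve(int[] bases, int globalDocId) {
        int idx = Arrays.binarySearch(bases, globalDocId);
        if (idx < 0) {
            idx = -idx - 2; // insertion point minus one
        }
        return new int[] { idx, globalDocId - bases[idx] };
    }

    public static void main(String[] args) {
        // Three segments before a merge, one segment after it.
        int[] before = docBases(new int[] {100, 50, 25});
        int[] after = docBases(new int[] {175});
        System.out.println(Arrays.toString(resolve(before, 120)));
        System.out.println(Arrays.toString(resolve(after, 120)));
    }
}
```

The same reader-wide number resolves to a different segment and local position once the segment layout changes, which is why an application that needs a durable key should store one in a field of its own.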

Lucene (and Solr / Elasticsearch) can hold "indexes" with far more than 2 billion documents,
because they shard internally (as a database also does). Direct Lucene users simply work
at a lower level than "application" / "database" users. Would you care how MySQL internally
addresses the rows in its tables?
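
To make the sharding point concrete, here is a minimal plain-Java sketch of merging per-shard top hits. It only imitates the idea behind TopDocs.merge and does not use the Lucene API; the classes ShardMergeSketch and Hit and the method names are hypothetical. The point is that a hit is identified by the pair (shard, docId), so the total document count can exceed what a single int docid can address.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java sketch of merging per-shard top hits; all names are hypothetical. */
final class ShardMergeSketch {

    /** One hit: the docId is only meaningful together with its shard. */
    static final class Hit {
        final int shard;
        final int docId;
        final float score;

        Hit(int shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Merges the per-shard hit lists and keeps the k best hits overall. */
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first; ties broken by shard, then by docId.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.shard)
                .thenComparingInt(h -> h.docId));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    /** Total document count across shards; a long, since it may exceed an int. */
    static long totalDocs(int[] docsPerShard) {
        long total = 0;
        for (int n : docsPerShard) {
            total += n;
        }
        return total;
    }
}
```

Each shard stays below the per-index limit on its own, while the combined collection is counted in a long.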

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Cristian Lorenzetto [mailto:cristian.lorenzetto@gmail.com]
> Sent: Thursday, August 18, 2016 5:58 PM
> To: Lucene Users <java-user@lucene.apache.org>
> Subject: Re: docid is just a signed int32
> 
> Normally databases support at least a long primary key.
> Try asking Twitter, for example, which grows by more than 4 petabytes
> every year :) Maybe they use storage devices bigger than a PC's
> storage :)
> However, if you offer the possibility of using shards ... that is an option
> anyway :)
> For this reason, my suggestion was different ... it was not about the size of
> the repository, but about the size of the search result :):):)
> 
> "A suggestion for possible future changes is to use an Iterator instead of a
> Java array. An Iterator is a more scalable ADT that does not consume memory
> for returning documents."
> 
> It is just a suggestion anyway for my beloved Lucene :):)
> 
> 
> 2016-08-18 17:43 GMT+02:00 Greg Bowyer <gbowyer@fastmail.co.uk>:
> 
> > What are you trying to index that has more than 3 billion documents per
> > shard / index and can not be split as Adrien suggests?
> >
> >
> >
> > On Thu, Aug 18, 2016, at 07:35 AM, Cristian Lorenzetto wrote:
> > > Maybe Lucene has a maximum size of 2^31 because result sets are Java
> > > arrays, whose length is an int.
> > > A suggestion for possible future changes is to use an Iterator instead of a
> > > Java array. An Iterator is a more scalable ADT that does not consume memory
> > > for returning documents.
> > >
> > >
> > > 2016-08-18 16:03 GMT+02:00 Glen Newton <glen.newton@gmail.com>:
> > >
> > > > Or maybe it is time Lucene re-examined this limit.
> > > >
> > > > There are use cases out there where >2^31 does make sense in a single
> > > > index (huge number of tiny docs).
> > > >
> > > > Also, I think the underlying hardware and the JDK have advanced to
> > > > make this more defendable.
> > > >
> > > > Constructively,
> > > > Glen
> > > >
> > > >
> > > > On Thu, Aug 18, 2016 at 9:55 AM, Adrien Grand <jpountz@gmail.com> wrote:
> > > >
> > > > > No, IndexWriter enforces that the number of documents cannot go over
> > > > > IndexWriter.MAX_DOCS (which is a bit less than 2^31) and
> > > > > BaseCompositeReader computes the number of documents in a long variable
> > > > > and ensures it is less than 2^31, so you cannot have indexes that
> > > > > contain more than 2^31 documents.
> > > > >
> > > > > Larger collections should be written to multiple shards and use
> > > > > TopDocs.merge to merge results.
> > > > >
> > > > > On Thu, Aug 18, 2016 at 15:38, Cristian Lorenzetto
> > > > > <cristian.lorenzetto@gmail.com> wrote:
> > > > >
> > > > > > docid is a signed int32, so it is not that big, but docid really
> > > > > > seems to be not an unmodifiable primary key but a temporary id for
> > > > > > the view related to a specific search.
> > > > > >
> > > > > > So a repository can contain more than 2^31 documents.
> > > > > >
> > > > > > Is my deduction correct? Is there a maximum size for a Lucene index?
> > > > > >
> > > > >
> > > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >

