lucene-java-user mailing list archives

From Geebee Coder <g.b.co...@gmail.com>
Subject Re: Using Lucene to model ownership of documents
Date Wed, 22 Jun 2016 15:34:34 GMT
Thanks Denis. My mistake. For a and b, indexing speed, index size and search
performance were similar.
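
To illustrate why a and b would be expected to behave alike, here is a rough sketch that counts the (field, term) pairs each layout produces for one document. It is not Lucene code. It assumes scenario a only adds a field for customers that actually have access, and the `owner_<id>` field naming is made up for the example.

```python
def terms_scenario_a(owners):
    # Scenario a: one boolean field per customer.
    # Assumption: only customers with access get a field on the document.
    # The "owner_<id>" field name is hypothetical.
    return {(f"owner_{customer}", "true") for customer in owners}


def terms_scenario_b(owners):
    # Scenario b: a single "customers" field holding all owner ids,
    # split on whitespace into one term per customer.
    text = " ".join(sorted(owners))
    return {("customers", token) for token in text.split()}


owners = {"abc", "abd", "cfx"}
a_terms = terms_scenario_a(owners)
b_terms = terms_scenario_b(owners)

# Both layouts yield one (field, term) pair per owning customer.
print(len(a_terms), len(b_terms))
```

Under that assumption the per-document term count is the same in both layouts; what differs is the number of distinct fields, which is much larger in a.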

I agree on the simplicity comment.
For anyone who might come across this, here's our best solution so far
(for Elasticsearch).


For every customer, use Elasticsearch's nested fields,
e.g. ownership of a document by customers aab and aac is represented as
a: ab, ac
This solution compacts the index size but increases the indexing time
somewhat. Search performance is similar to having one "ownership" field with
all the customers concatenated.
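
A minimal sketch of how the "a: ab, ac" representation could be built, assuming it means splitting each customer id into a one-character prefix (used as the key) and the remaining characters (stored as values). That reading is an assumption, the helper names are hypothetical, and this is plain Python rather than Elasticsearch API calls.

```python
from collections import defaultdict


def group_by_prefix(customer_ids, prefix_len=1):
    # Assumption: the key is the first `prefix_len` characters of the id,
    # and the value list holds the remainder of each id.
    grouped = defaultdict(list)
    for customer_id in sorted(customer_ids):
        grouped[customer_id[:prefix_len]].append(customer_id[prefix_len:])
    return dict(grouped)


def owns(grouped, customer_id, prefix_len=1):
    # Ownership check against the grouped form: look up the prefix,
    # then look for the remainder among its values.
    prefix, rest = customer_id[:prefix_len], customer_id[prefix_len:]
    return rest in grouped.get(prefix, [])


doc_owners = group_by_prefix(["aab", "aac"])
print(doc_owners)  # {'a': ['ab', 'ac']}
print(owns(doc_owners, "aab"), owns(doc_owners, "abb"))  # True False
```

Under this reading, the shared prefix is stored once per document instead of once per customer, which would fit the smaller index, while the extra grouping step at write time would fit the slightly longer indexing.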




On Thu, Jun 16, 2016 at 9:27 PM, Denis Bazhenov <dotsid@gmail.com> wrote:

> The speed for a and b should be the same, at least from a conceptual point
> of view. The number of terms generated for each scenario is equal.
> Therefore, index size and vocabulary size should be the same.
>
> I’m wondering why there is a difference. It seems like there is some penalty
> for writing/reading terms for different fields, but I can’t elaborate on
> that. Could you provide the index size for scenarios a and b?
>
> Scenario c could be the fastest in terms of search and indexing speed, but
> it’s far more complex and makes sense only if you need to scale
> your system, which implies you can’t solve the problem on a single box.
>
> So, if there is no need for scaling, I’d go with b because of simplicity.
>
> > On Jun 15, 2016, at 23:25, Geebee Coder <g.b.coder@gmail.com> wrote:
> >
> > Hi there,
> > I would like to use Lucene to solve the following problem:
> >
> > 1. We have about 100k customers and 25 million documents.
> >
> > 2. When a customer performs a text search on the document space, we want to
> > return only documents that the customer has access to.
> >
> > 3. The number of documents a customer owns varies a lot. Some have close to 23
> > million, some have close to 10k, some own a third of the documents, etc.
> >
> > What is an efficient way to use Lucene in this scenario in terms of
> > performance and indexing?
> > We have tried a number of solutions such as
> >
> > a) 100k boolean fields per document that indicate whether a customer has
> > access to the document.
> > b) A single text field that has a list of the customers who own the document,
> > e.g. (customers field : "abc abd cfx...")
> > c) the above option with shards by customer
> >
> > The search and indexing performance for a was bad. b and c performed better for
> > search but increased the indexing time and the index size.
> > We are also thinking about using a custom filter, but we are concerned about
> > the memory requirements.
> >
> > Any ideas/suggestions would be really appreciated.
>
> ---
> Denis Bazhenov <dotsid@gmail.com>
