lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tom Hill <solr-l...@worldware.com>
Subject Re: Question about many fields within a single index
Date Thu, 31 Dec 2009 03:44:39 GMT
Hi -

One thing to consider is field norms. If your fields aren't analyzed, this
doesn't apply to you.

But if you do have norms, I believe that it's one by per field with norms x
number of documents. It doesn't matter if the field occurs in a document or
not, it's nTotalFields x nDocs.

So, an index with 10,000 documents, with one field each, same field for all
docs:

-rw-r--r--   1 tom   wheel   10004 Dec 30 18:54 _0.nrm

a 10,000 doc index, where each doc has one of 100 different field names, but
still only one field per doc:

-rw-r--r--   1 tom   wheel  1000004 Dec 30 18:55 _0.nrm

As you can see, your total space for norms goes up linearly with the number
of fields.

Tom

On Wed, Dec 30, 2009 at 10:19 AM, Renaud Delbru <renaud.delbru@deri.org>wrote:

> Hi,
>
> just sharing some personal experiences in this domain,
>
> We performed some benchmarks in a similar setup (indexing millions of
> documents with thousands of fields) to measure the impact of large number of
> fields on a Lucene index.
> We observed that more you have fields, more the dictionary will become
> large. In fact, Lucene is creating term index by concatenating the field
> name with the terms associated to this field. In the worst case scenario
> (when terms occur in every field), you can have M fields * N terms entries
> in the dictionary.
> As a consequence, a term lookup in the dictionary will take longer. This
> can have a significant impact when the cache is cold. When the cache is warm
> (when parts of the dictionary are in memory), the time overhead is not
> significant, even null.
>
> If, in your data collection, the majority of the terms are locals to one
> field (i.e., when a term occurs only in one single field, like for example
> the timestamp of the document), your dictionary will not grow that much and
> therefore you will probably not notice the increase in time of dictionary
> lookups.
>
> And, as erick explained previously, its depends also of the size of your
> index, and of your use case. If it is relatively small, the overhead will be
> imperceptible. However, if your index is large (millions of documents), you
> will probably notice the overhead the first time the query is executed
> (cold-cache).
>
> I haven't tested with Lucene 3.0, but I have read somewhere (correct me if
> I am wrong) that this new version includes some optimisations for dictionary
> lookups, which should minimize the overhead.
> --
> Renaud Delbru
>
>
> On 30/12/09 16:18, Jason Tesser wrote:
>
>> I have a situation where I might have 1000 different types of Lucene
>> Documents each with 10 or so fields with different names that get
>> indexed.
>>
>> I am wondering if this is bad to do within Lucene.  I end up with
>> 10,000 fields within the index although any given document has only 10
>> or so.
>>
>> I was hoping not to have to have many indexes under the covers if I
>> can avoid it but I don't want performance to suffer either.
>>
>> Any thoughts?
>>
>> Thanks,
>> Jason Tesser
>> dotCMS Lead Development Manager
>> 1-305-858-1422
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message