lucene-java-user mailing list archives

From Jason Tesser <>
Subject Re: Question about many fields within a single index
Date Thu, 31 Dec 2009 11:09:54 GMT
Right, we do analyze a number of fields. We use the WhitespaceAnalyzer
whenever we have a text field, so maybe 5 analyzed fields on average per
document. It can be more, of course.
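
As a rough sanity check on what analyzed fields cost, the two .nrm file sizes in Tom's listings quoted below fit a simple formula: one byte per (field with norms, document) pair, plus 4 bytes that look like a fixed file header. The sketch below is only back-of-the-envelope arithmetic; the class and method names are made up for illustration and are not part of Lucene, and the last call uses hypothetical numbers (1000 document types with 5 analyzed fields each, 1,000,000 documents).

```java
// Back-of-the-envelope estimate only; not part of Lucene's API.
final class NormsEstimate {
    // Assumption: the 4 extra bytes seen in the quoted listings are a fixed file header.
    static final long HEADER_BYTES = 4;

    /** One byte per (field with norms, document), whether or not the field occurs in the doc. */
    static long normsFileBytes(long fieldsWithNorms, long docs) {
        return HEADER_BYTES + fieldsWithNorms * docs;
    }

    public static void main(String[] args) {
        // 1 field, 10,000 docs: matches the first listing (10004 bytes).
        System.out.println(normsFileBytes(1, 10_000));
        // 100 fields, 10,000 docs: matches the second listing (1000004 bytes).
        System.out.println(normsFileBytes(100, 10_000));
        // Hypothetical: 1000 document types x 5 analyzed fields, 1,000,000 documents.
        System.out.println(normsFileBytes(1000 * 5, 1_000_000));
    }
}
```

Under those hypothetical numbers the estimate comes to roughly 5 GB of norms, which suggests disabling norms on fields that do not need length normalization or index-time boosts.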

Jason Tesser
dotCMS Lead Development Manager

On Wed, Dec 30, 2009 at 10:44 PM, Tom Hill <> wrote:
> Hi -
> One thing to consider is field norms. If your fields aren't analyzed, this
> doesn't apply to you.
> But if you do have norms, I believe that it's one byte per field with norms x
> number of documents. It doesn't matter whether the field occurs in a
> document or not; it's nTotalFields x nDocs.
> So, an index with 10,000 documents, with one field each, same field for all
> docs:
> -rw-r--r--   1 tom   wheel   10004 Dec 30 18:54 _0.nrm
> A 10,000-doc index, where each doc has one of 100 different field names, but
> still only one field per doc:
> -rw-r--r--   1 tom   wheel  1000004 Dec 30 18:55 _0.nrm
> As you can see, your total space for norms goes up linearly with the number
> of fields.
> Tom
> On Wed, Dec 30, 2009 at 10:19 AM, Renaud Delbru <> wrote:
>> Hi,
>> just sharing some personal experiences in this domain,
>> We performed some benchmarks in a similar setup (indexing millions of
>> documents with thousands of fields) to measure the impact of a large number
>> of fields on a Lucene index.
>> We observed that the more fields you have, the larger the dictionary
>> becomes. Lucene builds its term index by concatenating the field name
>> with the terms associated with that field. In the worst case (when
>> every term occurs in every field), you can have M fields * N terms
>> entries in the dictionary.
>> As a consequence, a term lookup in the dictionary will take longer. This
>> can have a significant impact when the cache is cold. When the cache is
>> warm (when parts of the dictionary are in memory), the time overhead is
>> not significant, or even nil.
>> If, in your data collection, the majority of the terms are local to one
>> field (i.e., a term occurs in only a single field, such as the
>> timestamp of the document), your dictionary will not grow that much, and
>> you will probably not notice the increase in dictionary lookup time.
>> And, as Erick explained previously, it also depends on the size of your
>> index and on your use case. If the index is relatively small, the overhead
>> will be imperceptible. However, if your index is large (millions of
>> documents), you will probably notice the overhead the first time a query
>> is executed (cold cache).
>> I haven't tested with Lucene 3.0, but I have read somewhere (correct me if
>> I am wrong) that this new version includes some optimisations for dictionary
>> lookups, which should minimize the overhead.
>> --
>> Renaud Delbru
>> On 30/12/09 16:18, Jason Tesser wrote:
>>> I have a situation where I might have 1000 different types of Lucene
>>> Documents each with 10 or so fields with different names that get
>>> indexed.
>>> I am wondering if this is bad to do within Lucene.  I end up with
>>> 10,000 fields within the index although any given document has only 10
>>> or so.
>>> I was hoping not to have to have many indexes under the covers if I
>>> can avoid it but I don't want performance to suffer either.
>>> Any thoughts?
>>> Thanks,
>>> Jason Tesser
>>> dotCMS Lead Development Manager
>>> 1-305-858-1422
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail: