lucene-java-user mailing list archives

From Renaud Delbru <renaud.del...@deri.org>
Subject Re: Question on number of fields in a document
Date Fri, 12 Mar 2010 14:44:44 GMT
A large number of fields combined with a large number of terms can
become a bottleneck. Each field has its own list of terms, which means
that the dictionary, in the worst case, can be of size n*m (with n the
number of fields and m the number of distinct terms).
This adds overhead to term lookup when n and m are large (a term lookup
occurs for each keyword in a query).
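
To make the n*m worst case concrete, here is a toy model in plain Java.
It is only a sketch and not Lucene's actual data structure: the class
name, the method name and the sample documents are made up for the
illustration. It treats the dictionary as the set of distinct
(field, term) pairs and counts them.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only (hypothetical names): the dictionary is modelled as the
// set of distinct (field, term) pairs, because each field keeps its own terms.
final class FieldTermDictionaryModel {

    // Collects one entry per distinct (field, term) pair across all documents.
    static Set<List<String>> buildDictionary(List<Map<String, List<String>>> docs) {
        Set<List<String>> dictionary = new HashSet<>();
        for (Map<String, List<String>> doc : docs) {
            for (Map.Entry<String, List<String>> field : doc.entrySet()) {
                for (String term : field.getValue()) {
                    dictionary.add(List.of(field.getKey(), term));
                }
            }
        }
        return dictionary;
    }

    public static void main(String[] args) {
        // n = 3 field names, m = 2 distinct terms, every term used in every field.
        List<Map<String, List<String>>> docs = List.of(
            Map.of("title", List.of("red", "car"), "tags", List.of("red", "car")),
            Map.of("notes", List.of("red", "car")));
        // Worst case: n * m = 6 dictionary entries for only 2 distinct words.
        System.out.println(buildDictionary(docs).size());
    }
}
```

With user-defined field names, n grows with the number of users, so the
number of entries can grow much faster than the vocabulary itself.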

Another problem (for the end user) with an arbitrary number of fields
is that the user has to know exactly which field names to query. By
default, Lucene cannot search efficiently across an arbitrary number of
fields unless you create a "content" field in which you index the
values from all the other fields. This duplicates the data inside the
index (indexing the same data twice is cheap, but it can be problematic
for a very large index).
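
The catch-all field idea can be sketched as follows, again in plain Java
without the Lucene API. The class and method names and the sample
metadata are assumptions made for the illustration; only the field name
"content" comes from the text above. A document is modelled as a map
from field name to value, and every value is copied into one extra
field.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch only (hypothetical names): models a document as field name -> value.
final class CatchAllFieldSketch {

    static final String CATCH_ALL = "content";

    // Returns a copy of the document with an extra "content" field that
    // concatenates the values of all the other fields.
    static Map<String, String> withCatchAll(Map<String, String> doc) {
        StringJoiner all = new StringJoiner(" ");
        for (String value : doc.values()) {
            all.add(value);
        }
        Map<String, String> out = new LinkedHashMap<>(doc);
        out.put(CATCH_ALL, all.toString());
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("camera", "nikon");    // user-defined field
        doc.put("location", "galway"); // user-defined field
        // Every value now appears twice: once in its own field, once in "content".
        System.out.println(withCatchAll(doc));
    }
}
```

A query against "content" no longer needs to know the user-defined field
names, at the cost of storing each value's terms a second time.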

We recently released a plugin for Lucene (SIREn [1]) that tackles this
particular problem. It was initially developed to build a search engine
for RDF data (a standard model for data interchange on the web). It
lets you index an arbitrary number of fields without running into the
two problems above, while keeping web-scale performance. In addition,
it supports keyword search on the field names and offers better support
for multi-valued fields.

I think the best thing is to give it a try: run a benchmark with both
Lucene and SIREn and see which one better fits your needs (in terms of
response time and of search capabilities). If your index stays
relatively small (a few thousand or maybe millions of documents), then
Lucene may be a good choice, but if you expect a large index (millions
of documents) with an arbitrary number of fields (thousands, or even
tens of thousands), then SIREn may be more suitable.

[1] http://siren.sindice.com/
-- 
Renaud Delbru

On 12/03/10 13:43, Erick Erickson wrote:
> There's no requirement that all documents have the same
> fields, Lucene is fine with different docs having different
> fields.
>
> There's no limit on the number of different fields allowed
> that I know of, but I'm sure someone will chime in if there
> is....
>
> HTH
> Erick
>
> On Fri, Mar 12, 2010 at 7:51 AM, Vinicius Carvalho <viniciusccarvalho@gmail.com> wrote:
>
>> Hello there! We are indexing metadata for our media. One idea is that
>> each user adds their own metadata, so each document may have a different
>> number/name/type of fields. Is this OK in Lucene? I mean, is Lucene OK
>> with this relaxed approach?
>>
>> Also, considering that each user may define its own metadata, we may have
>> several different types of fields. Is there a limit for this?
>>
>> Regards
>>
>> --
>> The intuitive mind is a sacred gift and the
>> rational mind is a faithful servant. We have
>> created a society that honors the servant and
>> has forgotten the gift.
>>
>>      
>    