lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Anna Hunecke <annahune...@yahoo.de>
Subject Re: Indexing and Searching linked files
Date Tue, 19 Jan 2010 14:35:54 GMT
The field size is restricted to 1 million tokens, because of the very reasons you mentioned.
So, even if I have one separate field for the content of a file, I might reach the limit if
the file is really big. But I can't help that. What I want to avoid is that the whole content
of some files can not be found because I used one field for the content of all files and they
just could not be appended anymore.

--- Erick Erickson <erickerickson@gmail.com> schrieb am Di, 19.1.2010:

> Von: Erick Erickson <erickerickson@gmail.com>
> Betreff: Re: Indexing and Searching linked files
> An: java-user@lucene.apache.org
> Datum: Dienstag, 19. Januar 2010, 14:43
> What field size limit are you talking
> about here? Because 10,000
> tokens is the default, but you can increase it to
> Integer.MAX_VALUE.
> 
> So are you really talking billions of tokens here? Your
> index
> quickly becomes unmanageable if you're allowing it to grow
> by such increments.
> 
> One can argue, IMO, that the first N (10M, say) tokens/file
> is
> "enough" and there's not much real value in the rest, but
> that
> can be a weak argument depending on the problem space....
> 
> But if you're really committed to indexing an unbounded
> number
> of arbitrarily large files...you'll fail. Sometime,
> somewhere, somebody
> will want to index enough to violate whatever limits you
> have (disk,
> memory, time, whatever). So I think you'd be farther ahead
> to ask your
> product manager what limits are reasonable and go from
> there...
> 
> HTH
> Erick
> 
> On Tue, Jan 19, 2010 at 7:57 AM, Anna Hunecke <annahunecke@yahoo.de>
> wrote:
> 
> > Hi!
> > I have been working with Lucene for a while now. So
> far, I found helpful
> > tips on this list, so I hope somebody can help me with
> my problem:
> >
> > In our app information is grouped in so-called cards.
> Now, it should be
> > made possible to also search on files linked to the
> cards. You can link
> > arbitrarily many files to a card and the size of the
> files is also not
> > restricted.
> > So, as far as I can see, there are two ways to do
> this:
> >
> > 1. Add the content of the files to the search index of
> the card. First, I
> > thought that I could just have an additional field in
> the index which
> > contains the content of all the files. But then, if
> the files are very big,
> > I could hit the field size limit, and would possibly
> not get the content of
> > all files indexed. So, I would need one field per
> file. The problem I have
> > then is that I don't know how many files I have and
> how large the index
> > would get. This is risky, because some customers have
> a lot of data.
> >
> > 2. Create a separate index for files. The documents in
> this index would
> > contain one file each, so I would not have the problem
> that I don't know how
> > many fields I have. But then, the searching is a
> problem:
> > I would need to search on both the card and the
> document index, and somehow
> > merge the results together. I sort by score always,
> but, as I understand it,
> > the scores of the results of two different indexes are
> not comparable.
> >
> > So, which way do you think is better?
> >
> > Best,
> > Anna
> >
> > __________________________________________________
> > Do You Yahoo!?
> > Sie sind Spam leid? Yahoo! Mail verfügt über einen
> herausragenden Schutz
> > gegen Massenmails.
> > http://mail.yahoo.com
> >
> >
> ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> 

__________________________________________________
Do You Yahoo!?
Sie sind Spam leid? Yahoo! Mail verfügt über einen herausragenden Schutz gegen Massenmails.

http://mail.yahoo.com 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message