lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Indexing and Searching linked files
Date Tue, 19 Jan 2010 13:43:54 GMT
What field size limit are you talking about here? If you mean the
maximum field length, 10,000 tokens per field is the default, but you
can increase it to Integer.MAX_VALUE.

So are you really talking about billions of tokens here? Your index
quickly becomes unmanageable if you allow it to grow
in increments that large.

One can argue, IMO, that the first N (10M, say) tokens/file is
"enough" and there's not much real value in the rest, but that
can be a weak argument depending on the problem space....
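
To make the "first N tokens per file" idea concrete, here is a minimal
sketch in plain Java with no Lucene dependency. The class name TokenCap,
the method firstTokens, and the whitespace tokenization are assumptions
made up for illustration; a real analyzer will split text differently.
Inside Lucene itself (2.9/3.0), I believe the equivalent knob is the
IndexWriter.MaxFieldLength argument, where LIMITED is the 10,000-token
default and UNLIMITED is Integer.MAX_VALUE, but check the javadoc for
your version.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper, not part of Lucene: keeps only the first
// maxTokens whitespace-separated tokens of a file's text.
class TokenCap {

    public static List<String> firstTokens(Reader in, int maxTokens) {
        List<String> tokens = new ArrayList<String>();
        Scanner scanner = new Scanner(in);
        // Stop reading as soon as the cap is reached, so a huge file
        // is never consumed past the point we care about.
        while (tokens.size() < maxTokens && scanner.hasNext()) {
            tokens.add(scanner.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> kept =
            firstTokens(new StringReader("one two three four five"), 3);
        System.out.println(kept); // prints [one, two, three]
    }
}
```

Whatever N you settle on, the point is that the cost per file is then
bounded no matter how large the file is.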

But if you're really committed to indexing an unbounded number
of arbitrarily large files...you'll fail. Sometime, somewhere, somebody
will want to index enough to violate whatever limits you have (disk,
memory, time, whatever). So I think you'd be farther ahead to ask your
product manager what limits are reasonable and go from there...

HTH
Erick

On Tue, Jan 19, 2010 at 7:57 AM, Anna Hunecke <annahunecke@yahoo.de> wrote:

> Hi!
> I have been working with Lucene for a while now. So far, I have found
> helpful tips on this list, so I hope somebody can help me with my problem:
>
> In our app, information is grouped in so-called cards. We now need to
> make the files linked to the cards searchable as well. You can link
> arbitrarily many files to a card, and the size of the files is also not
> restricted.
> So, as far as I can see, there are two ways to do this:
>
> 1. Add the content of the files to the search index of the card. First, I
> thought that I could just have an additional field in the index which
> contains the content of all the files. But then, if the files are very big,
> I could hit the field size limit, and would possibly not get the content of
> all files indexed. So, I would need one field per file. The problem I have
> then is that I don't know how many files I have and how large the index
> would get. This is risky, because some customers have a lot of data.
>
> 2. Create a separate index for files. The documents in this index would
> contain one file each, so I would not have the problem that I don't know how
> many fields I have. But then, the searching is a problem:
> I would need to search both the card index and the file index, and somehow
> merge the results together. I always sort by score, but, as I understand it,
> the scores of results from two different indexes are not comparable.
>
> So, which way do you think is better?
>
> Best,
> Anna
