lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Danil ŢORIN <torin...@gmail.com>
Subject Re: Indexing and Searching linked files
Date Tue, 19 Jan 2010 14:43:48 GMT
You can simple index both "files" and "cards" into same index (no need
for 2 indexes)

Lucene easily support documents of different structure.

You may add some boosting per field or document, and tune similarity
to get most important stuff in top.


On Tue, Jan 19, 2010 at 16:35, Anna Hunecke <annahunecke@yahoo.de> wrote:
> The field size is restricted to 1 million tokens, because of the very reasons you mentioned.
> So, even if I have one separate field for the content of a file, I might reach the limit
if the file is really big. But I can't help that. What I want to avoid is that the whole content
of some files can not be found because I used one field for the content of all files and they
just could not be appended anymore.
>
> --- Erick Erickson <erickerickson@gmail.com> schrieb am Di, 19.1.2010:
>
>> Von: Erick Erickson <erickerickson@gmail.com>
>> Betreff: Re: Indexing and Searching linked files
>> An: java-user@lucene.apache.org
>> Datum: Dienstag, 19. Januar 2010, 14:43
>> What field size limit are you talking
>> about here? Because 10,000
>> tokens is the default, but you can increase it to
>> Integer.MAX_VALUE.
>>
>> So are you really talking billions of tokens here? Your
>> index
>> quickly becomes unmanageable if you're allowing it to grow
>> by such increments.
>>
>> One can argue, IMO, that the first N (10M, say) tokens/file
>> is
>> "enough" and there's not much real value in the rest, but
>> that
>> can be a weak argument depending on the problem space....
>>
>> But if you're really committed to indexing an unbounded
>> number
>> of arbitrarily large files...you'll fail. Sometime,
>> somewhere, somebody
>> will want to index enough to violate whatever limits you
>> have (disk,
>> memory, time, whatever). So I think you'd be farther ahead
>> to ask your
>> product manager what limits are reasonable and go from
>> there...
>>
>> HTH
>> Erick
>>
>> On Tue, Jan 19, 2010 at 7:57 AM, Anna Hunecke <annahunecke@yahoo.de>
>> wrote:
>>
>> > Hi!
>> > I have been working with Lucene for a while now. So
>> far, I found helpful
>> > tips on this list, so I hope somebody can help me with
>> my problem:
>> >
>> > In our app information is grouped in so-called cards.
>> Now, it should be
>> > made possible to also search on files linked to the
>> cards. You can link
>> > arbitrarily many files to a card and the size of the
>> files is also not
>> > restricted.
>> > So, as far as I can see, there are two ways to do
>> this:
>> >
>> > 1. Add the content of the files to the search index of
>> the card. First, I
>> > thought that I could just have an additional field in
>> the index which
>> > contains the content of all the files. But then, if
>> the files are very big,
>> > I could hit the field size limit, and would possibly
>> not get the content of
>> > all files indexed. So, I would need one field per
>> file. The problem I have
>> > then is that I don't know how many files I have and
>> how large the index
>> > would get. This is risky, because some customers have
>> a lot of data.
>> >
>> > 2. Create a separate index for files. The documents in
>> this index would
>> > contain one file each, so I would not have the problem
>> that I don't know how
>> > many fields I have. But then, the searching is a
>> problem:
>> > I would need to search on both the card and the
>> document index, and somehow
>> > merge the results together. I sort by score always,
>> but, as I understand it,
>> > the scores of the results of two different indexes are
>> not comparable.
>> >
>> > So, which way do you think is better?
>> >
>> > Best,
>> > Anna
>> >
>> > __________________________________________________
>> > Do You Yahoo!?
>> > Sie sind Spam leid? Yahoo! Mail verfügt über einen
>> herausragenden Schutz
>> > gegen Massenmails.
>> > http://mail.yahoo.com
>> >
>> >
>> ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
>>
>
> __________________________________________________
> Do You Yahoo!?
> Sie sind Spam leid? Yahoo! Mail verfügt über einen herausragenden Schutz gegen Massenmails.
> http://mail.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message