Date: Tue, 19 Jan 2010 16:43:48 +0200
Message-ID: <2ffb6d061001190643s63412102tc4facf4ae2537630@mail.gmail.com>
Subject: Re: Indexing and Searching linked files
From: Danil ŢORIN <torindan@gmail.com>
To: java-user@lucene.apache.org

You can simply index both "files" and "cards" into the same index (no need
for two indexes).

Lucene easily supports documents with different structures.

You can add boosting per field or per document, and tune the similarity
to get the most important results to the top.
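
To make the idea concrete, here is a rough sketch in plain Java. It is not
Lucene code: the class SingleIndexSketch, the field names ("type", "cardId",
"title", "content") and the two weights are all made up for illustration.
It only shows the shape of the approach: card documents and file documents
sit in one collection, each file document carries the id of its card, and
hits are folded back onto cards with a per-field weight standing in for a
boost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for a single index that holds two kinds of documents.
// All names and weights here are assumptions, not part of any Lucene API.
public class SingleIndexSketch {
    // Hypothetical per-field weights, standing in for field boosts.
    private static final double TITLE_BOOST = 2.0;
    private static final double CONTENT_BOOST = 1.0;

    // Every entry is one "document": a bag of named fields. Cards and files
    // share the same list but carry different fields.
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void addCard(String cardId, String title) {
        Map<String, String> doc = new HashMap<>();
        doc.put("type", "card");
        doc.put("cardId", cardId);
        doc.put("title", title);
        docs.add(doc);
    }

    public void addFile(String cardId, String content) {
        Map<String, String> doc = new HashMap<>();
        doc.put("type", "file");
        doc.put("cardId", cardId); // link back to the owning card
        doc.put("content", content);
        docs.add(doc);
    }

    // Case-insensitive whole-word match; a missing field never matches.
    private static boolean containsTerm(String text, String term) {
        if (text == null) {
            return false;
        }
        return Arrays.asList(text.toLowerCase().split("\\W+"))
                .contains(term.toLowerCase());
    }

    // Returns card ids, best score first; ties are broken by card id.
    public List<String> search(String term) {
        Map<String, Double> scoreByCard = new HashMap<>();
        for (Map<String, String> doc : docs) {
            double score = 0.0;
            if (containsTerm(doc.get("title"), term)) {
                score += TITLE_BOOST;
            }
            if (containsTerm(doc.get("content"), term)) {
                score += CONTENT_BOOST;
            }
            if (score > 0.0) {
                // Card hits and file hits accumulate on the same card id.
                scoreByCard.merge(doc.get("cardId"), score, Double::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scoreByCard.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scoreByCard.get(b), scoreByCard.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        SingleIndexSketch index = new SingleIndexSketch();
        index.addCard("card-1", "Quarterly report");
        index.addCard("card-2", "Lucene migration notes");
        index.addFile("card-1", "This attachment mentions Lucene once.");
        System.out.println(index.search("lucene")); // prints [card-2, card-1]
    }
}
```

Because everything lives in one collection, all scores come from the same
scoring pass and can be compared directly. In a real index the same layout
would mean one document per card and one document per file, with the card id
stored on each file document.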


On Tue, Jan 19, 2010 at 16:35, Anna Hunecke <annahunecke@yahoo.de> wrote:
> The field size is restricted to 1 million tokens, because of the very
> reasons you mentioned.
> So, even if I have one separate field for the content of a file, I might
> reach the limit if the file is really big. But I can't help that. What I
> want to avoid is that the whole content of some files can not be found
> because I used one field for the content of all files and they just could
> not be appended anymore.
>
> --- Erick Erickson <erickerickson@gmail.com> wrote on Tue, 19 Jan 2010:
>
>> From: Erick Erickson <erickerickson@gmail.com>
>> Subject: Re: Indexing and Searching linked files
>> To: java-user@lucene.apache.org
>> Date: Tuesday, 19 January 2010, 14:43
>> What field size limit are you talking about here? Because 10,000
>> tokens is the default, but you can increase it to Integer.MAX_VALUE.
>>
>> So are you really talking billions of tokens here? Your index
>> quickly becomes unmanageable if you're allowing it to grow
>> by such increments.
>>
>> One can argue, IMO, that the first N (10M, say) tokens/file is
>> "enough" and there's not much real value in the rest, but that
>> can be a weak argument depending on the problem space....
>>
>> But if you're really committed to indexing an unbounded number
>> of arbitrarily large files...you'll fail. Sometime, somewhere, somebody
>> will want to index enough to violate whatever limits you have (disk,
>> memory, time, whatever). So I think you'd be farther ahead to ask your
>> product manager what limits are reasonable and go from there...
>>
>> HTH
>> Erick
>>
>> On Tue, Jan 19, 2010 at 7:57 AM, Anna Hunecke <annahunecke@yahoo.de>
>> wrote:
>>
>> > Hi!
>> > I have been working with Lucene for a while now. So far, I found helpful
>> > tips on this list, so I hope somebody can help me with my problem:
>> >
>> > In our app information is grouped in so-called cards. Now, it should be
>> > made possible to also search on files linked to the cards. You can link
>> > arbitrarily many files to a card and the size of the files is also not
>> > restricted.
>> > So, as far as I can see, there are two ways to do this:
>> >
>> > 1. Add the content of the files to the search index of the card. First, I
>> > thought that I could just have an additional field in the index which
>> > contains the content of all the files. But then, if the files are very big,
>> > I could hit the field size limit, and would possibly not get the content of
>> > all files indexed. So, I would need one field per file. The problem I have
>> > then is that I don't know how many files I have and how large the index
>> > would get. This is risky, because some customers have a lot of data.
>> >
>> > 2. Create a separate index for files. The documents in this index would
>> > contain one file each, so I would not have the problem that I don't know how
>> > many fields I have. But then, the searching is a problem:
>> > I would need to search on both the card and the document index, and somehow
>> > merge the results together. I sort by score always, but, as I understand it,
>> > the scores of the results of two different indexes are not comparable.
>> >
>> > So, which way do you think is better?
>> >
>> > Best,
>> > Anna
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >
>> >
