Date: Tue, 19 Jan 2010 16:43:48 +0200
From: Danil ŢORIN
To: java-user@lucene.apache.org
Subject: Re: Indexing and Searching linked files
Message-ID: <2ffb6d061001190643s63412102tc4facf4ae2537630@mail.gmail.com>
In-Reply-To: <6839.98888.qm@web26203.mail.ukl.yahoo.com>

You can simply index both "files" and "cards" into the same index (no need
for two indexes). Lucene easily supports documents of different structure.
You may add some boosting per field or document, and tune the similarity to
get the most important stuff on top.

On Tue, Jan 19, 2010 at 16:35, Anna Hunecke wrote:
> The field size is restricted to 1 million tokens, because of the very
> reasons you mentioned.
> So, even if I have one separate field for the content of a file, I might
> reach the limit if the file is really big. But I can't help that. What I
> want to avoid is that the whole content of some files cannot be found
> because I used one field for the content of all files and they just could
> not be appended anymore.
>
> --- Erick Erickson wrote on Tue, 19 Jan 2010:
>
>> From: Erick Erickson
>> Subject: Re: Indexing and Searching linked files
>> To: java-user@lucene.apache.org
>> Date: Tuesday, 19 January 2010, 14:43
>>
>> What field size limit are you talking about here? Because 10,000
>> tokens is the default, but you can increase it to Integer.MAX_VALUE.
>>
>> So are you really talking billions of tokens here?
>> Your index quickly becomes unmanageable if you're allowing it to grow
>> by such increments.
>>
>> One can argue, IMO, that the first N (10M, say) tokens/file is
>> "enough" and there's not much real value in the rest, but that
>> can be a weak argument depending on the problem space....
>>
>> But if you're really committed to indexing an unbounded number
>> of arbitrarily large files...you'll fail. Sometime, somewhere, somebody
>> will want to index enough to violate whatever limits you have (disk,
>> memory, time, whatever). So I think you'd be farther ahead to ask your
>> product manager what limits are reasonable and go from there...
>>
>> HTH
>> Erick
>>
>> On Tue, Jan 19, 2010 at 7:57 AM, Anna Hunecke wrote:
>>
>> > Hi!
>> > I have been working with Lucene for a while now. So far, I found helpful
>> > tips on this list, so I hope somebody can help me with my problem:
>> >
>> > In our app information is grouped in so-called cards. Now, it should be
>> > made possible to also search on files linked to the cards. You can link
>> > arbitrarily many files to a card and the size of the files is also not
>> > restricted.
>> > So, as far as I can see, there are two ways to do this:
>> >
>> > 1. Add the content of the files to the search index of the card. First, I
>> > thought that I could just have an additional field in the index which
>> > contains the content of all the files. But then, if the files are very big,
>> > I could hit the field size limit, and would possibly not get the content of
>> > all files indexed. So, I would need one field per file. The problem I have
>> > then is that I don't know how many files I have and how large the index
>> > would get. This is risky, because some customers have a lot of data.
>> >
>> > 2. Create a separate index for files.
>> > The documents in this index would
>> > contain one file each, so I would not have the problem that I don't know how
>> > many fields I have. But then, the searching is a problem:
>> > I would need to search on both the card and the document index, and somehow
>> > merge the results together. I sort by score always, but, as I understand it,
>> > the scores of the results of two different indexes are not comparable.
>> >
>> > So, which way do you think is better?
>> >
>> > Best,
>> > Anna

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
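
To make the single-index suggestion above concrete, here is a minimal sketch in plain Java. It is a toy model and deliberately does not use the Lucene API: the class SingleIndexSketch, the "type" and "cardId" fields, the boost values, and the naive term-count scoring are all assumptions made for illustration. The point it tries to show is that cards and files can live in one collection with different fields, be ranked in one result list, and be mapped back to their card through a shared id.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: NOT the Lucene API. Every name here is hypothetical.
final class SingleIndexSketch {

    // One indexed entry: either a "card" or a "file", each with its own fields.
    static final class Doc {
        final String type;    // "card" or "file"
        final String cardId;  // id of the card this entry is, or is linked to
        final float boost;    // document-level boost
        final Map<String, String> fields = new HashMap<>();

        Doc(String type, String cardId, float boost) {
            this.type = type;
            this.cardId = cardId;
            this.boost = boost;
        }

        Doc field(String name, String value) {
            fields.put(name, value);
            return this;
        }
    }

    static final class Hit {
        final Doc doc;
        final float score;

        Hit(Doc doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private final List<Doc> docs = new ArrayList<>();
    private final Map<String, Float> fieldBoosts = new HashMap<>();

    void setFieldBoost(String field, float boost) {
        fieldBoosts.put(field, boost);
    }

    void add(Doc doc) {
        docs.add(doc);
    }

    // Naive score: doc boost * sum over fields of (field boost * term count).
    // Because every entry is scored the same way, the scores are comparable.
    List<Hit> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Hit> hits = new ArrayList<>();
        for (Doc doc : docs) {
            float score = 0f;
            for (Map.Entry<String, String> f : doc.fields.entrySet()) {
                int count = 0;
                for (String token : f.getValue().toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (token.equals(needle)) {
                        count++;
                    }
                }
                score += count * fieldBoosts.getOrDefault(f.getKey(), 1.0f);
            }
            score *= doc.boost;
            if (score > 0f) {
                hits.add(new Hit(doc, score));
            }
        }
        hits.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return hits;
    }

    // Builds a small mixed collection: two cards and one file linked to card c1.
    static SingleIndexSketch demoIndex() {
        SingleIndexSketch index = new SingleIndexSketch();
        index.setFieldBoost("title", 3.0f); // assumed: card titles matter most
        index.add(new Doc("card", "c1", 1.0f)
                .field("title", "Quarterly budget")
                .field("body", "planning notes"));
        index.add(new Doc("file", "c1", 1.0f)
                .field("content", "budget budget spreadsheet"));
        index.add(new Doc("card", "c2", 1.0f)
                .field("title", "Travel")
                .field("body", "budget for the trip"));
        return index;
    }

    public static void main(String[] args) {
        for (Hit hit : demoIndex().search("budget")) {
            System.out.println(hit.doc.type + " " + hit.doc.cardId + " score=" + hit.score);
        }
    }
}
```

In this toy data the card c1 ranks first because of its boosted title, the file linked to c1 comes second, and the card c2 comes last. Since each file is its own entry, no single field has to hold the content of all files, and a file hit can still be resolved to its card through cardId.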