From: Celso Fontes
Date: Sun, 12 Dec 2010 00:46:23 -0200
Subject: Re: Problems with "tagged" and "non tagged" text
To: java-user@lucene.apache.org

Dear Erick,

Sorry, I really am using the "AND" operator; I wrote it wrong in my email (I am very tired)...

Here is the main part of the code:

    Document document = new Document();
    String path = file.getCanonicalPath();
    document.add(new Field("title", path, Field.Store.YES, Field.Index.ANALYZED));
    Reader reader = new FileReader(file);
    document.add(new Field("content", reader));

As you can see, I am indexing the content! As for the other questions, I get good results with the htm files. This htm file, for example, works fine for this question:

******APC (adenomatous polyposis coli) Colon Cancer

Thanks,
Celso.

2010/12/12 Erick Erickson:
> Unless you provide details on how you are indexing these documents,
> it's pretty hard to help.
>
> It's also hard to reconcile your statement that OR is the default operator
> with the results you posted, the '+' all over the place really points to AND
> as the default.
>
> There's no magic in Lucene that will automatically put the "content" of
> an (X)HTM document in the content field of your document, how are you
> insuring that the doc is indexed as you expect?
>
> Luke is a very valuable tool for inspecting your index to see if it is what
> you think it is...
>
> Best
> Erick
>
> On Sat, Dec 11, 2010 at 8:34 PM, Celso Fontes wrote:
>
>> Hi, i have the same text in two files:
>>
>> ****TXT    file: http://pastebin.com/u9Rd9VVA
>> ****(X)HTM file: http://pastebin.com/ydHmTQZ8
>>
>> And i running this Question:
>>
>>   APC (adenomatous polyposis coli) actin assembly
>>
>> with OR operator and SNOWBALL Analyser results in:
>>
>>   +content:apc +(+content:adenomat +content:polyposi +content:coli)
>>   +content:actin +content:assembl
>>
>>
>> But... only txt returns ok, why?
>>
>>
>> ps: if i try without "()" i got the same result....
>> Thanks,
>> Celso
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
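
Erick's remark that nothing in Lucene automatically extracts the text of an (X)HTM file can be illustrated with a small sketch. The thread does not establish that markup is the reason the htm file fails to match, so this is only one thing to rule out. The class and method names below (HtmlTextExtractor, toPlainText) are hypothetical, and the regex-based stripping is a crude stand-in for a real HTML parser; it only shows the idea of reducing the file to visible text before it reaches the "content" field.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: reduces (X)HTML to plain text
// with regular expressions. A real HTML parser would be more robust.
public class HtmlTextExtractor {
    // Whole <script>...</script> and <style>...</style> blocks, case-insensitive.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // HTML comments.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    // Runs of whitespace, collapsed at the end.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a handful of common entities; "&amp;" goes last so
        // that "&amp;lt;" is not decoded twice.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><p><b>APC</b> (adenomatous polyposis coli)"
                + " actin&nbsp;assembly</p></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

The resulting string could then be wrapped in a java.io.StringReader and passed to the same new Field("content", reader) call used in the message, in place of the FileReader over the raw file. Inspecting the index with Luke, as suggested above, would show which terms actually ended up in the content field either way.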