Subject: RE: HTMLParser.getReader returning null
Date: Fri, 12 Nov 2004 11:41:34 +0100
Message-ID: <950FF7DE40C2B64CAF80B564732927E10AB33A@exchange.be.bvd>
From: "Vanlerberghe, Luc"
To: "Lucene Users List"

If you use the Field.Text(String name, Reader value) version of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored. That is also why doc.get("contents") prints null in your snippet.
This means you will be able to search for and find that document, but to
get at the original contents you will have to store a copy of them
elsewhere.
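To make the stored/unstored distinction concrete, here is a minimal toy model in plain Java. It is only a sketch: ToyDocument, addText, get and matches are hypothetical names invented for illustration and are not Lucene classes or methods. It assumes nothing beyond the behaviour described above, namely that a Reader-valued text field is searchable but has no retrievable value.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: ToyDocument is hypothetical and is NOT Lucene's Document.
public class ToyDocument {
    private final Map<String, String> stored = new HashMap<>();
    private final Map<String, Set<String>> indexed = new HashMap<>();

    // Models the String variant: tokenized, indexed and stored.
    public void addText(String name, String value) {
        stored.put(name, value);
        index(name, value);
    }

    // Models the Reader variant: tokenized and indexed, but not stored.
    public void addText(String name, Reader value) {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = value.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        index(name, text.toString());
    }

    // Very naive tokenizer: lower-case, split on non-word characters.
    private void index(String name, String text) {
        Set<String> terms = indexed.computeIfAbsent(name, k -> new HashSet<>());
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
    }

    // Returns the stored value, or null if the field was never stored.
    public String get(String name) {
        return stored.get(name);
    }

    // True if the term was indexed for this field.
    public boolean matches(String name, String term) {
        Set<String> terms = indexed.get(name);
        return terms != null && terms.contains(term.toLowerCase());
    }

    public static void main(String[] args) {
        ToyDocument viaReader = new ToyDocument();
        viaReader.addText("contents", new StringReader("Hello Lucene world"));

        ToyDocument viaString = new ToyDocument();
        viaString.addText("contents", "Hello Lucene world");

        System.out.println(viaReader.matches("contents", "lucene")); // true
        System.out.println(viaReader.get("contents"));               // null
        System.out.println(viaString.get("contents"));               // Hello Lucene world
    }
}
```

In the toy model, as in the situation described, the Reader-backed document can still be found by a search on "contents", but asking it for the field value yields null.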
The Field.Text(String name, String value) version does store the String
itself, so that's probably the origin of the confusion.

> -----Original Message-----
> From: Luke Shannon [mailto:lshannon@hypermedia.com]
> Sent: Thursday, 11 November 2004 20:17
> To: Lucene Users List
> Subject: HTMLParser.getReader returning null
>
> Hello;
>
> Things were working fine. I have been re-organizing my code to drop
> into QA when I noticed I was no longer getting search results for my
> HTML files.
> When I checked things out I confirmed I was still creating the
> Documents but realized no content was being indexed.
>
> HTMLParser parser = new HTMLParser(f);
>
> // Add the tag-stripped contents as a Reader-valued Text field so it will
> // get tokenized and indexed.
> doc.add(Field.Text("contents", parser.getReader()));
> System.out.println("The content is " + doc.get("contents"));
>
> The SOP line above outputs a null where the contents used to be.
> Anyone seen this before?
>
> Thanks,
>
> Luke
>
> ----- Original Message -----
> From: "Will Allen"
> To: "Lucene Users List"
> Sent: Thursday, November 11, 2004 1:59 PM
> Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
>
>
> Any wildcard search will automatically expand your query to the number
> of terms it finds in the index that suit the wildcard.
>
> For example:
>
> wild*, would become wild OR wilderness OR wildman etc for each of the
> terms that exist in your index.
>
> It is because of this that you quickly reach the 1024 limit of
> clauses. I automatically set it to max int with the following line:
>
> BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
>
>
> -----Original Message-----
> From: Sanyi [mailto:need4sid@yahoo.com]
> Sent: Thursday, November 11, 2004 6:46 AM
> To: lucene-user@jakarta.apache.org
> Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
>
>
> Hi!
>
> First of all, I've read about BooleanQuery$TooManyClauses, so I know
> that it has a 1024-clause limit by default, which is good enough for
> me, but I still think it works strangely.
>
> Example:
> I have an index with about 20 million documents.
> Let's say that there are about 3000 variants in the entire document
> set of this word mask: cab*
> Let's say that about 500 documents contain the word: spectrum
> Now, when I search for "cab* AND spectrum", I don't expect it to throw
> an exception.
> It should first restrict the search to the 500 documents containing
> the word "spectrum", then it should collect the variants of "cab*"
> within these documents, which turns out to be two or three variants of
> "cab*" (cable, cables, maybe some more), and the search should return,
> let's say, 10 documents.
>
> Similar example: When I search for "cab* AND nonexistingword" it still
> throws a TooManyClauses exception instead of saying "No results",
> since there is no "nonexistingword" in my document set, so it doesn't
> even have to start collecting the variations of "cab*".
>
> Is there any patch for this issue?
> Thank you for your time!
>
> Sanyi
> (I'm using: lucene 1.4.2)
>
> p.s.: Sorry for re-sending this message, I was first sending it as an
> accidental reply to a wrong thread..
>
>
>
> __________________________________
> Do you Yahoo!?
> Check out the new Yahoo! Front Page.
> www.yahoo.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
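
On the quoted TooManyClauses thread: the wildcard expansion Will Allen describes can be sketched with a toy model in plain Java. This is an illustration only. PrefixExpansion and expand are hypothetical names, the term dictionary is a plain sorted set, and an IllegalStateException stands in for the real exception. The point it tries to show is that the expansion looks only at the term dictionary, before any other clause of the query (such as "spectrum") is considered, which would explain why "cab* AND nonexistingword" can still hit the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model only: PrefixExpansion is hypothetical and uses no Lucene classes.
public class PrefixExpansion {

    // Expands "prefix*" into one clause per dictionary term with that prefix.
    // Throws once the number of clauses would exceed maxClauses.
    public static List<String> expand(SortedSet<String> termDictionary,
                                      String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // In a sorted set, all terms sharing the prefix are contiguous
        // and start at the first term >= prefix.
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the range of matching terms
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException(
                        "too many clauses, limit is " + maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<>(Arrays.asList(
                "cab", "cable", "cables", "wild", "wilderness", "wildman", "window"));

        // Prints: wild OR wilderness OR wildman
        System.out.println(String.join(" OR ", expand(dictionary, "wild", 1024)));

        try {
            // Three terms match "cab*", but only two clauses are allowed.
            expand(dictionary, "cab", 2);
        } catch (IllegalStateException e) {
            System.out.println("expansion failed: " + e.getMessage());
        }
    }
}
```

Raising the limit, as in the quoted BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE ) line, removes the exception but not the cost of the expansion itself: every matching term still becomes a clause.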