Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Received-SPF: pass (hermes.apache.org: local policy)
Content-Class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="US-ASCII"
Content-Transfer-Encoding: quoted-printable
Subject: RE: HTMLParser.getReader returning null
Date: Fri, 12 Nov 2004 11:41:34 +0100
Message-ID: <950FF7DE40C2B64CAF80B564732927E10AB33A@exchange.be.bvd>
Thread-Topic: HTMLParser.getReader returning null
Thread-Index: AcTIIwaEQqgarKmQTZaVg9CgTH7k1gAeoSZA
From: "Vanlerberghe, Luc" <Luc.Vanlerberghe@bvdep.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>

If you use the Field.Text(String name, Reader value) version of the
Field.Text factory method, the field is tokenized and indexed but *not*
stored.  This means you will be able to search and find that document,
but doc.get("contents") gives you null, and to know the original contents
you will have to store a copy of them elsewhere.

The Field.Text(String name, String value) version does store the
document String itself, so that's probably the origin of the confusion.
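
If you do want the text back from doc.get("contents"), one option is to
read the Reader into a String first and pass that String to the
Field.Text(String name, String value) version.  Below is a minimal
sketch of that idea, not tested against your setup.  The class name
ReaderToString and the helper readAll are made up for illustration, and a
StringReader stands in for parser.getReader() so that the example runs
without Lucene on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReaderToString {

    /** Reads everything from the given Reader and returns it as one String. */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read() returns -1 once the end of the stream is reached
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in the real code this would be parser.getReader().
        Reader reader = new StringReader("tag-stripped page text");
        String contents = readAll(reader);

        // With Lucene on the classpath one could then call
        //   doc.add(Field.Text("contents", contents));
        // so the field is stored as well as tokenized and indexed.
        System.out.println("The content is " + contents);
    }
}
```

Keep in mind that storing the full text of every page makes the index
larger, so this only makes sense if you really need the contents back
from the index.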

> -----Original Message-----
> From: Luke Shannon [mailto:lshannon@hypermedia.com]
> Sent: Thursday, November 11, 2004 20:17
> To: Lucene Users List
> Subject: HTMLParser.getReader returning null
>
> Hello;
>
> Things were working fine. I have been re-organizing my code
> to drop into QA when I noticed I was no longer getting search
> results for my HTML files.
> When I checked things out I confirmed I was still creating
> the Documents but realized no content was being indexed.
>
>     HTMLParser parser = new HTMLParser(f);
>
>     // Add the tag-stripped contents as a Reader-valued Text field so it will
>     // get tokenized and indexed.
>     doc.add(Field.Text("contents", parser.getReader()));
>     System.out.println("The content is " + doc.get("contents"));
>
> The System.out.println line above outputs a null where the contents
> used to be. Has anyone seen this before?
>
> Thanks,
>
> Luke
>
> ----- Original Message -----
> From: "Will Allen" <wallen@Cyveillance.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Thursday, November 11, 2004 1:59 PM
> Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
>
>
> Any wildcard search will automatically expand your query into one clause
> for each term it finds in the index that matches the wildcard.
>
> For example:
>
> wild* would become wild OR wilderness OR wildman etc., with one clause for
> each of the matching terms that exist in your index.
>
> Because of this, you quickly reach the limit of 1024 clauses.  I
> automatically set it to max int with the following line:
>
> BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
>
>
> -----Original Message-----
> From: Sanyi [mailto:need4sid@yahoo.com]
> Sent: Thursday, November 11, 2004 6:46 AM
> To: lucene-user@jakarta.apache.org
> Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
>
>
> Hi!
>
> First of all, I've read about BooleanQuery$TooManyClauses, so I know that
> it has a limit of 1024 clauses by default, which is good enough for me,
> but I still think it works strangely.
>
> Example:
> I have an index with about 20 million documents.
> Let's say that there are about 3000 variants of this word mask in the
> entire document set: cab*
> Let's say that about 500 documents contain the word: spectrum
> Now, when I search for "cab* AND spectrum", I don't expect it to throw an
> exception.
> It should first restrict the search to the 500 documents containing the
> word "spectrum", then it should collect the variants of "cab*" within
> these documents, which turns out to be two or three variants of "cab*"
> (cable, cables, maybe some more), and the search should return, let's
> say, 10 documents.
>
> Similar example: When I search for "cab* AND nonexistingword" it still
> throws a TooManyClauses exception instead of saying "No results", since
> there is no "nonexistingword" in my document set, so it doesn't even have
> to start collecting the variations of "cab*".
>
> Is there any patch for this issue?
> Thank you for your time!
>
> Sanyi
> (I'm using: lucene 1.4.2)
>
> p.s.: Sorry for re-sending this message, I was first sending it as an
> accidental reply to a wrong thread..
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org