lucene-java-user mailing list archives

From "Luke Shannon" <>
Subject HTMLParser.getReader returning null
Date Thu, 11 Nov 2004 19:17:13 GMT

Things were working fine. I have been re-organizing my code to drop into QA
when I noticed I was no longer getting search results for my HTML files.
When I checked things out I confirmed I was still creating the Documents but
realized no content was being indexed.

    HTMLParser parser = new HTMLParser(f);

    // Add the tag-stripped contents as a Reader-valued Text field so it
    // gets tokenized and indexed.
    doc.add(Field.Text("contents", parser.getReader()));
    System.out.println("The content is " + doc.get("contents"));

The System.out.println line above prints null where the contents used to be.
Has anyone seen this before?



----- Original Message ----- 
From: "Will Allen" <>
To: "Lucene Users List" <>
Sent: Thursday, November 11, 2004 1:59 PM
Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses

Any wildcard search is automatically expanded into one clause for each term
in the index that matches the wildcard.

For example:

wild* would become wild OR wilderness OR wildman, etc., with one clause for
each of the matching terms that exist in your index.

This is why you quickly reach the default limit of 1024 clauses.  I raise
the limit to the maximum int value with the following line:

BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
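
The expansion described above can be illustrated without Lucene. The following is a minimal plain-Java sketch, not Lucene's actual implementation: the class name, the expand method, and the in-memory sorted term set are all hypothetical stand-ins for the index's term dictionary. It shows how a prefix turns into one clause per matching term, and why a clause limit is hit before any documents are looked at.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Lucene code.
final class PrefixExpansionSketch {
    // Same value as the default limit mentioned in this thread.
    static final int DEFAULT_MAX_CLAUSES = 1024;

    /**
     * Collects every indexed term that starts with the prefix, one clause
     * per term, and fails as soon as the clause limit would be exceeded.
     */
    static List<String> expand(SortedSet<String> indexedTerms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // In sorted order, all terms sharing the prefix are adjacent,
        // starting at the first term >= prefix.
        for (String term : indexedTerms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (clauses.size() >= maxClauses) {
                throw new IllegalStateException("too many clauses: limit is " + maxClauses);
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms =
                new TreeSet<>(Arrays.asList("wild", "wilderness", "wildman", "window"));
        // Prints: wild OR wilderness OR wildman
        System.out.println(String.join(" OR ", expand(terms, "wild", DEFAULT_MAX_CLAUSES)));
    }
}
```

Note that raising the limit to Integer.MAX_VALUE removes the guard entirely, so a very broad wildcard can then expand into a correspondingly large query.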

-----Original Message-----
From: Sanyi []
Sent: Thursday, November 11, 2004 6:46 AM
Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses


First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
has a 1024-clause limit by default, which is good enough for me, but I still
think it works

I have an index with about 20 million documents.
Let's say that there are about 3000 variants in the entire document set of
this word mask: cab*
Let's say that about 500 documents contain the word: spectrum
Now, when I search for "cab* AND spectrum", I don't expect it to throw an
exception.
It should first restrict the search to the 500 documents containing the
word "spectrum", then it should collect the variants of "cab*" within these
documents, which turn out to be two or three variants of "cab*" (cable,
cables, maybe some more), and the search should return let's say 10

Similar example: when I search for "cab* AND nonexistingword" it still
throws a TooManyClauses exception instead of saying "No results". Since
there is no "nonexistingword" in my document set, it doesn't even have to
start collecting the variations of "cab*".
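
The evaluation order proposed above can be sketched in plain Java. This only illustrates the idea of restricting by the selective term first and expanding the prefix afterwards; it is not how Lucene 1.4.2 evaluates the query, and the class name, method name, and in-memory document map are hypothetical. A real engine would work from posting lists rather than scanning documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the proposed order; this is not Lucene code.
final class RestrictThenExpandSketch {
    /**
     * Returns the distinct terms starting with the prefix, looking only at
     * documents that also contain requiredTerm.
     */
    static Set<String> variantsWithin(Map<Integer, Set<String>> docs,
                                      String requiredTerm, String prefix) {
        Set<String> variants = new TreeSet<>();
        for (Set<String> terms : docs.values()) {
            if (!terms.contains(requiredTerm)) {
                continue; // document cannot match the AND query, so skip it
            }
            for (String term : terms) {
                if (term.startsWith(prefix)) {
                    variants.add(term);
                }
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> docs = new LinkedHashMap<>();
        docs.put(1, new HashSet<>(Arrays.asList("spectrum", "cable")));
        docs.put(2, new HashSet<>(Arrays.asList("spectrum", "cables")));
        docs.put(3, new HashSet<>(Arrays.asList("cabbage", "cabin")));
        System.out.println(variantsWithin(docs, "spectrum", "cab"));        // [cable, cables]
        System.out.println(variantsWithin(docs, "nonexistingword", "cab")); // []
    }
}
```

With this order, only the variants that occur alongside "spectrum" are collected, and a term that occurs nowhere yields an empty result without any expansion.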

Is there any patch for this issue?
Thank you for your time!

(I'm using: lucene 1.4.2)

P.S.: Sorry for re-sending this message; I first sent it by accident as a
reply to the wrong thread.

To unsubscribe, e-mail:
For additional commands, e-mail:
