lucene-java-user mailing list archives

From "Luke Shannon" <lshan...@hypermedia.com>
Subject Re: HTMLParser.getReader returning null
Date Fri, 12 Nov 2004 14:51:16 GMT
Hi;

I am using the HTMLParser that comes with the latest version of Lucene (in
the demo).

Here is the import line:

import org.apache.lucene.demo.html.HTMLParser;

If you have lucene-demos-1.4-final.jar on your classpath, the system will
find the HTMLParser class.

I am happy with the results.

Let me know if you need anything else.

L


----- Original Message ----- 
From: "sergiu gordea" <gsergiu@ifit.uni-klu.ac.at>
To: <lshannon@hypermedia.com>
Sent: Friday, November 12, 2004 3:39 AM
Subject: Re: HTMLParser.getReader returning null


> Luke Shannon wrote:
>
>  Hi,
>
>  May I ask which library you are using for parsing HTML pages?
>   I need to index HTML pages and I want to use a good parser to
> eliminate the HTML tags.
>   Can you recommend a simple parser that has a demo?
>
>  Thanks,
>
>   Sergiu
>
> >Hello;
> >
> >Things were working fine. I have been re-organizing my code to drop into QA
> >when I noticed I was no longer getting search results for my HTML files.
> >When I checked things out I confirmed I was still creating the Documents but
> >realized no content was being indexed.
> >
> > HTMLParser parser = new HTMLParser(f);
> >
> >    // Add the tag-stripped contents as a Reader-valued Text field so it will
> >    // get tokenized and indexed.
> >    doc.add(Field.Text("contents", parser.getReader()));
> >    System.out.println("The content is " + doc.get("contents"));
> >
> >The System.out.println line above prints null where the contents used to
> >be. Has anyone seen this before?
> >
> >Thanks,
> >
> >Luke
> >
> >----- Original Message ----- 
> >From: "Will Allen" <wallen@Cyveillance.com>
> >To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> >Sent: Thursday, November 11, 2004 1:59 PM
> >Subject: RE: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >
> >Any wildcard search will automatically expand your query to the number of
> >terms it finds in the index that match the wildcard.
> >
> >For example:
> >
> >wild* would become wild OR wilderness OR wildman, etc., with one clause for
> >each matching term that exists in your index.
> >
> >Because of this, you quickly reach the default limit of 1024 clauses. I
> >set it to the maximum int value with the following line:
> >
> >BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
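
To make the expansion described above concrete, here is a minimal sketch in
plain Java. It does not use Lucene: a sorted TreeSet stands in for the index's
term dictionary, and the names PrefixExpansionSketch, expand and exceedsLimit
are hypothetical, chosen for illustration only. It shows how one prefix turns
into one clause per matching term, and how that count can pass a limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model only: a sorted set of strings stands in for the term dictionary.
final class PrefixExpansionSketch {

    // Collect every term that starts with the prefix.
    // Each returned term corresponds to one OR clause in the expanded query.
    static List<String> expand(TreeSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so we are past the prefix range
            }
            matches.add(term);
        }
        return matches;
    }

    // Hypothetical check that mirrors the idea of a maximum clause count.
    static boolean exceedsLimit(TreeSet<String> terms, String prefix, int maxClauses) {
        return expand(terms, prefix).size() > maxClauses;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList("tame", "wild", "wilderness", "wildman", "willow"));
        System.out.println(expand(terms, "wild"));
        System.out.println(exceedsLimit(terms, "wild", 2));
    }
}
```

With the five sample terms, wild* expands to three clauses, so a limit of 2
would be exceeded while a limit of 1024 would not.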
> >
> >
> >-----Original Message-----
> >From: Sanyi [mailto:need4sid@yahoo.com]
> >Sent: Thursday, November 11, 2004 6:46 AM
> >To: lucene-user@jakarta.apache.org
> >Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >
> >Hi!
> >
> >First of all, I've read about BooleanQuery$TooManyClauses, so I know that it
> >has a limit of 1024 clauses by default, which is good enough for me, but I
> >still think it behaves strangely.
> >
> >Example:
> >I have an index with about 20 million documents.
> >Let's say that there are about 3000 variants in the entire document set of
> >this word mask: cab*
> >Let's say that about 500 documents contain the word: spectrum
> >Now, when I search for "cab* AND spectrum", I don't expect it to throw an
> >exception.
> >It should first restrict the search to the 500 documents containing the
> >word "spectrum", then it should collect the variants of "cab*" within these
> >documents, which turns out to be two or three variants of "cab*" (cable,
> >cables, maybe some more), and the search should return, let's say, 10
> >documents.
> >
> >Similar example: When I search for "cab* AND nonexistingword" it still
> >throws a TooManyClauses
> >exception instead of saying "No results", since there is no
> >"nonexistingword" in my document set,
> >so it doesn't even have to start collecting the variations of "cab*".
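
The behaviour asked for above could be sketched roughly as follows. This is a
toy model in plain Java, not how Lucene 1.4.2 rewrites queries: a TreeMap from
term to document ids stands in for the index, and RestrictFirstSketch and
variantsWithin are hypothetical names. Note that it still walks every term
matching the prefix, so it only shows that few variants survive the
restriction, not that the search gets cheaper.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: term -> ids of the documents that contain it.
final class RestrictFirstSketch {

    // Keep only the prefix variants that occur in a document also
    // containing the required term.
    static TreeSet<String> variantsWithin(TreeMap<String, Set<Integer>> postings,
                                          String prefix, String required) {
        Set<Integer> allowed =
                postings.getOrDefault(required, Collections.<Integer>emptySet());
        TreeSet<String> variants = new TreeSet<>();
        if (allowed.isEmpty()) {
            return variants; // required term is absent: nothing to expand
        }
        for (Map.Entry<String, Set<Integer>> e : postings.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted keys: past the prefix range
            }
            if (!Collections.disjoint(e.getValue(), allowed)) {
                variants.add(e.getKey());
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        TreeMap<String, Set<Integer>> postings = new TreeMap<>();
        postings.put("cab", new HashSet<>(Arrays.asList(1)));
        postings.put("cabin", new HashSet<>(Arrays.asList(2, 3)));
        postings.put("cable", new HashSet<>(Arrays.asList(4, 5)));
        postings.put("cables", new HashSet<>(Arrays.asList(5, 6)));
        postings.put("spectrum", new HashSet<>(Arrays.asList(4, 5, 9)));
        System.out.println(variantsWithin(postings, "cab", "spectrum"));
        System.out.println(variantsWithin(postings, "cab", "nonexistingword"));
    }
}
```

In this sample data only cable and cables share a document with spectrum, and
a required term that is missing from the index yields no variants at all.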
> >
> >Is there any patch for this issue?
> >Thank you for your time!
> >
> >Sanyi
> >(I'm using: lucene 1.4.2)
> >
> >p.s.: Sorry for re-sending this message, I was first sending it as an
> >accidental reply to a wrong thread..
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>