lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Craig Walls <wal...@michaels.com>
Subject Re: HTML Analyzer?
Date Thu, 14 Nov 2002 21:18:17 GMT
Oh wait...I just re-read your original post and apparently I misunderstood (I had
just found a hammer and at a glance
your problem looked like a nail). Sorry about that.

But it's not entirely unrelated...You *could* write your own Analyzer that uses
the HTMLEditorKit stuff to only index
the text of your HTML documents and not the tags. I'm not sure if that's exactly
what you want, but it's an option.

Regarding the HTMLEditorKit and related stuff, it's not that hard, actually. It's
not entirely unlike using a SAX parser to
parse XML documents. The gist of it is as follows:

     First you create a class that extends HTMLEditorKit.ParserCallback. By
default all of the "handleXXX()"
     methods are empty implementations, so override the ones you care about. In
my case, I only needed to read
     comments and text, so I overrode the handleComment() and handleText()
methods.
     Now, in the code that will use all of this, I started by opening a
java.io.FileReader on my HTML document.
     Next, I created an instance of ParserDelegator (I've not delved into the API
code yet to know why I had to do
     this).
     Then I called the parse() method of my ParserDelegator, passing it the
reader, an instnace of my ParserCallback
     class, and a boolean true.
     At that time, the parser took over and started making calls back to the
handleXXX() methods that I wrote.


"Lichty, Kent" wrote:

  Well, let me know if you figure it out and I will do the same.  I don't
  quite understand how those classes would help out.  Would you somehow use
  them to create the Reader object that is passed to create the TokenStream
  object?

"Lichty, Kent" wrote:

> Well, let me know if you figure it out and I will do the same.  I don't
> quite understand how those classes would help out.  Would you somehow use
> them to create the Reader object that is passed to create the TokenStream
> object?
>
> -----Original Message-----
> From: Craig Walls [mailto:wallsc@michaels.com]
> Sent: Thursday, November 14, 2002 2:39 PM
> To: Lucene Users List
> Subject: Re: HTML Analyzer?
>
> Ironically, I just had to solve this exact problem just 10 minutes ago...
>
> Check into javax.swing.text.html.HTMLEditorKit and
> javax.swing.text.html.HTMLDocument. Here's a URL that I found helpful (the
> site
> is Japanese, but the source code is still Java):
>
> http://java-house.jp/ml/archive/j-h-b/037727.html?#_body
>
> "Lichty, Kent" wrote:
>
> > We have a web application that builds pages "on the fly" by reading
> directly
> > from a database. The database contains both normal content and HTML.  We
> use
> > Lucene as our search engine, but I need to figure out how to cause it to
> NOT
> > include content that is within HTML tags. I assume that this entails the
> > creation of a custom Analyzer.  Are there any existing Analyzers already
> out
> > there that work like this? Thanks!
> >
> > ----------  Internet E-mail Confidentiality Disclaimer  ----------
> >
> > PRIVILEGED / CONFIDENTIAL INFORMATION may be contained in this message.
> If
> > you are not the addressee indicated in this message or the employee or
> agent
> > responsible for delivering it to the addressee, you are hereby on notice
> > that you are in possession of confidential and privileged information.
> Any
> > dissemination, distribution, or copying of this e-mail is strictly
> > prohibited.  In such case, you should destroy this message and kindly
> notify
> > the sender by reply e-mail.  Please advise immediately if you or your
> > employer do not consent to Internet email for messages of this kind.
> >
> > Opinions, conclusions, and other information in this message that do not
> > relate to the official business of my firm shall be understood as neither
> > given nor endorsed by it.
> >
> > --
> > To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> > For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>
> --
> To unsubscribe, e-mail:
> <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail:
> <mailto:lucene-user-help@jakarta.apache.org>
>
> ----------  Internet E-mail Confidentiality Disclaimer  ----------
>
> PRIVILEGED / CONFIDENTIAL INFORMATION may be contained in this message.  If
> you are not the addressee indicated in this message or the employee or agent
> responsible for delivering it to the addressee, you are hereby on notice
> that you are in possession of confidential and privileged information.  Any
> dissemination, distribution, or copying of this e-mail is strictly
> prohibited.  In such case, you should destroy this message and kindly notify
> the sender by reply e-mail.  Please advise immediately if you or your
> employer do not consent to Internet email for messages of this kind.
>
> Opinions, conclusions, and other information in this message that do not
> relate to the official business of my firm shall be understood as neither
> given nor endorsed by it.
>
> --
> To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


Mime
View raw message