lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter W." <pe...@marketingbrokers.com>
Subject Re: How to not tokenize HTML tag from input string
Date Thu, 08 Feb 2007 19:07:47 GMT
Hello,

Using a parser to get text out of HTML, XML (including RSS, ATOM) is  
only
easy if you control the source documents.

HTML pages in the wild are much different, generating exceptions you  
must
catch and deal with. For most projects you can probably use  
java.util.regex
to obtain keywords.

Clean-up extracted text first then send to Lucene.

Regards,

Peter



On Feb 8, 2007, at 12:34 AM, Chris Hostetter wrote:

>
> Solr has an HTMLStripReader used by an two different tokenizers for  
> doing
> the basics of ignoring tags when reading text ... it has one known bug
> when dealing with highlighting...
>
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ 
> HTMLStripReader.html
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ 
> HTMLStripStandardTokenizerFactory.html
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ 
> HTMLStripWhitespaceTokenizerFactory.html
> http://issues.apache.org/jira/browse/SOLR-42
>
>
> : Date: Wed, 7 Feb 2007 17:04:54 -0800 (PST)
> : From: Joe Tang <joe.tang@workmetro.com>
> : Reply-To: java-user@lucene.apache.org
> : To: java-user@lucene.apache.org
> : Subject: How to not tokenize HTML tag from input string
> :
> :
> : My work is to index keywords with a document. In my case, the  
> document is
> : made up with HTML tags which i don't want to index them.
> :
> : For example:
> : Input Document:
> : <div id="tp-wrapper">
> : <span id="tp-top-right">You are welcome</span>
> : <div id="tp-tab">
> : <h1>Testing text</h1>
> : </div>
> : </div>
> :
> : Expected Keywords:
> : keywords:You
> : keywords:are
> : keywords:welcome
> : keywords:Testing
> : keywords:text
> :
> : Is there anyway I can make them not to be one of the keywords?
> : --
> : View this message in context: http://www.nabble.com/How-to-not- 
> tokenize-HTML-tag-from-input-string-tf3190778.html#a8857789
> : Sent from the Lucene - Java Users mailing list archive at  
> Nabble.com.
> :
> :
> :  
> ---------------------------------------------------------------------
> : To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> : For additional commands, e-mail: java-user-help@lucene.apache.org
> :
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message