lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Liao Xuefeng" <askxuef...@gmail.com>
Subject RE: HTML text extraction
Date Thu, 22 Jun 2006 23:54:29 GMT
 hi, all,
  I wrote my own html parser because it just meets my require and do not
depend on 3rd part's lib. and i'd like to share it (in attachment).

  This class provides some static methods to do html <-> text convertion:

  HtmlUtil.html2text(String html);
  HtmlUtil.text2html(String text);

and 
  HtmlUtil.removeScriptTags(String html);
can remove script and activex tags in html, this is use to check user's blog
post before writing into database.

Best regards,
  Xuefeng

http://www.crackj2ee.com

-----Original Message-----
From: Michael Wechner [mailto:michael.wechner@wyona.com] 
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John Wang wrote:
> Hi Xuefeng:
>
>     Can you please send me your htmlparser too?

Xuefeng, would it be possible to open source your parser?

Thanks

Michi
>
> thanks
>
> -John
>
> On 6/21/06, Daniel Noll <daniel@nuix.com.au> wrote:
>>
>> Simon Courtenage wrote:
>> > I also use htmlparser, which is rather good.  I've had to customize
>> it,
>> > though, to parse strings containing html source rather than accept 
>> > urls of resources to fetch etc.
>> Also it
>> > crashes on meta tags that don't have name attributes (something I 
>> > discovered only a couple of days ago).
>>
>> Actually, it already accepts strings without modifying the library:
>>
>>     String htmlSource = "<html>...</html>";
>>     Parser parser = new Parser(new Lexer(htmlSource));
>>
>> I will have to watch out for those meta tags though.  Time to go test 
>> it.
>>
>> Daniel
>>
>>
>> --
>> Daniel Noll
>>
>> Nuix Pty Ltd
>> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
>> Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902
>>
>> This message is intended only for the named recipient. If you are not 
>> the intended recipient you are notified that disclosing, copying, 
>> distributing or taking any action in reliance on the contents of this 
>> message or attachment is strictly prohibited.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message