lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Wechner <michael.wech...@wyona.com>
Subject Re: HTML text extraction
Date Thu, 22 Jun 2006 15:29:42 GMT
John Wang wrote:
> Hi Xuefeng:
>
>     Can you please send me your htmlparser too?

Xuefeng, would it be possible to open source your parser?

Thanks

Michi
>
> thanks
>
> -John
>
> On 6/21/06, Daniel Noll <daniel@nuix.com.au> wrote:
>>
>> Simon Courtenage wrote:
>> > I also use htmlparser, which is rather good.  I've had to customize 
>> it,
>> > though, to parse strings containing
>> > html source rather than accept urls of resources to fetch etc.  
>> Also it
>> > crashes on meta tags that don't have
>> > name attributes (something I discovered only a couple of days ago).
>>
>> Actually, it already accepts strings without modifying the library:
>>
>>     String htmlSource = "<html>...</html>";
>>     Parser parser = new Parser(new Lexer(htmlSource));
>>
>> I will have to watch out for those meta tags though.  Time to go test 
>> it.
>>
>> Daniel
>>
>>
>> -- 
>> Daniel Noll
>>
>> Nuix Pty Ltd
>> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
>> Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902
>>
>> This message is intended only for the named recipient. If you are not
>> the intended recipient you are notified that disclosing, copying,
>> distributing or taking any action in reliance on the contents of this
>> message or attachment is strictly prohibited.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message