lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject Re: which HTML parser is better? - Thread closed
Date Thu, 03 Feb 2005 10:20:28 GMT
Thank you, I will do that.

> Karl Koch wrote:
> 
> >I apologise in advance if some of what I write here has been said before.
> >The last three answers to my question have suggested pattern matching
> >solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
> >is something I cannot use, since I work with Java 1.1 on a PDA.
> >  
> >
> I see,
> 
> In this case you can read your HTML file line by line and then write 
> something like this:
> 
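> String line;
> int startPos, endPos;
> StringBuffer text = new StringBuffer();
> // reader is assumed to be a BufferedReader over the HTML file
> while ((line = reader.readLine()) != null) {
>     startPos = line.indexOf(">");              // end of the first tag
>     endPos = line.indexOf("<", startPos + 1);  // start of the next tag
>     if (startPos >= 0 && endPos > startPos)
>         // skip the '>' itself and copy only the text between the tags
>         text.append(line.substring(startPos + 1, endPos));
> }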
> 
> This is just sample code that should work if you have just one tag per 
> line in the HTML file.
> It can be a starting point for you.
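
Where a line holds several tags, or a tag spans more than one line, a character-by-character scan is one possible variation on the loop above. The following is only a minimal sketch under stated assumptions: the class name TagStripper and the method stripTags are hypothetical, and it sticks to java.io and java.lang classes that should be available on a Java 1.1 runtime. It does not decode entities such as &amp;, and it does not treat comments or script blocks specially.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of Lucene or the JDK.
class TagStripper {

    // Copies every character that lies outside <...> into the result.
    static String stripTags(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuffer text = new StringBuffer();
        boolean inTag = false; // kept across lines so multi-line tags work
        String line;
        while ((line = reader.readLine()) != null) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == '<') {
                    inTag = true;
                } else if (c == '>') {
                    inTag = false;
                } else if (!inTag) {
                    text.append(c);
                }
            }
            if (!inTag) {
                text.append(' '); // keep words on adjacent lines apart
            }
        }
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>Hello <b>Lucene</b></p>\n<a\n href=\"x\">users</a>";
        System.out.println(stripTags(new StringReader(html)));
    }
}
```

The inTag flag is the only state carried between lines, so memory use stays small, which may matter on a PDA.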
> 
>   Hope it helps,
> 
>  Best,
> 
>  Sergiu
> 
> >I am wondering if somebody knows of a simple piece of source code with low
> >requirements that runs under this tight specification.
> >
> >Thank you all,
> >Karl
> >
> >  
> >
> >>No one has yet mentioned using ParserDelegator and ParserCallback that 
> >>are part of HTMLEditorKit in Swing.  I have been successfully using 
> >>these classes to parse out the text of an HTML file.  You just need to 
> >>extend HTMLEditorKit.ParserCallback and override the various methods 
> >>that are called when different tags are encountered.
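
The ParserCallback approach described above might look roughly like the sketch below. The class name SwingTextExtractor and the method extract are hypothetical. Only handleText is overridden here, and the text runs are joined with single spaces as a simplifying assumption. It needs Swing, so it targets a desktop JVM rather than a Java 1.1 device.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; collects the text content of an HTML document.
class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuffer text = new StringBuffer();

    // Called by the parser for each run of character data it finds.
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' '); // assumption: separate runs with one space
        }
        text.append(data);
    }

    static String extract(Reader in) throws IOException {
        SwingTextExtractor callback = new SwingTextExtractor();
        // true = ignore any charset declared inside the document
        new ParserDelegator().parse(in, callback, true);
        return callback.text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Hello <b>Lucene</b> users</p></body></html>";
        System.out.println(extract(new StringReader(html)));
    }
}
```

Other callbacks such as handleStartTag and handleEndTag can be overridden in the same way if the tag structure matters.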
> >>
> >>
> >>On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
> >>
> >>    
> >>
> >>>Three HTML parsers (Lucene web application
> >>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
> >>>Lucene FAQ
> >>>1.3.27. Which is the best? Can it filter tags that are
> >>>auto-created by MS Word's 'Save As HTML files' function?
> >>>      
> >>>
> >>-- 
> >>Bill Tschumy
> >>Otherwise -- Austin, TX
> >>http://www.otherwise.com
> >>
> >>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> >>    
> >>
> >
> >  
> >
> 


