nutch-user mailing list archives

From a a <mbel...@msn.com>
Subject RE: parse-html plugin
Date Wed, 02 Feb 2011 03:28:30 GMT

I'd like to know whether anyone has done this before. Maybe they could tell us whether it
takes more time (double the time) when another HtmlParseFilter is used to overwrite the
original ParseResult object produced by the parse-html plugin.

Thanks,
mehdi

> From: markus.jelsma@openindex.io
> To: user@nutch.apache.org
> Subject: Re: parse-html plugin
> Date: Wed, 2 Feb 2011 02:46:47 +0100
> CC: ab1sh3k@gmail.com
> 
> Oh well, please come back with your experience and results on this issue in 
> this thread. More users will benefit =)
> 
> > I am sorry, forgive my ignorance. I got the answer for it :) Thanks for
> > your time
> > 
> > On Wed, Feb 2, 2011 at 9:28 AM, .: Abhishek :. <ab1sh3k@gmail.com> wrote:
> > > Hi,
> > > 
> > >  Just wondering what does the dumpText mean in the ParseChecker?
> > >  
> > > >  On the same grounds, in case I am writing a custom filter that extends
> > > >  the HtmlParseFilter, do I have to make any configuration changes for Nutch?
> > > 
> > > Thanks,
> > > Abi
> > > 
> > > > On Wed, Feb 2, 2011 at 2:04 AM, Markus Jelsma <markus.jelsma@openindex.io> wrote:
> > > >> I'm not really sure but i believe you must overwrite the already parsed
> > > >> data yourself in your filter.
> > >> 
> > >> On Tuesday 01 February 2011 18:54:32 a a wrote:
> > >> > Thx for your reply :)
> > >> > 
> > > >> > So if I extend org.apache.nutch.parse.HtmlParseFilter, is it going
> > > >> > to overwrite the ParseResult variable of the original parse-html
> > > >> > plugin?
> > > >> > 
> > > >> > Is it not going to spend more time extracting the HTML source code
> > > >> > of each URL twice in order to parse it (the first time in the original
> > > >> > parse-html plugin and the second time in my new plugin)?
> > >> > 
> > >> > thx a lot
> > >> > 
> > >> > mehdi
> > >> > 
> > >> > > From: markus.jelsma@openindex.io
> > >> > > To: user@nutch.apache.org
> > >> > > Subject: Re: parse-html plugin
> > >> > > Date: Tue, 1 Feb 2011 18:42:51 +0100
> > >> > > CC: mbellil@msn.com
> > >> > > 
> > > >> > > Oh, I forgot. You could extend
> > > >> > > org.apache.nutch.parse.HtmlParseFilter. Then you can retrieve
> > > >> > > whatever you need and store it in the ParseResult object.
> > >> > > 
> > >> > > On Tuesday 01 February 2011 15:25:20 a a wrote:
> > >> > > > hi,
> > >> > > > 
> > > >> > > > Is my question that difficult?
> > > >> > > > Does no one have an idea?
> > >> > > > 
> > >> > > > thx
> > >> > > > 
> > >> > > > 
> > >> > > > mehdi
> > >> > > > 
> > >> > > > > From: mbellil@msn.com
> > >> > > > > To: user@nutch.apache.org
> > >> > > > > Subject: RE: parse-html plugin
> > >> > > > > Date: Mon, 31 Jan 2011 16:05:22 +0000
> > >> > > > > 
> > >> > > > > 
> > >> > > > > Hi All,
> > >> > > > > 
> > >> > > > > any  idea ?
> > >> > > > > 
> > >> > > > > 
> > >> > > > > 
> > >> > > > > mehdi
> > >> > > > > 
> > >> > > > > > From: mbellil@msn.com
> > >> > > > > > To: user@nutch.apache.org
> > >> > > > > > Subject: parse-html plugin
> > >> > > > > > Date: Thu, 27 Jan 2011 18:58:36 +0000
> > >> > > > > > 
> > >> > > > > > 
> > >> > > > > > hi,
> > > >> > > > > > In the class HtmlParser I changed the 'text' variable to index only
> > > >> > > > > > a part of my HTML page, and since then I have lost a lot of outlinks!
> > >> > > > > > 
> > >> > > > > > ...
> > >> > > > > > 
> > > >> > > > > >  utils.getText(sb, extractIndexableContent(root)); // added on 26-01-2011 to extract only text inside <col_centre>
> > > >> > > > > >  // utils.getText(sb, root);          // extract text --- disabled on 26-01-2011
> > > >> > > > > >  text = sb.toString();
> > >> > > > > > 
> > >> > > > > > ...
> > >> > > > > > 
> > > >> > > > > > I believed that outlinks are not obtained from the text variable?!
> > > >> > > > > > In the same class we can see how outlinks are extracted:
> > >> > > > > > 
> > > >> > > > > > ArrayList<Outlink> l = new ArrayList<Outlink>();   // extract outlinks
> > > >> > > > > > URL baseTag = utils.getBase(root);
> > > >> > > > > > if (LOG.isTraceEnabled()) { LOG.trace("Getting links..."); }
> > > >> > > > > > utils.getOutlinks(baseTag!=null?baseTag:base, l, root);
> > > >> > > > > > outlinks = l.toArray(new Outlink[l.size()]);
> > >> > > > > > 
> > > >> > > > > > Can you please tell me what I did wrong?
> > >> > > > > > 
> > >> > > > > > 
> > >> > > > > > mehdi
> > >> 
> > >> --
> > >> Markus Jelsma - CTO - Openindex
> > >> http://www.linkedin.com/in/markus17
> > >> 050-8536620 / 06-50258350
 		 	   		  