nutch-dev mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject FW: Fetcher, ParseText, ParseData - need to modify
Date Mon, 15 Aug 2005 17:26:48 GMT
1. This is part of ParseText:
Any Accessories Backup Devices & Media Barebone Systems Camcorder
Accessories Camcorders Cases & External Enclosures CD / DVD Drives &
Media Cooling Devices Digital Camera Accessories Digital Cameras

- this is the content of a dropdown, i.e. the <option> elements in the HTML


2. ParseText also contains some sub-text that seems to be anchor text; I
checked this by comparing it visually with the web page.


-----Original Message-----
From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: Monday, August 15, 2005 1:20 PM
To: nutch-dev@lucene.apache.org
Subject: Fetcher, ParseText, ParseData - need to modify


I just captured some output from Fetcher.FetcherThread.outputPage(.) and
noticed that some anchor text ends up in the text, and the content of some
<option> tags ends up in the text too.
          LOG.info("ParseText = " + text);
          LOG.info("ParseData = " + parseData);

I'd like to modify this behaviour: ParseText should contain only the subset
of the text that I need, and ParseData should contain all anchors.
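
To make the intended split concrete, here is a minimal standalone sketch in
plain Java. It does not use any Nutch classes: TextAnchorSplitter, Result and
split are hypothetical names invented for this illustration, and the regular
expressions are a deliberate simplification (a real HTML parser would be more
robust, and entities are not decoded). It only shows the idea of dropping
dropdown content, collecting anchor text separately, and keeping the rest as
text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: separates page text from anchor text.
class TextAnchorSplitter {
    // A whole <select>...</select> block, including its <option> entries.
    private static final Pattern SELECT =
        Pattern.compile("(?is)<select\\b.*?</select\\s*>");
    // An <a ...>...</a> element; group 1 is the anchor's inner content.
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    static final class Result {
        final String text;          // text without dropdowns and anchors
        final List<String> anchors; // anchor texts, in document order

        Result(String text, List<String> anchors) {
            this.text = text;
            this.anchors = anchors;
        }
    }

    static Result split(String html) {
        // 1. Drop dropdown content entirely.
        String noSelect = SELECT.matcher(html).replaceAll(" ");

        // 2. Pull anchor text out of the page and remember it separately.
        List<String> anchors = new ArrayList<>();
        Matcher m = ANCHOR.matcher(noSelect);
        StringBuffer rest = new StringBuffer();
        while (m.find()) {
            anchors.add(clean(m.group(1)));
            m.appendReplacement(rest, " ");
        }
        m.appendTail(rest);

        // 3. Whatever is left, minus tags, is the plain text.
        return new Result(clean(rest.toString()), anchors);
    }

    // Strip tags and collapse runs of whitespace.
    private static String clean(String s) {
        return TAG.matcher(s).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        Result r = split("<p>Digital cameras</p>"
            + "<select><option>Any</option><option>Camcorders</option></select>"
            + "<a href=\"/deals\">Deals</a>");
        System.out.println("text    = " + r.text);
        System.out.println("anchors = " + r.anchors);
    }
}
```

In this sketch the dropdown entries are discarded rather than stored anywhere;
whether they should instead be kept alongside the anchors is a separate
design decision.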

Where should I start? It would be nice to have plugins that can modify the
Fetcher's behaviour...

