nutch-user mailing list archives

From lewis john mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: custom extractor
Date Fri, 08 Jul 2011 20:25:38 GMT
Hi C.B.,

Your description gets slightly cloudy towards the end, e.g. around "One
difficulty with my htmlcleaner...taken from firebug".

Are you trying to say that some of the pages contain malformed HTML, and
that you know this because Firebug flags it? If this is the case, are you
able to edit the HTML and make it well-formed?
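
One possible cause worth ruling out, offered as a guess rather than a diagnosis: Firebug copies XPaths from the browser's live DOM, and browsers insert a tbody element into tables whose source markup has none. A path containing /tbody can therefore fail against a DOM built from the raw page source. The sketch below uses only the JDK's javax.xml.xpath classes on a made-up, well-formed page; the class name FirebugXPathCheck and the helper stripTbody are hypothetical, and it does not use HtmlCleaner's API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class FirebugXPathCheck {

    // Hypothetical helper: drop "/tbody" steps (with an optional numeric
    // index) that a browser DOM may contain but the page source may not.
    static String stripTbody(String xpath) {
        return xpath.replaceAll("/tbody(\\[\\d+\\])?", "");
    }

    // Evaluate an XPath against well-formed XHTML and return the string value
    // of the first match, or an empty string when nothing matches.
    static String evaluate(String xpath, String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate(xpath, doc);
    }

    public static void main(String[] args) throws Exception {
        // Made-up page source: the table has no tbody element.
        String source = "<html><body><table>"
                + "<tr><td>Weight</td><td>2 kg</td></tr>"
                + "</table></body></html>";

        // Path as a browser tool might report it, including /tbody.
        String fromBrowser = "/html/body/table/tbody/tr/td[2]";

        System.out.println("browser path : [" + evaluate(fromBrowser, source) + "]");
        System.out.println("stripped path: [" + evaluate(stripTbody(fromBrowser), source) + "]");
    }
}
```

Stripping tbody blindly would break pages whose source does contain one, so this is only a way to test the hypothesis, not a general fix.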

It would also be of great help if you could post a small example of the
type of XPath extraction you are looking to do. If anyone has built plugins
implementing XPath (which I have not), they may then be able to comment
further.
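
To make the request concrete, an example of "multiple XPaths per site, each mapped to its own field" might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the host shop.example.com, the field names, the XPath expressions and the class name XPathFieldExtractor are all made up. It uses only the JDK's javax.xml.xpath classes, assumes the page has already been cleaned into well-formed XHTML, and does not touch any Nutch or Solr API.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathFieldExtractor {

    // Hypothetical per-site rules: host -> (field name -> XPath expression).
    static final Map<String, Map<String, String>> RULES = new LinkedHashMap<>();
    static {
        Map<String, String> shop = new LinkedHashMap<>();
        shop.put("product_name", "//h1[@class='title']");
        shop.put("weight", "//table[@id='specs']//tr[td[1]='Weight']/td[2]");
        RULES.put("shop.example.com", shop);
    }

    // Apply every rule registered for the host and collect the non-empty
    // results, keyed by field name. Unknown hosts yield an empty map.
    static Map<String, String> extract(String host, String xhtml) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        Map<String, String> rules = RULES.get(host);
        if (rules == null) {
            return fields;
        }
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate() returns the string value of the first matching node,
            // or an empty string when the expression matches nothing.
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                fields.put(rule.getKey(), value);
            }
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Made-up, well-formed sample page.
        String page = "<html><body><h1 class='title'>Widget</h1>"
                + "<table id='specs'><tr><td>Weight</td><td>2 kg</td></tr></table>"
                + "</body></html>";
        System.out.println(extract("shop.example.com", page));
    }
}
```

Each key in the returned map would correspond to one index field; how that map is handed to the indexer depends on the plugin approach chosen and is left out here.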




On Wed, Jul 6, 2011 at 5:10 PM, Cam Bazz <cambazz@gmail.com> wrote:

> Hello,
>
> Previously I built a primitive crawler in Java, extracting
> certain information per HTML page using XPaths. Then I discovered
> Nutch, and now I want to be able to extract certain elements in the DOM
> through XPath, with multiple XPaths per site.
>
> I am crawling a number of web sites, let's say 16, and I would like to
> be able to write multiple XPaths per site, and then index the output
> of each extraction in Solr as a separate field.
>
> I have googled for a while, and I understand that a plugin can be
> developed that will act as a custom HTML parser. I understand that
> another path is using Tika.
>
> I have also experimented with the boilerpipe library, and it was
> insufficient to extract the data I want. (I am extracting
> specifications of certain products, usually in tables, and
> fragmented.)
>
> One difficulty with my HtmlCleaner-based XPath evaluator was that
> real-world HTML was sometimes broken, and even when I cleaned it,
> HtmlCleaner would not find XPaths taken from Firebug.
>
> Which way should I start?
>
> Any ideas, help, or recommendations greatly appreciated,
>
> Best Regards,
> C.B.
>



-- 
*Lewis*
