nutch-user mailing list archives

From alx...@aim.com
Subject Re: parse and solrindex in nutch-2.0
Date Tue, 03 Jul 2012 20:19:05 GMT
Hi,

I was planning to parse the img tags from a URL's content and put them in the metadata field
of the WebPage storage class in Nutch 2.0, so that I can retrieve them later in the indexing step.
However, the Parse class in Nutch 2.0 has no metadata member (it only has members such as
outlinks), whereas the Parse class in Nutch 1.x does carry metadata, so this cannot be done there.
One is restricted to the putToMetadata function of the WebPage class, which overwrites values:
if I put two metadata entries, img_alt:alt1 and img_alt:alt2, only the last value, img_alt:alt2,
ends up in the metadata field.
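
To make the overwrite behaviour described above concrete, here is a minimal sketch of one possible workaround: join all alt values into a single metadata value and split them again at indexing time. This is only an illustration, not something confirmed for Nutch 2.0. The class ImgAltMetadata, the separator SEP, and the helpers putAll/getAll are hypothetical names, and a plain HashMap stands in for the WebPage metadata map so that the example runs without Nutch on the classpath.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper; a HashMap stands in for the WebPage metadata map.
class ImgAltMetadata {
    // Assumed separator: a control character unlikely to occur in alt text.
    private static final String SEP = "\u0001";

    // One value per key, so a second put with the same key overwrites the first.
    private final Map<String, ByteBuffer> metadata = new HashMap<>();

    // Overwriting put, analogous to the behaviour described for putToMetadata.
    public void put(String key, String value) {
        metadata.put(key, ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));
    }

    // Returns the stored value as a string, or null if the key is absent.
    public String get(String key) {
        ByteBuffer buf = metadata.get(key);
        if (buf == null) {
            return null;
        }
        // duplicate() leaves the position of the stored buffer untouched.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    // Workaround: store every value under one key, joined by SEP.
    public void putAll(String key, List<String> values) {
        put(key, String.join(SEP, values));
    }

    // Split the joined value back into the individual values.
    public List<String> getAll(String key) {
        String joined = get(key);
        if (joined == null || joined.isEmpty()) {
            return new ArrayList<>();
        }
        // Limit -1 keeps trailing empty alt values instead of dropping them.
        return Arrays.asList(joined.split(Pattern.quote(SEP), -1));
    }

    public static void main(String[] args) {
        ImgAltMetadata page = new ImgAltMetadata();

        // Two separate puts: only the last value survives.
        page.put("img_alt", "alt1");
        page.put("img_alt", "alt2");
        System.out.println(page.get("img_alt"));    // prints "alt2"

        // Joined put: both values can be recovered.
        page.putAll("img_alt", Arrays.asList("alt1", "alt2"));
        System.out.println(page.getAll("img_alt")); // prints "[alt1, alt2]"
    }
}
```

In such a scheme the indexing step would call something like getAll and add each element as a separate value of a multivalued index field. Whether that fits the Nutch 2.0 parse and index plugins is an open question here.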

So my question is: how can img tag alt values be indexed in Nutch 2.0, given that a crawled
URL can contain more than one img tag?
Do I need to parse them and store them in one of the fields of the WebPage storage class, or is
this step not needed?

Thanks.
Alex.



-----Original Message-----
From: Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
To: user <user@nutch.apache.org>
Sent: Tue, Jul 3, 2012 5:08 am
Subject: Re: parse and solrindex in nutch-2.0


Hi,

On Mon, Jul 2, 2012 at 8:21 PM,  <alxsss@aim.com> wrote:

> Regarding the metadata, what would be a proper way of parsing and indexing
> multivalued tags in nutch-2.0 then?
>

Assuming you've taken a look at the schema, 'some' multivalued fields
are permitted out of the box. Are you having problems obtaining
multiple values for some fields within the documents you're trying to
parse + index?

Lewis

 
