nutch-user mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: Machine readable vs. human readable URLs.
Date Mon, 19 Sep 2011 21:23:05 GMT
> In addition, it looks like you are misinterpreting how the urlmeta plugin
> works, Chip. It is designed to pick up additional meta tags with name and
> content values respectively, e.g.
>
> <meta name="humanURL" content="blahblahblah">
>

Sorry Lewis, but it does not do that at all. See the link I gave earlier for a
description of urlmeta. I agree that the name is misleading: it does not
extract content from the page but simply uses the crawldb metadata.
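
To illustrate where that crawldb metadata comes from: the seed file can carry
key=value pairs after each URL, separated by tab characters, and those pairs
are what end up attached to the URL. The Python sketch below is only an
illustration of that line format, not Nutch's own code. The function names
format_seed_line and parse_seed_line are made up for the example, and
stripping surrounding whitespace is an assumption on my part. It also shows
why a literally typed backslash-t would yield no metadata under this reading.

```python
def format_seed_line(url, metadata):
    """Join a URL and its key=value metadata with real tab characters."""
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def parse_seed_line(line):
    """Split a seed line into (url, metadata) on tab characters.

    Hypothetical helper: it mimics the seed-line format described in this
    thread and is not taken from the Nutch source.
    """
    url, *pairs = line.rstrip("\n").split("\t")
    metadata = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if sep:  # skip fields that contain no '='
            metadata[key.strip()] = value.strip()  # stripping is an assumption
    return url.strip(), metadata


# A seed line built with a real tab between the URL and the metadata.
line = format_seed_line(
    "http://www.aip.org/history/ead/20110369.xml",
    {"humanURL": "http://www.aip.org/history/ead/20110369.html"},
)
url, meta = parse_seed_line(line)

# The same content typed with a literal backslash followed by 't'.
bad_line = (
    r"http://www.aip.org/history/ead/20110369.xml \t "
    r"humanURL=http://www.aip.org/history/ead/20110369.html"
)
bad_url, bad_meta = parse_seed_line(bad_line)
```

With a real tab, meta holds the humanURL entry. With the literal two-character
sequence, the whole line is treated as the URL and bad_meta stays empty, so it
is worth checking the seed file in an editor that shows whitespace.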


>
> The plugin then gets this data as well as any additional values added in the
> urlmeta.tags property within nutch-site.xml and adds this to the index, which
> can then be queried.
>
> Does this make sense?
>
> On Mon, Sep 19, 2011 at 9:10 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
> > Hi
> >
> > Since the info is available thanks to the injection you can use the
> > url-meta
> > plugin as-is and won't need to have a custom version.  See
> > https://issues.apache.org/jira/browse/NUTCH-855
> >
> > Apart from that, do not modify the content of \runtime\local\conf\ before
> > re-compiling with Ant, as this will be overwritten. Either modify
> > $NUTCH/conf/nutch-site.xml or recompile THEN modify.
> >
> > As Lewis suggested, check the logs and see if the plugin is activated, etc.
> >
> > J.
> >
> >
> > On 19 September 2011 21:03, Chip Calhoun <ccalhoun@aip.org> wrote:
> >
> > > Hi Lewis,
> > >
> > > My probably wrong understanding was that I'm supposed to add the tags for
> > > my new field to my list of seed URLs. So if I have a seed URL followed by
> > > "\t humanURL=http://www.aip.org/history/ead/20110369.html", I get a
> > > new field called "humanURL" which is populated with the string I've
> > > specified for that specific URL. I may just be greatly misunderstanding how
> > > this plugin works.
> > >
> > > I've checked my Nutch logs now and it looks like nothing happened. The new
> > > field does at least show up in the Solr admin UI's schema, but clearly my
> > > problem is on the Nutch end of things.
> > >
> > > -----Original Message-----
> > > From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
> > > Sent: Monday, September 19, 2011 3:34 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Machine readable vs. human readable URLs.
> > >
> > > Hi Chip,
> > >
> > > There is no need to run ant war; there is no war target in the build.xml
> > > file for Nutch >= 1.3.
> > >
> > > Can you explain more about adding 'the tags to %NUTCH_HOME%' etc.? Do you
> > > mean you've added your seed URLs?
> > >
> > > Have you had a look at any of your log output as to whether the urlmeta
> > > plugin is loaded and used when fetching?
> > >
> > > You should be able to get info on your schema, fields, etc. within the Solr
> > > admin UI.
> > >
> > > On Mon, Sep 19, 2011 at 8:09 PM, Chip Calhoun <ccalhoun@aip.org> wrote:
> > >
> > > > Hi Julien,
> > > >
> > > > Thanks, that's encouraging. I'm trying to make this work, and I'm
> > > > definitely missing something. I hope I'm not too far off the mark.
> > > > I've started with the instructions at
> > > > http://wiki.apache.org/nutch/WritingPluginExample . If I understand
> > > > this properly, the changes I needed to make were the following:
> > > >
> > > > In Nutch:
> > > > Paste the prescribed block of code into
> > > > %NUTCH_HOME%\runtime\local\conf\nutch-site.xml. This tells Nutch to
> > > > look for and run the urlmeta plugin.
> > > > In %NUTCH_HOME%, run "ant war".
> > > > Add the tags to %NUTCH_HOME%\runtime\local\urls\nutch. A line in this
> > > > file now looks like: "http://www.aip.org/history/ead/20110369.xml \t
> > > > humanURL=http://www.aip.org/history/ead/20110369.html"
> > > >
> > > > In Solr:
> > > > Added my new tag to %SOLR_HOME%\example\solr\conf\schema.xml. The new
> > > > line consists of: "<field name="humanURL" type="string" stored="true"
> > > > indexed="false"/>"
> > > >
> > > > I've redone the indexing, and my new field still doesn't show up in
> > > > the search results. Can you tell where I'm going wrong?
> > > >
> > > > Thanks,
> > > > Chip
> > > >
> > > > -----Original Message-----
> > > > From: Julien Nioche [mailto:lists.digitalpebble@gmail.com]
> > > > Sent: Friday, September 16, 2011 4:37 AM
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Machine readable vs. human readable URLs.
> > > >
> > > > Hi Chip,
> > > >
> > > > Should simply be a matter of creating a custom field with an
> > > > IndexingFilter, you can then use it in any way you want on the SOLR
> > > > side
> > > >
> > > > Julien
> > > >
> > > > On 15 September 2011 21:50, Chip Calhoun <ccalhoun@aip.org> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > We'd like to use Nutch and Solr to replace an existing Verity search
> > > > > that's become a bit long in the tooth. In our Verity search, we have
> > > > > a hack which allows each document to have a machine-readable URL
> > > > > which is indexed (generally an xml document), and a human-readable
> > > > > URL which we actually send users to. Has anyone done the same with
> > > > > Nutch and Solr?
> > > > >
> > > > > Thanks,
> > > > > Chip
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *
> > > > *Open Source Solutions for Text Engineering
> > > >
> > > > http://digitalpebble.blogspot.com/
> > > > http://www.digitalpebble.com
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>
>
>
> --
> *Lewis*
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
