nutch-dev mailing list archives

From "david.stuart@progressivealliance.co.uk" <david.stu...@progressivealliance.co.uk>
Subject Re: solr index question
Date Thu, 15 Oct 2009 19:38:05 GMT
Hi Andrzej,

Patch supplied in the ticket.

Regards,

Dave

On 13 October 2009 at 23:04 Andrzej Bialecki <ab@getopt.org> wrote:

> david.stuart@progressivealliance.co.uk wrote:
> >   Hi,
> > 
> > I am beginning to use Nutch to crawl sites (great stuff, btw) and have 
> > combined it with Solr, pushing the Nutch index using the solrindex 
> > command. I have set it up as specified on the wiki, using the copyField 
> > from url to id in the schema. Whilst this works fine, it stuffs up my 
> > inputs from other sources in Solr (e.g. using the Solr data import 
> > handler), as they have both ids and urls.
> > My question is: why was the id field not pushed to Solr directly, 
> > instead of using this weird copy field, given that you already know the 
> > id is going to be the url? Are there any plans to change this, or was 
> > this a design decision made for other reasons? Could we look at 
> > implementing a Nutch XML schema defining what the basic Nutch fields 
> > map to in the Solr push? I have hacked a fix into SolrWriter.java but 
> > was wondering if it could be worked through into a long-term supported 
> > option.
> 
> This comes from the fact that Nutch doesn't really know the schema that 
> you are using in Solr, plus the fact that the functional equivalent of 
> "uniqueKey" in Nutch has always been named "url", which is hardcoded in 
> some places ... so, this is a deficiency in Nutch as well. Please note 
> that the reverse is true as well - SolrSearchBean hardcodes Solr's 
> uniqueKey to "id" instead of using a configurable name.
> 
> I agree that both these places should use configurable names. Can you 
> provide a patch?
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
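
For readers following the thread, the idea of configurable field names discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch attached to the ticket and not Nutch's actual SolrWriter code: the class FieldMappingSketch, the method toSolrFields, and the parameters renames, sourceKeyField and uniqueKeyField are all hypothetical names, and documents are modelled as plain maps rather than Nutch or Solr document types.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch only: NOT Nutch's SolrWriter and NOT the patch in the ticket.
 * Shows how the source key field ("url") and the destination unique key ("id")
 * could be passed in as configuration instead of being hardcoded.
 */
public class FieldMappingSketch {

    /**
     * Builds the outgoing field map.
     *
     * @param nutchDoc       source fields, modelled here as a plain map (assumption)
     * @param renames        configurable source-name to destination-name table
     * @param sourceKeyField name of the field acting as the key on the source side
     * @param uniqueKeyField name of the unique key field on the destination side
     */
    static Map<String, Object> toSolrFields(Map<String, Object> nutchDoc,
                                            Map<String, String> renames,
                                            String sourceKeyField,
                                            String uniqueKeyField) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : nutchDoc.entrySet()) {
            // Fields without an explicit mapping keep their original name.
            String dest = renames.getOrDefault(e.getKey(), e.getKey());
            out.put(dest, e.getValue());
        }
        Object key = nutchDoc.get(sourceKeyField);
        if (key == null) {
            throw new IllegalArgumentException("missing key field: " + sourceKeyField);
        }
        // Populate the unique key from the configured source field,
        // instead of relying on a copyField rule in the schema.
        out.put(uniqueKeyField, key);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("url", "http://example.com/");
        doc.put("title", "Example");

        Map<String, String> renames = new HashMap<>();
        renames.put("title", "page_title"); // illustrative rename only

        System.out.println(toSolrFields(doc, renames, "url", "id"));
    }
}
```

With both names supplied by the caller, an index that already uses id for other sources could be given a different unique key name without touching the schema's copyField rules.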