nutch-user mailing list archives

From "Richard Grantham" <>
Subject Bugs in the subcollections plugin
Date Thu, 03 Sep 2009 10:14:57 GMT
Hi List,

I think I may have discovered a bug or two in the subcollection plugin.

I indexed a couple of sites with the subcollection plugin enabled to
evaluate this feature of Nutch but was unable to search within a
subcollection. I analysed the index with Luke and confirmed the
subcollection had been indexed but I still was unable to search for the
specific value. After some time I noticed a superfluous space at the
beginning of the indexed value, and a quick test showed that adding the
space to my query gave me the results I expected.

I've looked into this and I believe the problem originates in the
getSubCollections() method of CollectionManager:

  public String getSubCollections(final String url) {
    String collections = "";
    final Iterator iterator = collectionMap.values().iterator();

    while (iterator.hasNext()) {
      final Subcollection subCol = (Subcollection) iterator.next();
      if (subCol.filter(url) != null) {
        // Every name is prefixed with a space, including the first one.
        collections += " " + subCol.name;
      }
    }
    if (LOG.isTraceEnabled()) { LOG.trace("subcollections:" + collections); }
    return collections;
  }

This could be fixed by returning collections.trim() or by rewriting the
while loop that builds the string. In addition, the code (and its
comment) suggests that a single URL can belong to multiple
subcollections. Given that the subcollection value is indexed
untokenized, would I be correct in concluding that a URL belonging to
multiple subcollections cannot be found when the search is restricted
to one particular subcollection?
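
To illustrate both points, here is a minimal standalone sketch. It is not
the Nutch code: the class and method names are hypothetical, a plain list
of names stands in for collectionMap, and exactMatch() is only an assumed
stand-in for how a query against an untokenized field behaves.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Nutch.
final class SubcollectionJoinSketch {

  // Mirrors the loop described above: each name is prefixed with a
  // space, so the result always starts with one.
  static String joinWithLeadingSpace(List<String> names) {
    String collections = "";
    for (String name : names) {
      collections += " " + name;
    }
    return collections;
  }

  // Rewritten loop: the separator is only added between names.
  static String joinWithoutLeadingSpace(List<String> names) {
    StringBuilder collections = new StringBuilder();
    for (String name : names) {
      if (collections.length() > 0) {
        collections.append(' ');
      }
      collections.append(name);
    }
    return collections.toString();
  }

  // Assumed stand-in for a term query on an untokenized field:
  // the whole stored value must equal the query term.
  static boolean exactMatch(String indexedValue, String query) {
    return indexedValue.equals(query);
  }

  public static void main(String[] args) {
    List<String> one = Arrays.asList("news");
    List<String> two = Arrays.asList("news", "sport");

    // Leading space breaks the match even for a single subcollection.
    System.out.println(exactMatch(joinWithLeadingSpace(one), "news"));
    // Without the leading space a single subcollection matches.
    System.out.println(exactMatch(joinWithoutLeadingSpace(one), "news"));
    // With two subcollections the joined value is one term, so
    // neither name matches on its own.
    System.out.println(exactMatch(joinWithoutLeadingSpace(two), "news"));
  }
}
```

If that stand-in is a fair model, trimming alone would fix the
single-subcollection case but not the multiple-subcollection one; that
would need the field to be tokenized or one field value per subcollection.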



Richard Grantham

Limehouse Software Ltd

DDI:  01628 640 453
Main: 01628 640 401 
Fax:  01628 640 461 

Limehouse Software Ltd
St Cloud Gate 
St Cloud Way 
Cookham Road 
Maidenhead, Berks
SL6 8XD - Unifying Information

Limehouse Software Limited - An Objective Company

