nutch-dev mailing list archives

From "Gabriele Kahlout (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-971) IndexMerger produces indexes itself cannot merge anymore
Date Sat, 26 Mar 2011 11:12:05 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabriele Kahlout updated NUTCH-971:
-----------------------------------

    Attachment: IndexMerger-part.diff

The patch checks whether the output path ends in a part directory and, if it does not, appends one.
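
For readers who cannot apply the patch, the same check can be approximated outside of Nutch. The following is only a minimal sketch in POSIX shell, not the code in IndexMerger-part.diff: the function name ensure_part_dir is hypothetical, the "part-" prefix test is an assumption about what counts as a part directory, and the default name part-00000 is borrowed from the directory the indexer produces in the report.

```shell
# Hypothetical helper, NOT the code from IndexMerger-part.diff.
# Prints the given output path, appending a part directory when the
# last path component does not already start with "part-".
ensure_part_dir() {
  out=${1%/}                       # drop one trailing slash, if any
  case "${out##*/}" in             # inspect the last path component
    part-*) printf '%s\n' "$out" ;;              # already a part dir
    *)      printf '%s\n' "$out/part-00000" ;;   # assumed default name
  esac
}
```

It could then wrap the output argument of the merge command, e.g. bin/nutch merge "$(ensure_part_dir crawl/temp_indexes)" crawl/indexes crawl/new_indexes.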

> IndexMerger produces indexes itself cannot merge anymore
> --------------------------------------------------------
>
>                 Key: NUTCH-971
>                 URL: https://issues.apache.org/jira/browse/NUTCH-971
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.2
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>              Labels: patch
>             Fix For: 1.3
>
>         Attachments: IndexMerger-part.diff
>
>
> Here's what I do:
> 1. index the fetched segs
> $ rm -r $new_indexes $temp_indexes
> $ bin/nutch index $new_indexes $it_crawldb crawl/linkdb crawl/segments/*
>  
> I examine the index with Luke and it's as expected.
> 2. merge the new index with the previous
> $ bin/nutch merge $temp_indexes $new_indexes $indexes
> IndexMerger: starting at 2011-03-26 10:24:58
> IndexMerger: merging indexes to: crawl/temp_indexes
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 10:24:59, elapsed: 00:00:01
> On the first iteration, when $indexes is empty, this works fine: it essentially duplicates
> $new_indexes into $temp_indexes.
> But on the 2nd iteration, after I mv $temp_indexes $indexes[1], the merged index $temp_indexes
> contains only $new_indexes and nothing from $indexes, which still contains the data
> from the previous iteration. That is, it doesn't merge.
> This unexpected merge behavior is NOT symmetric, i.e.
> $ bin/nutch merge $temp_indexes $indexes $new_indexes
> IndexMerger: starting at 2011-03-26 10:32:15
> IndexMerger: merging indexes to: crawl/temp_indexes
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 10:32:16, elapsed: 00:00:01
> The moral of the story is that a merged index cannot be merged with another, i.e. bin/nutch
> merge is meant to merge only 2 indexes generated with bin/nutch index (or solrindex, perhaps).
> The only difference I can see between the 2 indexes is that the merged index doesn't contain
> the file index_done (and a hidden companion), but adding those to the merged index before
> merging it again doesn't solve the problem either.
> The workaround that makes the merged index equivalent to one generated by bin/nutch index
> seems to be putting it in a "part" subdirectory:
> bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
> IndexMerger: starting at 2011-03-26 11:18:10
> IndexMerger: merging indexes to: crawl/temp_indexes/part-1
> Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/part-1
> Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
> IndexMerger: finished at 2011-03-26 11:18:12, elapsed: 00:00:01
> Where is this documented? Rather than documenting it, I'd recommend having IndexMerger
> handle it, as in the attached patch.
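
Until IndexMerger handles this itself, one full iteration of the workaround described in the report might look roughly like the sketch below. It is an illustration under assumptions, not a tested recipe: the function name merge_iteration and the NUTCH variable are hypothetical, the subdirectory name part-1 is simply the one used in the report, and removing the old $indexes before the mv is an added step that the report does not show.

```shell
# Sketch of one merge iteration using the "part" subdirectory workaround.
# NUTCH is a hypothetical override hook; it defaults to the command
# used in the report.
NUTCH=${NUTCH:-bin/nutch}

merge_iteration() {
  indexes=$1 new_indexes=$2 temp_indexes=$3

  # Start from a clean temporary directory, as in the report.
  rm -rf "$temp_indexes"

  # Write the merged index into a part subdirectory so that a later
  # merge treats it like an index produced by "bin/nutch index".
  "$NUTCH" merge "$temp_indexes/part-1" "$indexes" "$new_indexes" || return 1

  # Assumption: replace the old base index with the merged result.
  # Removing $indexes first keeps mv from nesting one directory in the other.
  rm -rf "$indexes"
  mv "$temp_indexes" "$indexes"
}
```

A call such as merge_iteration crawl/indexes crawl/new_indexes crawl/temp_indexes would correspond to the directories named in the report.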

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
