From: "Andrzej Bialecki (JIRA)"
To: nutch-dev@incubator.apache.org
Date: Wed, 1 Feb 2006 10:31:04 +0100 (CET)
Subject: [jira] Commented: (NUTCH-192) meta data support for CrawlDatum
Message-ID: <1062435636.1138786264480.JavaMail.jira@ajax.apache.org>
In-Reply-To: <718440621.1138666652937.JavaMail.jira@ajax.apache.org>

[ http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364782 ]

Andrzej Bialecki commented on NUTCH-192:
-----------------------------------------

There is a very real hazard in
the fact that we don't store the dictionary. Let's consider this example: two plugins invoke WritableName.setName() with different classes, ClassA and ClassB. We get the mapping ClassA -> 23, ClassB -> 24. The files written by these plugins use just the byte IDs, 23 and 24. Then someone changes the config file, and the plugins are initialized in reverse order, so we get ClassB -> 23, ClassA -> 24. Now the plugins cannot read the files they created, because MapWritable returns the wrong class ...

So, I'm still convinced that we need to save the dictionary. Unfortunately, for small amounts of metadata (the typical use case) it blows up the on-disk size of MapWritable, which is why I thought using Strings would be cheaper ...

Other things:

In the javadoc for MapWritable it should be mentioned that any Writable type that one is going to use must first be registered with WritableName.setName(). Or perhaps the method could do it automatically, but then the IDs would be unpredictable, depending on the order of iteration (which leads to the problem described above).

Also, there is a bug in setName(): if you try adding the same mapping twice (which could happen in different places), the method should allocate just one ID for the class. As it is now, it allocates a new ID each time you call the method, even if the class name is the same.
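
To make the ordering hazard and the "save the dictionary" idea concrete, here is a minimal sketch. ClassIdRegistry and all of its methods are hypothetical stand-ins invented for illustration; this is not the WritableName or MapWritable code from the patch, and the IDs here start at 0 rather than at 23/24 as in the example above. It assumes the dictionary is stored as an entry count followed by (id, class name) pairs, which is only one possible layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical registry that hands out byte IDs in registration order. */
final class ClassIdRegistry {
    private final Map<String, Byte> nameToId = new LinkedHashMap<>();
    private final Map<Byte, String> idToName = new LinkedHashMap<>();

    /** Returns the ID for a class name; a repeated registration reuses the existing ID. */
    synchronized byte register(String className) {
        Byte existing = nameToId.get(className);
        if (existing != null) {
            return existing;
        }
        if (nameToId.size() > Byte.MAX_VALUE) {
            throw new IllegalStateException("no free byte IDs left");
        }
        byte id = (byte) nameToId.size();
        nameToId.put(className, id);
        idToName.put(id, className);
        return id;
    }

    /** Returns the class name registered under the ID, or null if there is none. */
    synchronized String nameOf(byte id) {
        return idToName.get(id);
    }

    /** Serializes the dictionary: entry count, then (id, class name) pairs. */
    synchronized byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(idToName.size());
            for (Map.Entry<Byte, String> entry : idToName.entrySet()) {
                out.writeByte(entry.getKey());
                out.writeUTF(entry.getValue());
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Rebuilds the writer's dictionary, independent of local registration order. */
    static ClassIdRegistry fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            ClassIdRegistry restored = new ClassIdRegistry();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                byte id = in.readByte();
                String name = in.readUTF();
                restored.nameToId.put(name, id);
                restored.idToName.put(id, name);
            }
            return restored;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassIdRegistry writer = new ClassIdRegistry();
        byte idA = writer.register("ClassA");
        writer.register("ClassB");

        // A reader whose plugins were initialized in the opposite order.
        ClassIdRegistry reader = new ClassIdRegistry();
        reader.register("ClassB");
        reader.register("ClassA");

        // The reader's own table maps the writer's ID to the wrong class.
        System.out.println(reader.nameOf(idA)); // prints "ClassB"
        // The dictionary saved by the writer resolves it correctly.
        System.out.println(fromBytes(writer.toBytes()).nameOf(idA)); // prints "ClassA"
    }
}
```

The cost mentioned above is visible in toBytes(): every stored dictionary repeats the full class names, which is significant when the metadata itself is only a few bytes.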

To fix the duplicate-ID bug in setName(), just add this:

  public static synchronized void setName(Class writableClass, String name) {
    Object o = CLASS_TO_NAME.put(writableClass, name);
    NAME_TO_CLASS.put(name, writableClass);
    if (o != null) return; // already has an ID
    CLASS_TO_ID.put(writableClass, new Byte((byte)CLASS_TO_ID.size()));
    ID_TO_CLASS.put(new Byte((byte)ID_TO_CLASS.size()), writableClass);
  }

> meta data support for CrawlDatum
> --------------------------------
>
> Key: NUTCH-192
> URL: http://issues.apache.org/jira/browse/NUTCH-192
> Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Fix For: 0.8-dev
> Attachments: metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help to get a set of new nutch features realized and makes a lot possible to smaller special focused search engines.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira