nutch-user mailing list archives

From Koch Martina <>
Subject AW: DC metadata
Date Wed, 23 Sep 2009 06:41:55 GMT

I don't know the howto you're referring to but I think it belongs to an older version of Nutch.

Let me try to explain...

doc.add("key","value") adds a new field to the document "doc" with the name "key" and
the value "value". With that alone, the indexer only knows that there is another field to
add; it doesn't know whether the field should be stored, tokenized, given term vectors and so on.
To tell the indexer how to index this field, you have to add a new line to the
addIndexBackendOptions(Configuration conf) method. Every indexing filter has to implement this method.

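public void addIndexBackendOptions(Configuration conf) {
	// "key": stored but not indexed (can be retrieved, cannot be searched)
	LuceneWriter.addFieldOptions("key", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);
	// "key2": not stored, indexed and tokenized, term vectors with positions
	LuceneWriter.addFieldOptions("key2", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED, LuceneWriter.VECTOR.POS, conf);
}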

You need a parsing filter to extract data from the URLs you're crawling. I'm not aware of
a DC metadata parser, so you need to write a parsing filter first, to extract the relevant
data for you. Then you can index this data with the indexing filter you wrote.
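
To give an idea of the extraction step, here is a minimal sketch in plain Java. It is not a Nutch parsing filter and does not use any Nutch API: the class DcMetaExtractor and its extract method are made-up names, and it assumes the pages carry Dublin Core as meta tags of the form <meta name="DC.title" content="...">, with double-quoted attributes in that order. A real parsing filter would more likely walk the parsed DOM than use a regular expression.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects Dublin Core <meta> tags from raw HTML.
class DcMetaExtractor {

    // Assumption: name comes before content and both values are double-quoted.
    private static final Pattern DC_META = Pattern.compile(
        "<meta\\s+name=\"(dc\\.[^\"]+)\"\\s+content=\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns lower-cased element name (e.g. "dc.title") -> content, in document order.
    // If an element occurs more than once, the last value wins.
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = DC_META.matcher(html);
        while (m.find()) {
            result.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Annual Report\">"
            + "<meta name=\"DC.creator\" content=\"Jane Doe\">"
            + "<meta name=\"keywords\" content=\"ignored\">"
            + "</head><body></body></html>";
        for (Map.Entry<String, String> e : extract(html).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Each extracted pair would then have to reach the indexing filter, where it can be added with doc.add(key, value), with a matching addFieldOptions line for every key.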

Hope this helps.

Kind regards,

-----Original Message-----
Sent: Tuesday, 22 September 2009 23:08
Subject: RE: DC metadata

Any ideas, guys? I'm just stuck here :(

Subject: RE: DC metadata
Date: Fri, 18 Sep 2009 14:12:35 +0000

Hi again,

I just copied the directory of my new plugin (which contains the jar file and the plugin.xml)
to the nutch/plugins directory, and when I index now it gives me this error:

2009-09-18 10:03:44,754 WARN  mapred.LocalJobRunner - job_local_0024
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither
indexed nor stored
        at org.apache.lucene.document.Field.<init>(
        at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(
        at org.apache.nutch.indexer.lucene.LuceneWriter.write(
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(
        at org.apache.hadoop.mapred.ReduceTask$3.collect(
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(
        at org.apache.hadoop.mapred.LocalJobRunner$

Should I write a parser plugin too?


Subject: DC metadata
Date: Thu, 17 Sep 2009 18:30:23 +0000

I'm trying to add Dublin Core metadata to my index. I wrote the plugin as described at

and built the project using ant...
But when I crawled my intranet, I couldn't find the Dublin Core metadata in my index.
Did I misunderstand something?
