nutch-user mailing list archives

From Koch Martina <K...@huberverlag.de>
Subject AW: DC metadata
Date Wed, 23 Sep 2009 06:41:55 GMT
Hi,

I don't know the how-to you're referring to, but I think it was written for an older version of Nutch.

Let me try to explain...

doc.add("key", "value") adds a new field named "key" with the value "value" to the document
"doc". That call only tells the indexer that there is another field to add; it doesn't say
whether the field should be stored, tokenized, given term vectors and so on.
To tell the indexer how to index this field, you have to add a line to the
addIndexBackendOptions(Configuration conf) method, which is specified in every indexing filter.
Without such a line the field presumably ends up neither stored nor indexed, which is what the
exception in your stack trace complains about.

Example:
public void addIndexBackendOptions(Configuration conf) {
	LuceneWriter.addFieldOptions("key", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);
	LuceneWriter.addFieldOptions("key2", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED, LuceneWriter.VECTOR.POS, conf);
}

You also need a parsing filter to extract the data from the pages you're crawling. I'm not
aware of an existing DC metadata parser, so you need to write a parsing filter first to extract
the relevant data. Then you can index that data with the indexing filter you wrote.
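
To give an idea of the extraction step, here is a minimal sketch in plain Java. It is not a Nutch plugin: the class name DublinCoreExtractor and the method extract are made up for illustration, and it assumes the pages carry Dublin Core data as <meta name="DC.xxx" content="..."> tags with the name attribute before content. A real parsing filter would more likely walk the parsed DOM it is handed than run a regex over raw HTML, and would put the values into the parse metadata so the indexing filter can read them.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects Dublin Core <meta> tags from raw HTML.
public class DublinCoreExtractor {

    // Assumption: tags look like <meta name="DC.title" content="...">,
    // with the name attribute first and double-quoted values.
    private static final Pattern DC_META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"(DC\\.[^\"]+)\"\\s+content\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    // Returns a map from lower-cased element name (e.g. "dc.title") to its content value.
    public static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = DC_META.matcher(html);
        while (m.find()) {
            result.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Annual Report\">"
            + "<meta name=\"DC.creator\" content=\"Jane Doe\">"
            + "<meta name=\"keywords\" content=\"ignored\">"
            + "</head><body></body></html>";
        // Only the two DC.* tags are picked up; the keywords tag is skipped.
        System.out.println(extract(html));
    }
}
```

Each key found this way (for example "dc.title") is a candidate field name for doc.add(...) in the indexing filter, and each one needs its own addFieldOptions line as shown above.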

Hope this helps.

Kind regards,
Martina




-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com]
Sent: Tuesday, 22 September 2009 23:08
To: nutch-user@lucene.apache.org
Subject: RE: DC metadata


Any ideas, guys? I'm just stuck here :(

mbellil@msn.com




From: mbellil@msn.com
To: nutch-user@lucene.apache.org
Subject: RE: DC metadata
Date: Fri, 18 Sep 2009 14:12:35 +0000








Hi again,

I just copied the directory of my new plugin (which contains the jar file and plugin.xml)
to the nutch/plugins directory, and now when I index I get this error:

2009-09-18 10:03:44,754 WARN  mapred.LocalJobRunner - job_local_0024
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
        at org.apache.lucene.document.Field.<init>(Field.java:279)
        at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
        at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
        at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
        at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)


Should I write a parser plugin too?

thx



From: mbellil@msn.com
To: nutch-user@lucene.apache.org
Subject: DC metadata
Date: Thu, 17 Sep 2009 18:30:23 +0000








Hi,
I'm trying to add Dublin Core metadata to my index. I wrote the plugin as described at http://wiki.apache.org/nutch/CreateNewFilter

and built the project using ant...
But when I crawled my intranet, I couldn't find the Dublin Core metadata in my index.
Did I misunderstand something?

thx
 		 	   		  