nutch-user mailing list archives

From Koch Martina <K...@huberverlag.de>
Subject AW: DC metadata
Date Wed, 23 Sep 2009 14:12:20 GMT
Hi,

the howtos you're referring to are for Nutch 0.9. In Nutch 1.0 the indexing system changed
a little bit.

If you look at the index-basic or index-more plugin you see that the doc.add method changed.

It's no longer doc.add(new Field("category", "puppies", false, true, false)), where you
create a field together with all of its store, tokenize and term-vector options.
In Nutch 1.0 you just add a name-value pair with doc.add("category", "puppies").
How this value should be stored, tokenized and so on is specified in
addIndexBackendOptions(Configuration conf), a method that did not exist in Nutch 0.9.

Have a look at the index-more plugin in Nutch 0.9 and compare it to the index-more plugin
of Nutch 1.0. I think you should then be able to see the difference.
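
To make the split between field values and field options concrete, here is a small self-contained Java model. It is not Nutch code: FieldOptionsSketch, its Store and Index enums and the createField method are hypothetical stand-ins for the roles played by LuceneWriter.addFieldOptions and the index writer. The fallback to "not stored, not indexed" for an unregistered field is an assumption, chosen because it would produce exactly the IllegalArgumentException quoted further down in this thread.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: none of these types are part of Nutch or Lucene.
class FieldOptionsSketch {

    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UNTOKENIZED }

    // Per-field options, registered separately from the document values.
    private final Map<String, Store> fieldStore = new HashMap<>();
    private final Map<String, Index> fieldIndex = new HashMap<>();

    // Stands in for a call like LuceneWriter.addFieldOptions(name, store, index, conf).
    void addFieldOptions(String name, Store store, Index index) {
        fieldStore.put(name, store);
        fieldIndex.put(name, index);
    }

    // Stands in for the writer turning one name-value pair into an index field.
    String createField(String name, String value) {
        Store store = fieldStore.getOrDefault(name, Store.NO); // assumed default
        Index index = fieldIndex.getOrDefault(name, Index.NO); // assumed default
        if (store == Store.NO && index == Index.NO) {
            throw new IllegalArgumentException(
                "it doesn't make sense to have a field that is neither indexed nor stored");
        }
        return name + "=" + value + " [store=" + store + ", index=" + index + "]";
    }

    public static void main(String[] args) {
        FieldOptionsSketch writer = new FieldOptionsSketch();

        // The equivalent of doc.add("category", "puppies"): just a name-value pair.
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("category", "puppies");

        // No options registered yet, so the field cannot be created.
        try {
            writer.createField("category", doc.get("category"));
        } catch (IllegalArgumentException e) {
            System.out.println("without options: " + e.getMessage());
        }

        // After registering options for the field, creation succeeds.
        writer.addFieldOptions("category", Store.YES, Index.UNTOKENIZED);
        System.out.println("with options: "
            + writer.createField("category", doc.get("category")));
    }
}
```

If the real writer behaves like this model, a filter that calls doc.add for a field but never registers options for it in addIndexBackendOptions would fail at index time rather than at build time.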


Kind regards,
Martina


-----Original Message-----
From: BELLINI ADAM [mailto:mbellil@msn.com]
Sent: Wednesday, 23 September 2009 15:45
To: nutch-user@lucene.apache.org
Subject: RE: AW: DC metadata


hi, thank you for your answer...

i was talking about this howto:

CreateNewFilter

Howto add a category metadata to your index and be able to search for it. For
this, you need to write an indexing filter and a query filter.

Indexing your custom metadata

For the indexing filter, copy the index-more plugin, and change names, dirs,
and build files appropriately. The main thing to change is the filter
method:

    public Document filter(Document doc, Parse parse, FetcherOutput fo)

In it, you can add your own fields. To add a new category with value "puppies",
it will look something like this:

    doc.add(new Field("category", "puppies", false, true, false));

See the Document.add API for more info on the booleans. That's pretty much it
for indexing.

Searching your metadata

To search for this, you need to create a query filter. Copy the query-site
plugin. Again change file names, directories, and build files as
needed. The main java file is very simple, just change the string in
the line with "super". Instead of:

    super("site");

You would have:

    super("category");

Make sure that you put your new index-category and query-category plugins in
your nutch-default.xml file. Don't forget to check that it's in your
WEB-INF/classes directory too.



so, as you said, i have to write a parser too. but some people had trouble with this howto:
http://wiki.apache.org/nutch/WritingPluginExample-0.9
it seems it doesn't work for nutch 1.0. do you have any idea what i have to change in this
example to make it work for nutch 1.0?
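
On the parser side, the step that pulls Dublin Core values out of a page can be prototyped without Nutch at all. The sketch below is only an illustration under stated assumptions: the class name DublinCoreMetaSketch and its extract method are hypothetical, the pages are assumed to carry DC values as <meta name="DC.xxx" content="..."> tags with the attributes in exactly that order, and a regular expression is used for brevity where a real parsing filter would presumably read the parsed DOM instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects DC.* meta tags from raw HTML.
class DublinCoreMetaSketch {

    // Matches <meta name="DC.xxx" content="..."> with the attributes in this order only.
    private static final Pattern DC_META = Pattern.compile(
        "<meta\\s+name=\"(DC\\.[^\"]+)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    // Returns the DC tag names mapped to their content values, in document order.
    static Map<String, String> extract(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = DC_META.matcher(html);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"DC.title\" content=\"Puppy care\">"
            + "<meta name=\"DC.creator\" content=\"Jane Doe\" />"
            + "<meta name=\"keywords\" content=\"dogs\">"   // not a DC tag, ignored
            + "</head><body></body></html>";
        System.out.println(extract(html));
    }
}
```

Each extracted pair would then have to be passed along as parse metadata so that the indexing filter can add it to the document with doc.add and register options for it.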

thx a lot



> From: Koch@huberverlag.de
> To: nutch-user@lucene.apache.org
> Date: Wed, 23 Sep 2009 08:41:55 +0200
> Subject: AW: DC metadata
> 
> Hi,
> 
> I don't know the howto you're referring to, but I think it belongs to an older version
> of Nutch.
> 
> Let me try to explain...
> 
> doc.add("key", "value") adds a new field to the document "doc" with the name "key"
> and the value "value". With that alone, the indexer only knows that there is another field
> to be added; it doesn't know whether it should be stored, tokenized, term-vectored and so on.
> To tell the indexer how to index this field, you have to add a new line to the
> addIndexBackendOptions(Configuration conf) method. This method is specified in every
> indexing filter.
> 
> Example:
> public void addIndexBackendOptions(Configuration conf) {
>   // "key": stored, but not indexed
>   LuceneWriter.addFieldOptions("key", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);
>   // "key2": not stored, indexed as tokenized text, term vectors with positions
>   LuceneWriter.addFieldOptions("key2", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED,
>       LuceneWriter.VECTOR.POS, conf);
> }
> 
> You need a parsing filter to extract data from the URLs you're crawling. I'm not aware
> of a DC metadata parser, so you need to write a parsing filter first, to extract the
> relevant data for you. Then you can index this data with the indexing filter you wrote.
> 
> Hope this helps.

> Kind regards,
> Martina
> 
> 
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com]
> Sent: Tuesday, 22 September 2009 23:08
> To: nutch-user@lucene.apache.org
> Subject: RE: DC metadata
> 
> 
> any idea guys ! i'm just stuck here :(
> 
> mbellil@msn.com
> 
> 
> 
> 
> From: mbellil@msn.com
> To: nutch-user@lucene.apache.org
> Subject: RE: DC metadata
> Date: Fri, 18 Sep 2009 14:12:35 +0000
> 
> hi again 
> 
> i just copied the directory of my new plugin (which contains the jar file and the plugin.xml)
> to the nutch/plugins directory, and when i index now it gives me this error:
> 
> 2009-09-18 10:03:44,754 WARN  mapred.LocalJobRunner - job_local_0024
> java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
>         at org.apache.lucene.document.Field.<init>(Field.java:279)
>         at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
>         at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
>         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
>         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
>         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
>         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
> 
> 
> should i write a parser plugin too ??
> 
> thx
> 
> 
> 
> From: mbellil@msn.com
> To: nutch-user@lucene.apache.org
> Subject: DC metadata
> Date: Thu, 17 Sep 2009 18:30:23 +0000
> 
> hi,
> i'm trying to add Dublin Core metadata to my index. i wrote the plugin as described at
> http://wiki.apache.org/nutch/CreateNewFilter
> 
> and i built the project using ant...
> but when i crawled my intranet i couldn't find the Dublin Core metadata in my index.
> did i misunderstand something?
> 
> thx
