nutch-user mailing list archives

From BELLINI ADAM <mbel...@msn.com>
Subject RE: AW: DC metadata
Date Wed, 23 Sep 2009 15:17:22 GMT

Yes, I saw the differences, and I wrote my index-custom plugin the same way as the index-more plugin (nutch-1.0).
But I guess you're right! I didn't use the addFieldOptions method to register the options for my custom fields.
So I will add them in addIndexBackendOptions. As for the parser, do I first have to look at how the
HTML parser is built in Nutch 1.0?

I will take a look at it and let you know...
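
On the parser question, a rough idea of what the parsing side has to produce can be sketched in plain Java: pull the Dublin Core <meta> tags out of an HTML string and return them as name-value pairs. This is not Nutch code. The class DublinCoreMetaSketch and its extract method are hypothetical names, and the regex assumes that the name attribute comes before content and that both use double quotes. A real Nutch 1.0 parse filter would more likely walk the parsed DOM and hand the values over through the parse metadata, so treat this only as an illustration of the extraction step.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: extracts DC.* meta tags from raw HTML.
final class DublinCoreMetaSketch {

    // Assumption: attributes appear as name="DC.xxx" content="yyy", in that order.
    private static final Pattern DC_META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"(DC\\.[^\"]+)\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the DC fields in document order, keyed by lower-cased name. */
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = DC_META.matcher(html);
        while (m.find()) {
            fields.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return fields;
    }
}
```

Each pair returned by extract would then become one doc.add(name, value) call in the indexing filter, with the matching field options registered separately.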



  public void addIndexBackendOptions(Configuration conf) {

    ///////////////////////////
    //    add lucene options //
    ///////////////////////////

    LuceneWriter.addFieldOptions("type", LuceneWriter.STORE.NO,
        LuceneWriter.INDEX.UNTOKENIZED, conf);

    // primaryType and subType are stored, indexed and un-tokenized
    LuceneWriter.addFieldOptions("primaryType", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.UNTOKENIZED, conf);
    LuceneWriter.addFieldOptions("subType", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.UNTOKENIZED, conf);

    LuceneWriter.addFieldOptions("contentLength", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.NO, conf);

    LuceneWriter.addFieldOptions("lastModified", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.NO, conf);

    // un-stored, indexed and un-tokenized
    LuceneWriter.addFieldOptions("date", LuceneWriter.STORE.NO,
        LuceneWriter.INDEX.UNTOKENIZED, conf);
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    MIME = new MimeUtil(conf);
  }
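
To connect this with the "neither indexed nor stored" exception from earlier in the thread, here is a minimal, self-contained sketch of the idea. It does not use the real Nutch or Lucene classes: FieldOptionsSketch, Store, Index and createField are stand-ins invented for illustration, and the fallback to "not stored, not indexed" for a field without registered options is an assumption about how the writer behaves.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in model only; none of these types are the actual Nutch/Lucene API.
final class FieldOptionsSketch {

    enum Store { YES, NO }
    enum Index { NO, TOKENIZED, UNTOKENIZED }

    /** Storage and indexing choice registered for one field name. */
    static final class Options {
        final Store store;
        final Index index;

        Options(Store store, Index index) {
            this.store = store;
            this.index = index;
        }
    }

    private final Map<String, Options> registered = new HashMap<>();

    /** Plays the role of the addFieldOptions calls in addIndexBackendOptions. */
    void addFieldOptions(String field, Store store, Index index) {
        registered.put(field, new Options(store, index));
    }

    /**
     * Plays the role of the writer turning a name-value pair into an index field.
     * Assumption: an unregistered field defaults to Store.NO and Index.NO.
     */
    String createField(String name, String value) {
        Options o = registered.getOrDefault(name, new Options(Store.NO, Index.NO));
        if (o.store == Store.NO && o.index == Index.NO) {
            throw new IllegalArgumentException(
                "it doesn't make sense to have a field that is neither indexed nor stored");
        }
        return name + "=" + value + " [store=" + o.store + ", index=" + o.index + "]";
    }
}
```

In this model, createField fails for a custom field such as dc.title until addFieldOptions has been called for it, which is the situation described above for a filter that only calls doc.add.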

thx :)



> From: Koch@huberverlag.de
> To: nutch-user@lucene.apache.org
> Date: Wed, 23 Sep 2009 16:12:20 +0200
> Subject: AW: DC metadata
> 
> Hi,
> 
> the howtos you're referring to are for Nutch 0.9. In Nutch 1.0 the indexing system changed
> a little bit.
> 
> If you look at the index-basic or index-more plugin you see that the doc.add method changed.
> It's no longer doc.add(new Field("category", "puppies", false, true, false)) -> here
> you create a field with all the store, token and term-vector options.
> In the new Nutch version, you just add a name-value pair with doc.add("category","puppies").
> How this value should be stored, tokenized and so on, you specify in the
> addIndexBackendOptions(Configuration conf) (this didn't exist in Nutch 0.9).
> 
> Have a look at the index-more plugin in Nutch 0.9 and compare it to the index-more plugin
> of Nutch 1.0. I think then you should be able to see the difference.
> 
> 
> Kind regards,
> Martina
> 
> 
> -----Original Message-----
> From: BELLINI ADAM [mailto:mbellil@msn.com]
> Sent: Wednesday, 23 September 2009 15:45
> To: nutch-user@lucene.apache.org
> Subject: RE: AW: DC metadata
> 
> 
> hi, thank you for your answer...
> 
> i was talking about this howto :
> 
> CreateNewFilter
> Howto add a category metadata to your index and be able to search for it. For
> this, you need to write an indexing filter and a query filter.
> 
> Indexing your custom metadata
> For the indexing filter, copy the index-more plugin, and change names, dirs,
> and build files appropriately. The main thing to change is the filter method:
> 
>     public Document filter(Document doc, Parse parse, FetcherOutput fo)
> 
> In it, you can add your own fields. To add a new category with value "puppies", it will
> look something like this:
> 
>     doc.add(new Field("category", "puppies", false, true, false));
> 
> See the Document.add API for more info on the booleans. That's pretty much it for indexing.
> 
> Searching your metadata
> To search for this, you need to create a query filter. Copy the query-site
> plugin. Again change file names, directories, and build files as
> needed. The main java file is very simple, just change the string in
> the line with "super". Instead of:
> 
>     super("site");
> 
> You would have
> 
>     super("category");
> 
> Make sure that you put your new index-category and query-category plugins in
> your nutch-default.xml file. Don't forget to check that it's in your
> WEB-INF/classes directory too.
> 
> 
> 
> So, as you said, I have to write a parser too. But some people had trouble with this howto:
> http://wiki.apache.org/nutch/WritingPluginExample-0.9
> It seems it doesn't work for Nutch 1.0. Do you have any idea what I have to change in this
> example to make it work for Nutch 1.0?
> 
> thx a lot
> 
> 
> 
> > From: Koch@huberverlag.de
> > To: nutch-user@lucene.apache.org
> > Date: Wed, 23 Sep 2009 08:41:55 +0200
> > Subject: AW: DC metadata
> > 
> > Hi,
> > 
> > I don't know the howto you're referring to but I think it belongs to an older version
> > of Nutch.
> > 
> > Let me try to explain...
> > 
> > doc.add("key","value")  -  adds a new field to the document "doc" with the name
> > "key" and the value "value". With that knowledge the indexer just knows there is another
> > field to be added, but it doesn't know if it should be stored, tokenized, termvectored
> > and so on.
> > In order to tell the indexer how to index this field, you have to add a new line
> > to the addIndexBackendOptions(Configuration conf) method. This method is specified in
> > every indexing filter.
> > 
> > Example:
> > public void addIndexBackendOptions(Configuration conf) {
> >     LuceneWriter.addFieldOptions("key", LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf);
> >     LuceneWriter.addFieldOptions("key2", LuceneWriter.STORE.NO, LuceneWriter.INDEX.TOKENIZED,
> >         LuceneWriter.VECTOR.POS, conf);
> > }
> > 
> > You need a parsing filter to extract data from the URLs you're crawling. I'm not
> > aware of a DC metadata parser, so you need to write a parsing filter first, to extract
> > the relevant data for you. Then you can index this data with the indexing filter you wrote.
> > 
> > Hope this helps.
> > 
> > Kind regards,
> > Martina
> > 
> > 
> > 
> > 
> > -----Original Message-----
> > From: BELLINI ADAM [mailto:mbellil@msn.com]
> > Sent: Tuesday, 22 September 2009 23:08
> > To: nutch-user@lucene.apache.org
> > Subject: RE: DC metadata
> > 
> > 
> > any idea guys ! i'm just stuck here :(
> > 
> > mbellil@msn.com
> > 
> > 
> > 
> > 
> > From: mbellil@msn.com
> > To: nutch-user@lucene.apache.org
> > Subject: RE: DC metadata
> > Date: Fri, 18 Sep 2009 14:12:35 +0000
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > hi again 
> > 
> > i just copied the directory of my new plugin 'which contains the jar file and the
> > plugin.xml' to the nutch/plugins directory, and when i index now it gives me this error:
> > 
> > 2009-09-18 10:03:44,754 WARN  mapred.LocalJobRunner - job_local_0024
> > java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
> >         at org.apache.lucene.document.Field.<init>(Field.java:279)
> >         at org.apache.nutch.indexer.lucene.LuceneWriter.createLuceneDoc(LuceneWriter.java:133)
> >         at org.apache.nutch.indexer.lucene.LuceneWriter.write(LuceneWriter.java:239)
> >         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:54)
> >         at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:44)
> >         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)
> >         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:158)
> >         at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
> >         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
> > 
> > 
> > should i write a parser plugin too ??
> > 
> > thx
> > 
> > 
> > 
> > From: mbellil@msn.com
> > To: nutch-user@lucene.apache.org
> > Subject: DC metadata
> > Date: Thu, 17 Sep 2009 18:30:23 +0000
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > hi,
> > I'm trying to add Dublin Core metadata to my index. I wrote the plugin as described
> > at http://wiki.apache.org/nutch/CreateNewFilter
> > 
> > and I built the project using ant...
> > But when I crawled my intranet I couldn't find the Dublin Core metadata in my index.
> > Did I misunderstand something?
> > 
> > thx
> >  		 	   		  
 		 	   		  