nutch-user mailing list archives

From Vijith <vijithkv...@gmail.com>
Subject Re: How to get Term Frequency Vector
Date Tue, 10 Apr 2012 08:45:33 GMT
Yes, I do want to perform concept matching (using an ontology), not
similarity.
So why can't I just update the outlink scores based on this matching and
thus prioritize the crawldb? Any problems with that?
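
Roughly, what I have in mind in the scoring filter is the sketch below.
Only the one method is shown, and "concept.score" is a hypothetical
metadata key that my parse filter would set with the match score:

    import java.util.Collection;
    import java.util.Map.Entry;

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.scoring.ScoringFilterException;

    // Sketch only: the other ScoringFilter methods are omitted.
    public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
        ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
        CrawlDatum adjust, int allCount) throws ScoringFilterException {
      String s = parseData.getContentMeta().get("concept.score");
      float conceptScore = (s == null) ? 0.0f : Float.parseFloat(s);
      for (Entry<Text, CrawlDatum> target : targets) {
        // spread the parent page's concept match score over its outlinks
        CrawlDatum datum = target.getValue();
        datum.setScore(datum.getScore() + conceptScore / allCount);
      }
      return adjust;
    }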

I have read some documentation on UIMA. If I use it for some text
processing (e.g. a tokenizer AE) inside a parse filter, will it scale
when I run Nutch over a Hadoop cluster?
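
What I would try is to build the AE once per task and reuse it for every
document, along these lines ("TokenizerAE.xml" is just a placeholder
descriptor name):

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    // Build the AE once from its XML descriptor, then reuse it.
    XMLInputSource in = new XMLInputSource("TokenizerAE.xml");
    ResourceSpecifier spec =
        UIMAFramework.getXMLParser().parseResourceSpecifier(in);
    AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

    JCas jcas = ae.newJCas();
    jcas.setDocumentText(text);   // the parsed page text
    ae.process(jcas);             // annotations are now in the CAS
    jcas.reset();                 // reuse the CAS for the next document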

I am using the Jena API to read the OWL file.
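
For reference, the ontology loading currently looks roughly like this
("myontology.owl" is a placeholder file name):

    import java.util.Iterator;

    import com.hp.hpl.jena.ontology.OntClass;
    import com.hp.hpl.jena.ontology.OntModel;
    import com.hp.hpl.jena.ontology.OntModelSpec;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    // Load the OWL file into an in-memory ontology model (no reasoner).
    OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
    model.read("file:myontology.owl");

    // Walk the named classes and print their labels.
    for (Iterator<OntClass> it = model.listClasses(); it.hasNext();) {
      OntClass cls = it.next();
      if (!cls.isAnon()) {
        System.out.println(cls.getLocalName() + " -> " + cls.getLabel(null));
      }
    }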


On Sat, Mar 31, 2012 at 11:28 PM, SUJIT PAL <sujit.pal@comcast.net> wrote:

> Hi Vijith,
>
> Not sure if this is what you are asking, since similarity indicates a
> measure of likeness between like things, and a document and an ontology are
> different things...
>
> But I believe you can do this by calculating the "similarity" to an
> ontology using a parse filter and distributing the score to the outlinks in
> a scoring filter. We do something similar (we call it concept mapping) but
> we do it outside Nutch at the moment, basically reading the fetched data,
> decomposing it into phrases and checking each phrase for matches against an
> ontology. We don't compute a score for the entire document; rather, we
> generate a concept map of {ontology_node_id, score} values, which are
> normalized absolute values rather than values relative to other documents
> in the corpus. For us at least, it doesn't make sense to distribute the
> scores in the concept map to the outlinks.
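>
> The phrase lookup itself, in made-up Java (decomposeIntoPhrases(),
> OntologyMatcher and Concept all stand in for our own phrase splitter and
> ontology lookup code), is along these lines:
>
>   // Accumulate an absolute score per matched ontology node.
>   Map<String, Float> conceptMap = new HashMap<String, Float>();
>   OntologyMatcher matcher = new OntologyMatcher(ontology);
>   for (String phrase : decomposeIntoPhrases(docText)) {
>     Concept c = matcher.lookup(phrase);   // null if nothing matched
>     if (c != null) {
>       Float prev = conceptMap.get(c.getId());
>       conceptMap.put(c.getId(),
>           (prev == null ? 0f : prev) + c.getScore());
>     }
>   }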
>
> Regarding the question of APIs and toolkits, it really depends on a bunch
> of things. Does your ontology come with its own API? If so, that would
> dictate how you interact with it. Does your ontology API provide a black
> box where you submit the entire text and it decides what matched and what
> did not? If so, there is nothing to do, just pass your text to it. If not,
> you may need some code to break the text up into chunks your ontology can
> work with. I have experimental evidence that indicates that if I restrict
> my phrase lookups to noun phrases (detected using the OpenNLP chunker with
> the default English language model), I tend to get cleaner mappings, but
> that may not be true for your corpus. Depending on your data and the
> capabilities of your ontology API, you may also want code in your parser
> (UIMA is a candidate toolkit) that extracts entities and reformats them
> (say, dates) into a normalized form your ontology can match against.
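>
> For reference, the noun phrase extraction looks roughly like this with
> the OpenNLP 1.5 API (the .bin files are the standard English models from
> the OpenNLP site, and join() is a made-up helper):
>
>   import java.io.FileInputStream;
>   import opennlp.tools.chunker.ChunkerME;
>   import opennlp.tools.chunker.ChunkerModel;
>   import opennlp.tools.postag.POSModel;
>   import opennlp.tools.postag.POSTaggerME;
>   import opennlp.tools.tokenize.TokenizerME;
>   import opennlp.tools.tokenize.TokenizerModel;
>   import opennlp.tools.util.Span;
>
>   TokenizerME tokenizer = new TokenizerME(
>       new TokenizerModel(new FileInputStream("en-token.bin")));
>   POSTaggerME tagger = new POSTaggerME(
>       new POSModel(new FileInputStream("en-pos-maxent.bin")));
>   ChunkerME chunker = new ChunkerME(
>       new ChunkerModel(new FileInputStream("en-chunker.bin")));
>
>   String[] tokens = tokenizer.tokenize(sentence);
>   String[] tags = tagger.tag(tokens);
>   // chunkAsSpans() groups tokens into typed chunks; keep the NPs only
>   for (Span span : chunker.chunkAsSpans(tokens, tags)) {
>     if ("NP".equals(span.getType())) {
>       String phrase = join(tokens, span.getStart(), span.getEnd());
>       // ...look this phrase up against the ontology...
>     }
>   }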
>
> -sujit
>
> On Mar 30, 2012, at 10:48 PM, Vijith wrote:
>
> > Pardon me, there was a mistake...
> > I want to update the outlink score based on the similarity, not the
> > document...
> >
> > On Fri, Mar 30, 2012 at 10:41 PM, Vijith <vijithkv.87@gmail.com> wrote:
> >
> >> Thanks...
> >> Here is my problem: I am building a plugin that would convert Nutch
> >> into a focused crawler. I want to find the similarity of a document to
> >> an ontology during the crawl phase and then update the document score
> >> (set by OPIC?) in a scoring filter.
> >>
> >> So how about using some APIs or toolkits (Tika, UIMA - I have never
> >> used them before) in a parse filter to find the similarity?
> >>
> >> @Julien - I believe using Behemoth would let me use the power of
> >> Hadoop while doing this; is that right?
> >>
> >> How is the same implemented in Solr?
> >>
> >>
> >>
> >> On Fri, Mar 30, 2012 at 12:15 AM, Julien Nioche
> >> <lists.digitalpebble@gmail.com> wrote:
> >>
> >>> One option would be to use Behemoth to convert the Nutch segments,
> >>> tokenize (e.g. with UIMA), then generate vectors for Mahout.
> >>>
> >>> see https://github.com/jnioche/behemoth
> >>>
> >>> Julien
> >>>
> >>>> On 29 March 2012 15:54, Lewis John Mcgibbney
> >>>> <lewis.mcgibbney@gmail.com> wrote:
> >>>
> >>>> I personally don't have a concrete answer; however, please have a
> >>>> look here for an experiment carried out by Sujit:
> >>>>
> >>>> http://sujitpal.blogspot.co.uk/2011/10/computing-document-similarity-using.html
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Mar 28, 2012 at 7:10 AM, Vijith <vijithkv.87@gmail.com> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I am using Nutch 1.4. How can I get the term frequency vector
> >>>>> corresponding to each document in the 'crawl' phase? That is, I am
> >>>>> not using Solr indexing.
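> >>>>>
> >>>>> (Something like this is what I am after, inside a parse filter,
> >>>>> where parse.getText() is the plain text Nutch extracted:)
> >>>>>
> >>>>>   Map<String, Integer> tf = new HashMap<String, Integer>();
> >>>>>   for (String term : parse.getText().toLowerCase().split("\\W+")) {
> >>>>>     if (term.length() == 0) continue;   // skip empty splits
> >>>>>     Integer count = tf.get(term);
> >>>>>     tf.put(term, count == null ? 1 : count + 1);
> >>>>>   }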
> >>>>>
> >>>>> --
> >>>>> Thanks & Regards
> >>>>> Vijith V
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lewis
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Open Source Solutions for Text Engineering
> >>>
> >>> http://digitalpebble.blogspot.com/
> >>> http://www.digitalpebble.com
> >>> http://twitter.com/digitalpebble
> >>>
> >>
> >>
> >>
> >> --
> >> Thanks & Regards
> >> Vijith V
> >>
> >>
> >>
> >
> >
> > --
> > Thanks & Regards
> > Vijith V
>
>


-- 
Thanks & Regards
Vijith V
