lucene-solr-user mailing list archives

From Jorge Luis Betancourt González <jlbetanco...@uci.cu>
Subject Re: [MASSMAIL]Weighting of prominent text in HTML
Date Mon, 26 Jan 2015 05:47:58 GMT
Hi Dan:

Agreed, this question is more Nutch related than Solr ;)

Nutch doesn't send any data to the /update/extract request handler. All the text and metadata
extraction happens on the Nutch side rather than relying on the ExtractingRequestHandler provided
by Solr. Underneath, Nutch uses Tika, the same technology used by the ExtractingRequestHandler,
so there shouldn't be any great difference.

By default Nutch doesn't boost anything, as it is Solr's job to boost the different content in the
different fields, which is what happens when you run a query against Solr. Nutch calculates
LinkRank, which is a variation of the famous PageRank (or the OPIC score, another scoring
algorithm implemented in Nutch, which I believe is the default in Nutch 2.x). What you can do
is use the headings: map the heading tags into different fields and then apply a different
boost to each field.
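
As a rough illustration of query-time field boosting, here is a minimal sketch that builds the
parameters for a Solr edismax query, where `qf` lists the fields to search with a `^boost`
suffix on each. The field names (title, h1, h2, content) and the boost values are assumptions
for the example only; the real names depend on how the Nutch output is mapped into your schema.

```python
from urllib.parse import urlencode

# Hypothetical field names and boosts: adjust them to match your own Solr schema.
FIELD_BOOSTS = {"title": 5.0, "h1": 4.0, "h2": 2.0, "content": 1.0}


def build_query_params(user_query, boosts=FIELD_BOOSTS):
    """Return URL-encoded parameters for an edismax query with per-field boosts."""
    # edismax reads "qf" as a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost:g}" for field, boost in boosts.items())
    return urlencode({"q": user_query, "defType": "edismax", "qf": qf})


# The resulting string can be appended to a select URL of your Solr core.
print(build_query_params("heart disease"))
```

Tuning relevance then comes down to changing the numbers in the boost table, with no
reindexing and no plugin involved.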

The general idea with Nutch is to "make pieces of the web page" and store each piece in a
different field in Solr. Then you can tweak your relevance function using the values you see
fit, so you don't need to write any plugin to accomplish this (at least for the h1, h2, etc.
example you provided; if you want to extract other parts of the web page you'll need to write
your own plugin to do so).
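
To make the "pieces of the web page" idea concrete, here is a small standalone sketch that
splits a page into one text field per heading level plus a catch-all field. This is not Nutch
code (Nutch does this in its Java parsing plugins); it only uses Python's standard
html.parser, and the field names h1, h2, h3 and content are assumptions for the example.

```python
from html.parser import HTMLParser


class HeadingSplitter(HTMLParser):
    """Collect text into one bucket per heading tag; everything else goes to 'content'."""

    HEADINGS = ("h1", "h2", "h3")

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in self.HEADINGS + ("content",)}
        self._current = "content"

    def handle_starttag(self, tag, attrs):
        # Entering a heading: route the following text to that heading's bucket.
        if tag in self.HEADINGS:
            self._current = tag

    def handle_endtag(self, tag):
        # Leaving the heading: fall back to the catch-all bucket.
        if tag == self._current:
            self._current = "content"

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.fields[self._current].append(text)


def split_page(html):
    """Return a dict with one text field per piece, shaped like a document to index."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fields.items()}


doc = split_page("<h1>Asthma</h1><p>Overview of symptoms.</p><h2>Treatment</h2><p>Inhalers.</p>")
print(doc)
```

Each key of the resulting dict would become its own field in the index, which is what makes
per-field boosting possible at query time.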

Nutch is highly customizable: you can write a plugin for almost any piece of logic, from parsers
to indexers, by way of URL filters, scoring algorithms, protocols and a long, long list of others.
Usually the plugins are not so difficult to write; the hard part is knowing which extension
point you need to use, which comes with experience and a good dive into the source code.

Hope this helps,

----- Original Message -----
From: "Dan Davis" <dansmood@gmail.com>
To: "solr-user" <solr-user@lucene.apache.org>
Sent: Monday, January 26, 2015 12:08:13 AM
Subject: [MASSMAIL]Weighting of prominent text in HTML

By examining solr.log, I can see that Nutch is using the /update request
handler rather than /update/extract.   So, this may be a more appropriate
question for the nutch mailing list.   OTOH, y'all know the answer off the
top of your head.

Will Nutch boost text occurring in h1, h2, etc. more heavily than text in a
normal paragraph?    Can this weighting be tuned without writing a plugin?
   Is writing a plugin often needed because of the flexibility that is
needed in practice?

I wanted to call this post *Anatomy of a small scale search engine*, but
lacked the nerve ;)

Thanks, all and many,

Dan Davis, Systems/Applications Architect
National Library of Medicine

