incubator-any23-dev mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Fwd: NUTCH-1129
Date Wed, 18 Apr 2012 09:10:56 GMT
Hi Guys,

A rather interesting discussion has emerged over on dev@nutch regarding
building the Any23 Nutch plugin [0]. Please see Sebastian Nagel's comments
below for the most recent contribution, which got me thinking more about
it today. I would advise reading over the short conversation before
reading on, as it's better in context :0)

The overwhelming majority of Nutch users build searchable Solr indexes from
the content they retrieve via Nutch, so we're looking to build a plugin
solution which does two tasks:
1) A Tika-wrapped Any23 parser plugin, enabling us to use the core Any23
parsers for extraction.
2) An HtmlIndexingFilter, enabling us to process the triples and get them
into a Solr index in a form that is easily searchable via fields.
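
As a rough illustration of the indexing step in 2), here is a minimal Python sketch (not Nutch or Any23 code) of the most naive mapping: each predicate's local name becomes a multi-valued field. The function names `local_name` and `flatten`, and the representation of triples as plain tuples, are assumptions made for illustration only.

```python
from collections import defaultdict


def local_name(iri):
    """Return the part of an IRI after its last '#' or '/'."""
    return iri.replace("#", "/").rsplit("/", 1)[-1]


def flatten(triples):
    """Collect object values into multi-valued fields keyed by predicate local name."""
    fields = defaultdict(list)
    for _subject, predicate, obj in triples:
        fields[local_name(predicate)].append(obj)
    return dict(fields)


# Triples as plain (subject, predicate, object) tuples, taken from the
# price example in the forwarded message.
triples = [
    ("#Offering_0635753498301", "#hasPriceSpecification", "#UnitPriceSpecification"),
    ("#UnitPriceSpecification", "#hasCurrencyValue", "249.99"),
    ("#UnitPriceSpecification", "#hasCurrency", "USD"),
]

fields = flatten(triples)
```

With two products on one page, both prices would land in the same `hasCurrencyValue` field with nothing linking each value to its product, which is part of why the mapping needs more thought.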

As we discussed, and as Sebastian graphically highlights below, this is not
clear cut, so I wanted to hear anyone's thoughts/input on building 2)
before I begin.

Thanks in advance

Lewis

[0] http://www.mail-archive.com/dev%40nutch.apache.org/msg07104.html

---------- Forwarded message ----------
From: Sebastian Nagel <wastl.nagel@googlemail.com>
Date: Tue, Apr 17, 2012 at 11:19 PM
Subject: Re: NUTCH-1129
To: dev@nutch.apache.org


>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.

Microdata is a good show-case for the Any23 plugin.

Another example would be semantic markup in shops.
Any23 already does a good job in extracting the semantic content:

 $ any23tools Rover \
  'http://www.shopforia.com/cgi-bin/apf4/apf4.cgi?Operation=ItemLookup&ItemId=B007P4VOWC'

The question is how to map triples to key-value pairs (NutchFields)
in a straightforward but configurable way.
The triples
 <#Offering_0635753498301> <#hasPriceSpecification> <#UnitPriceSpecification> .
 <#UnitPriceSpecification> <#hasCurrencyValue> "249.99"^^<#float> ;
       <#hasCurrency> "USD"^^<#string> ;
       ... .
and the pair
 price = 249.99 USD
are the same information. Nutch (or Solr etc.) requires the latter form
if you want to set up a shop search. But the conversion is not that simple
(maybe I'm wrong?):
 - information may be spread over several triples
 - there may be multiple products per document
  (same predicate for different subjects) => use sub-documents?

Sebastian


On 04/17/2012 08:05 PM, Lewis John Mcgibbney wrote:

> Hi Markus,
>
> On Tue, Apr 17, 2012 at 12:21 PM, Markus Jelsma
> <markus.jelsma@openindex.io> wrote:
>
>> You did indeed suggest that. However, if building a wrapper is fairly
>> straightforward then it may not be a bad idea. I haven't seen any hint of
>> Tika having Any23 on-board any time soon so we might have to wait a very
>> long time if we want to rely on Tika.
>>
>>
> Yeah +1. As I explained to Julien we are some way from thinking about
> integration into Tika and subsequently writing the parser implementation(s)
> for use within Tika.
>
>
>
>>
>> Well, we could easily use certain microdata key/value pairs in our results
>> to greatly improve search and navigation.
>>
>>
> Yeah, microdata is just one of a whole bunch of formats Any23 can handle.
> My reservation was about how to represent the many different formats in a
> way which would be easily navigable (is that a word?) within an index.
> There is obviously work to be done here from my side.
>
> Thanks
>
> Lewis
>
>



-- 
*Lewis*
