incubator-connectors-dev mailing list archives

From Erlend Garåsen <e.f.gara...@usit.uio.no>
Subject Re: Filtering out unwanted content from HTML pages
Date Sun, 03 Apr 2011 20:11:27 GMT

I agree with you. I also discussed this with a colleague, and we decided 
to try to rewrite or extend some of the Tika classes in order to get 
this functionality. I'll notify the list if I manage to fix this, but it 
might take some time since we're not working with content enrichment yet.

Erlend

On 31.03.11 16.56, Karl Wright wrote:
> This is a good question.  I think we should carry this conversation
> forward on connectors-dev.
>
> My initial thought on this issue is that the functionality really
> belongs in Tika.  Tika is set up to extract and filter in exactly this
> way.  The only reason you'd want to do it in MCF is if it would change
> the links you might extract (or skip), and that seems to me less
> interesting.  How do you feel about it?
>
> Karl
>
> On Thu, Mar 31, 2011 at 10:41 AM, Erlend Garåsen
> <e.f.garasen@usit.uio.no>  wrote:
>>
>> All major commercial search engines ship with a web crawler that
>> lets you filter out unwanted content, such as certain HTML blocks,
>> comments etc. Would it be advisable to add such functionality to MCF? Or
>> would it be difficult to implement, since the idea behind the
>> ExtractingRequestHandler is to send binary files to Solr?
>>
>> Say that you have an HTML document which includes the following comments:
>> <!-- stop indexing -->
>> <!-- start indexing -->
>> All content between these two comments should then be excluded from the index.
>>
>> I managed to rewrite Apache Nutch in order to add this functionality
>> some months ago.
>>
>> Erlend
>>
>> --
>> Erlend Garåsen
>> Center for Information Technology Services
>> University of Oslo
>> P.O. Box 1086 Blindern, N-0317 OSLO, Norway
>> Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
>>
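
To make the stop/start comment idea from the original question concrete, here is a minimal sketch in plain Java. It is not the Nutch change mentioned above and it uses no Tika or MCF API; the class name IndexingCommentFilter and the method strip are hypothetical, and using a regular expression on the raw HTML string is an assumption made only for illustration. It also assumes that a stop marker with no matching start marker should drop the rest of the page.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika, MCF or Nutch.
final class IndexingCommentFilter {

    // Matches from a "stop indexing" comment up to and including the next
    // "start indexing" comment, or to the end of input if none follows
    // (assumption: an unterminated stop marker hides the rest of the page).
    private static final Pattern SKIPPED_BLOCK = Pattern.compile(
            "<!--\\s*stop\\s+indexing\\s*-->.*?(?:<!--\\s*start\\s+indexing\\s*-->|\\z)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private IndexingCommentFilter() {
    }

    // Returns the HTML with every skipped block, markers included, removed.
    static String strip(String html) {
        return SKIPPED_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Indexed text</p>\n"
                + "<!-- stop indexing -->\n"
                + "<div>Navigation that should not be indexed</div>\n"
                + "<!-- start indexing -->\n"
                + "<p>More indexed text</p>";
        System.out.println(strip(html));
    }
}
```

A filter like this would have to run on the HTML before it is handed to the text extractor, because the marker comments are normally gone once the markup has been parsed into plain text.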


-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
