manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Extracting Content from Web Crawler using the new PipeLine
Date Thu, 23 Oct 2014 16:57:32 GMT
Looking at the SOLR patch, I have two concerns.  First, here's the
pertinent part of the patch:

>>>>>>
+          boilerpipe = "de.l3s.boilerpipe.extractors." + boilerpipe;
+          try {
+            ClassLoader loader = BoilerpipeExtractor.class.getClassLoader();
+            Class extractorClass = loader.loadClass(boilerpipe);
+
+            BoilerpipeExtractor boilerpipeExtractor = (BoilerpipeExtractor)extractorClass.newInstance();
+            BoilerpipeContentHandler boilerPipeContentHandler = new BoilerpipeContentHandler(parsingHandler, boilerpipeExtractor);
+
+            parsingHandler = (ContentHandler)boilerPipeContentHandler;
+          } catch (ClassNotFoundException e) {
+            log.warn("BoilerpipeExtractor " + boilerpipe + " not found!");
+          } catch (InstantiationException e) {
+            log.warn("Could not instantiate " + boilerpipe);
+          } catch (Exception e) {
+            log.warn(e.toString());
+          }
<<<<<<

The actual extractor in this patch must be specified (the "boilerpipe"
variable).  I imagine there are a number of different extractors, probably
for different kinds of XML/XHTML.  Am I right?  If so, how do you expect a
user to be able to select this, since most jobs crawl documents of multiple
types?

Secondly, the BoilerpipeContentHandler is just a SAX ContentHandler, which
basically implies that we'd be parsing XML somehow.  But we don't currently
do that in ManifoldCF for the Tika extractor; I believe the parsing occurs
inside Tika in that case.  If there's a way to configure Tika to use a
specific boilerpipe extractor, that would be the closest match to this kind
of functionality, I believe.  But in any case, this patch does NOT push tag
data into metadata fields -- there's no mechanism for that, unless Solr's
implementation of ContentHandler somehow does it.  Can you give examples of
input and output that you expect to see for this proposed functionality?
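
To make the question concrete, here is a minimal sketch of the kind of
ContentHandler that *would* push tag data into named fields. It is purely
illustrative: the class name TagCaptureSketch and its methods are
hypothetical, it uses only the JDK's SAX parser, and it is not code from
ManifoldCF, Tika, Solr, or the patch. It also assumes well-formed XHTML
input and captured elements that are not nested inside each other.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not ManifoldCF, Tika, or Solr code.
class TagCaptureSketch extends DefaultHandler {
  // Element names whose text should become field values.
  private final Set<String> captured;
  // Field name (element name) -> captured text values, in document order.
  private final Map<String, List<String>> fields = new HashMap<>();
  // Text that is not inside a captured element.
  private final StringBuilder body = new StringBuilder();
  // Non-null only while we are inside a captured element.
  private StringBuilder current;

  TagCaptureSketch(Set<String> captured) {
    this.captured = captured;
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    // The default JDK parser is not namespace-aware, so qName is the tag name.
    if (captured.contains(qName)) {
      current = new StringBuilder();
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // Assumption for this sketch: captured text is diverted away from the body.
    if (current != null) {
      current.append(ch, start, length);
    } else {
      body.append(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    if (current != null && captured.contains(qName)) {
      fields.computeIfAbsent(qName, k -> new ArrayList<>()).add(current.toString().trim());
      current = null;
    }
  }

  Map<String, List<String>> fields() {
    return fields;
  }

  String body() {
    return body.toString().trim();
  }

  // Parses an XHTML string and returns the populated handler.
  static TagCaptureSketch parse(String xhtml, Set<String> captured) throws Exception {
    TagCaptureSketch handler = new TagCaptureSketch(captured);
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xhtml)), handler);
    return handler;
  }

  public static void main(String[] args) throws Exception {
    String xhtml = "<html><body><h1>Title</h1><p>Some text.</p><h2>Part one</h2></body></html>";
    TagCaptureSketch h = parse(xhtml, new HashSet<>(Arrays.asList("h1", "h2")));
    System.out.println(h.fields());
    System.out.println(h.body());
  }
}
```

In this sketch the h1 and h2 text ends up in a per-tag map and the rest
goes to the body. Something along those lines, stated as sample input and
expected output, is what I'd need in order to judge where it would fit.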

Karl


On Thu, Oct 23, 2014 at 11:57 AM, Arcadius Ahouansou <arcadius@menelic.com>
wrote:

> Hello Abe-San.
>
> Thank you for the response.
>
> The BoilerPipe library I was referring to helps to remove
> common/repetitive page components such as menu items, headings, footers etc
> from the crawled content.
>
> There is a Solr Patch at
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-3808
>
> That I have been using.
> Thought it would be good to have Manifold do this instead.
>
> It would also be interesting to have Manifold able to extract content of
> html tags such as div, h1,... like Solr.
>
> Thanks
> On 23 Oct 2014 07:03, "Shinichiro Abe" <shinichiro.abe.1@gmail.com> wrote:
>
>> Hi Arcadius,
>>
>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>> Yes, Tika extractor will remove tags in html
>> and send content and metadata to downstream pipeline/output connection.
>>
>> > - What about extracting specific HTML tags such as all h1 or h2 and map
>> >   them to a Solr field?
>> No, currently it can map only metadata extracted by Tika to Solr
>> fields.
>> For h1, h2, p tags etc., the Tika extractor doesn't capture them and
>> doesn't treat them as metadata.
>> Currently, to capture these tags and map them to fields,
>> we have to use Solr's ExtractingRequestHandler (CAPTURE_ELEMENTS param).
>>
>> Regards,
>> Shinichiro Abe
>>
>> On 2014/10/23, at 10:21, Arcadius Ahouansou <arcadius@menelic.com> wrote:
>>
>> >
>> > Hello.
>> >
>> > Given that we now have pipelines in ManifoldCF, how feasible is it to:
>> >
>> > - use Tika's BoilerPipe to get cleaner content from web sites?
>> > - What about extracting specific HTML tags such as all h1 or h2 and map
>> >   them to a Solr field?
>> >
>> > Thank you very much.
>> >
>> > Arcadius.
>> >
>>
>>
