From: David Ferrero
To: user@nutch.apache.org
Subject: Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
Date: Fri, 9 Feb 2018 00:31:10 -0700

Thank you for this information. Since this is closely related to Any23 and microdata parsing, I’m going to ask what I believe is a related question in this same thread, so everything stays organized in one place:

I noticed that a lot of job boards, such as dice.com, monster.com, etc., use http://schema.org/JobPosting information; however, many seem to use [...] rather than RDF.

In summer 2017, Google announced structured data guidance for jobs: https://developers.google.com/search/docs/data-types/job-posting and a testing tool to validate your HTML: https://search.google.com/structured-data/testing-tool

I verified a few sample listings from the above-mentioned job boards in Google’s testing tool and they validate OK.

So after looking at http://any23.apache.org/getting-started.html for the supported extractors, I see that Any23 mentions it supports JSON-LD input, so I added this to nutch-site.xml to override the same property in nutch-default.xml:

<property>
  <name>any23.extractors</name>
  <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
  <description>Comma-separated list of Any23 extractors (a list of extractors is available here: http://any23.apache.org/getting-started.html)</description>
</property>

I expected to see additional information from nutch parsechecker after adding the jsonld extractors; however, I see NO changes to the Any23-Triples microdata parsed.

What might I be doing wrong?

> On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney wrote:
>
> Hi David,
> Answers inline
>
> On Thu, Feb 8, 2018 at 9:19 AM, wrote:
>
>>
>> From: David Ferrero
>> To: user@nutch.apache.org
>> Cc:
>> Bcc:
>> Date: Thu, 8 Feb 2018 10:19:52 -0700
>> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
>> Pull request #205 was recently merged into the master branch for Nutch 1.x in
>> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
>>
>> I am new to nutch and solr and have just started crawling and indexing a
>> few select websites. Using the built-in html parsing/indexing, I am getting
>> searchable fields like url, content, host, sometimes a title, and a few
>> other indexing-related fields like digest, boost, segment, and tstamp. That
>> said, I realized very quickly that I need better results. While exploring
>> the source of the website, I noticed references to schema.org and got
>> excited by what I saw. That’s how I stumbled upon NUTCH-1129.
>>
>> I’ve built apache-nutch-1.15-SNAPSHOT, which includes the Any23 parser/indexer.
>>
>
> Excellent.
>
>
>>
>> Q: Now what? How do I gain Any23 microdata parsing / indexing
>> capabilities introduced by NUTCH-1129?
>> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
>> plugin.includes with something like parse-(html | tika |
>> any23)|index-(basic | anchor | any23)
>>
>
> No, you just add 'any23' to the list of plugins within the plugin.includes
> property of nutch-site.xml
>
>
>> Q: How do I expose the discovered microdata structure / items to end-user
>> such as Solr? For example, what are the microdata items and do I need to
>> map them to Solr in solrindex-mapping.xml?
>>
>
> OK, so the current configuration for the Any23 plugin is to store extracted
> structured data markup in the Nutch Metadata object with the key
> "Any23-Triples". You can locate it using something like the ParserChecker
> tool provided via the 'nutch' script. Likewise, you can also locate it, as a
> representation of what would be indexed, by using the IndexerChecker
> tooling also provided within the 'nutch' script.
>
> As an example, data is now indexed as follows (example
> after crawling https://smartive.ch/jobs):
>
>
> "structured_data": [
>   {
>     "node": "",
>     "value": "\"IE-edge,chrome=1\"@de",
>     "key": "",
>     "short_key": "X-UA-Compatible"
>   },
>   {
>     "node": "",
>     "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
>     "key": "",
>     "short_key": "description"
>   },
>   {
>     "node": "",
>     "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
>     "key": "",
>     "short_key": "viewport"
>   },
>   {
>     "node": "",
>     "value": "\"width=device-width,initial-scale=1\"@de",
>     "key": "",
>     "short_key": "viewport"
>   },
>   {
>     "node": "",
>     "value": "\"ie=edge\"@de",
>     "key": "",
>     "short_key": "x-ua-compatible"
>   }
> ],
>
>
> Note from the above that the 'predicate' key field is very useful for quickly
> filtering through, for example, hotel ratings or something similar.
>
>
>>
>> I’d also be interested to learn how to point at a specific URL and see how
>> nutch sees the microdata (best case), then learn how to leverage this into
>> nutch and finally into solr.
>>
>>
> See the tooling for ParserChecker and IndexerChecker as explained above.
> Any further questions, please let me know.
> Lewis
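
As a rough illustration of the filtering Lewis describes, here is a minimal Python sketch that picks entries out of a "structured_data" list shaped like the example in the quoted reply. The field names ("structured_data", "short_key", "value") come from that example; everything else is an assumption made for illustration: the helper names `filter_by_short_key` and `literal_text` are hypothetical, the sample values are shortened, matching is done case-insensitively on "short_key" (the full predicate in "key" is empty in the archived message), and the document is assumed to be available as plain JSON.

```python
import json
import re

# Sample shaped like the "structured_data" example in the quoted reply.
# Values are shortened; "node" and "key" are empty as in the archived message.
SAMPLE = json.loads(r'''
{
  "structured_data": [
    {"node": "", "value": "\"IE-edge,chrome=1\"@de", "key": "", "short_key": "X-UA-Compatible"},
    {"node": "", "value": "\"width=device-width,initial-scale=1\"@de", "key": "", "short_key": "viewport"},
    {"node": "", "value": "\"ie=edge\"@de", "key": "", "short_key": "x-ua-compatible"}
  ]
}
''')

# Matches a quoted literal with an optional language tag, e.g. "text"@de.
_LITERAL = re.compile(r'^"(?P<text>.*)"(?:@[A-Za-z0-9-]+)?$', re.DOTALL)


def literal_text(value):
    """Return the text inside a quoted literal, or the value unchanged."""
    match = _LITERAL.match(value)
    return match.group("text") if match else value


def filter_by_short_key(entries, short_key):
    """Return entries whose "short_key" equals short_key, ignoring case.

    Case-insensitive matching is a choice made for this sketch, not
    something the plugin is known to do.
    """
    wanted = short_key.casefold()
    return [e for e in entries if e.get("short_key", "").casefold() == wanted]


if __name__ == "__main__":
    for entry in filter_by_short_key(SAMPLE["structured_data"], "x-ua-compatible"):
        print(entry["short_key"], "->", literal_text(entry["value"]))
```

If the predicate URIs in "key" are populated in a real index, filtering on the full "key" rather than "short_key" would likely be more precise, since two vocabularies can share a short name.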