Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
References: <2428.1507736301@freon.franz.com> <7170.1507900472@freon.franz.com> <7689.1507900908@freon.franz.com>
From: Amrit Sarkar
Date: Fri, 13 Oct 2017 19:04:29 +0530
Subject: Re: solr 7.0.1: exception running post to crawl simple website
To: layer@franz.com
Cc: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="94eb2c0c3ad06cfb6f055b6db74e"
archived-at: Fri, 13 Oct 2017 13:56:56 -0000

--94eb2c0c3ad06cfb6f055b6db74e
Content-Type: text/plain; charset="UTF-8"

Kevin,

Just put "html" too and give it a shot. These are the types it is expecting:

mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf", "application/pdf");
mimeMap.put("rtf", "text/rtf");
mimeMap.put("html", "text/html");
mimeMap.put("htm", "text/html");
mimeMap.put("doc", "application/msword");
mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimeMap.put("ppt", "application/vnd.ms-powerpoint");
mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimeMap.put("xls", "application/vnd.ms-excel");
mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("txt", "text/plain");
mimeMap.put("log", "text/plain");

The keys are the types supported.

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar wrote:

> Ah!
>
> Only supported type is: text/html; encoding=utf-8
>
> I am not confident of this either :) but this should work.
>
> See the code-snippet below:
>
> ......
>
> if(res.httpStatus == 200) {
>   // Raw content type of form "text/html; encoding=utf-8"
>   String rawContentType = conn.getContentType();
>   String type = rawContentType.split(";")[0];
>   if(typeSupported(type) || "*".equals(fileTypes)) {
>     String encoding = conn.getContentEncoding();
>
> ....
>
> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Strange,
>> >>
>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> Content-Type. Let's see what it says now.
>>
>> Same thing. Verified Content-Type:
>>
>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>>   Content-Type: text/html;charset=utf-8
>> quadra[git:master]$ ]
>>
>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> SimplePostTool version 5.0.0
>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> Entering recursive mode, depth=10, delay=0s
>> Entering crawl at level 0 (1 links total, 1 new)
>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> 0 web pages indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> Time spent: 0:00:00.531
>> quadra[git:master]$
>>
>> Kevin
>>
>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer wrote:
>> >>
>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >
>> >> > What is it expecting?
>> >> >
>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> > SimplePostTool version 5.0.0
>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>> >> > Entering recursive mode, depth=10, delay=0s
>> >> > Entering crawl at level 0 (1 links total, 1 new)
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>> >> > 0 web pages indexed.
>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >> > Time spent: 0:00:03.882
>> >> > $
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Kevin

--94eb2c0c3ad06cfb6f055b6db74e--
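
For anyone reading this thread later, here is a minimal sketch of how the type check discussed above might behave. It is not the actual SimplePostTool source. The class name ContentTypeCheck, the helper baseType, and the two-argument typeSupported are hypothetical, and the matching rule (a content type is accepted only when one of the -filetypes extensions maps to it in mimeMap) is an assumption based on the snippets quoted in the thread. Only a subset of the mimeMap entries is reproduced.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; not the actual SimplePostTool source.
class ContentTypeCheck {
    // Subset of the extension -> MIME type map quoted in this thread.
    static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("xml", "application/xml");
        MIME_MAP.put("json", "application/json");
        MIME_MAP.put("pdf", "application/pdf");
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
    }

    // Strip parameters: "text/html;charset=utf-8" becomes "text/html".
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    // ASSUMPTION: a type is accepted when fileTypes is "*" or when some
    // extension listed in fileTypes maps to that type in MIME_MAP.
    static boolean typeSupported(String rawContentType, String fileTypes) {
        if ("*".equals(fileTypes)) {
            return true;
        }
        String type = baseType(rawContentType);
        Set<String> wanted = new HashSet<>(Arrays.asList(fileTypes.split(",")));
        for (Map.Entry<String, String> e : MIME_MAP.entrySet()) {
            if (e.getValue().equals(type) && wanted.contains(e.getKey())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // "md" has no entry in the map, so text/html is rejected.
        System.out.println(typeSupported("text/html;charset=utf-8", "md"));
        // Adding "html" lets the text/html response through.
        System.out.println(typeSupported("text/html;charset=utf-8", "md,html"));
    }
}
```

Under that assumption, -filetypes md alone would reject a text/html response, which matches the "Skipping URL with unsupported type text/html" warning, while -filetypes md,html would accept it.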