lucene-solr-user mailing list archives

From Amrit Sarkar <sarkaramr...@gmail.com>
Subject Re: solr 7.0.1: exception running post to crawl simple website
Date Fri, 13 Oct 2017 13:26:34 GMT
Ah!

The only supported type is: text/html; encoding=utf-8

I'm not certain of this either :) but it should work.

See the code snippet below:

......

if(res.httpStatus == 200) {
  // Raw content type of form "text/html; encoding=utf-8"
  String rawContentType = conn.getContentType();
  String type = rawContentType.split(";")[0];
  if(typeSupported(type) || "*".equals(fileTypes)) {
    String encoding = conn.getContentEncoding();

....
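
To illustrate the split in the snippet above, here is a minimal standalone sketch (not the SimplePostTool code itself). The class name ContentTypeDemo and the method baseType are hypothetical, and the trim() call is my addition. It only shows that both header forms seen in this thread reduce to the same base type, "text/html", before the type check runs.

```java
public class ContentTypeDemo {

    // Mirrors the split in the quoted snippet: keep only the part before the first ';'.
    // The trim() call is an addition for this sketch, not part of the quoted code.
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    public static void main(String[] args) {
        // Header forms that appear in this thread
        System.out.println(baseType("text/html; encoding=utf-8"));
        System.out.println(baseType("text/html;charset=utf-8"));
        System.out.println(baseType("text/html"));
    }
}
```

All three calls print text/html, so the parameter after the semicolon should not change what the type check sees.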


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <layer@franz.com> wrote:

> Amrit Sarkar wrote:
>
> >> Strange,
> >>
> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
> >> Content-Type. Let's see what it says now.
>
> Same thing.  Verified Content-Type:
>
> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>   Content-Type: text/html;charset=utf-8
> quadra[git:master]$
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> 0 web pages indexed.
> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> Time spent: 0:00:00.531
> quadra[git:master]$
>
> Kevin
>
> >>
> >> Amrit Sarkar
> >> Search Engineer
> >> Lucidworks, Inc.
> >> 415-589-9269
> >> www.lucidworks.com
> >> Twitter http://twitter.com/lucidworks
> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> >>
> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <layer@franz.com> wrote:
> >>
> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
> >> >
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> >
> >> > What is it expecting?
> >> >
> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
> >> > SimplePostTool version 5.0.0
> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> >> > Entering recursive mode, depth=10, delay=0s
> >> > Entering crawl at level 0 (1 links total, 1 new)
> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
> >> > 0 web pages indexed.
> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
> >> > Time spent: 0:00:03.882
> >> > $
> >> >
> >> > Thanks.
> >> >
> >> > Kevin
> >> >
>
