lucene-solr-user mailing list archives

From Amrit Sarkar <sarkaramr...@gmail.com>
Subject Re: solr 7.0.1: exception running post to crawl simple website
Date Fri, 13 Oct 2017 11:47:32 GMT
Kevin,

You are getting NPE at:

String type = rawContentType.split(";")[0]; //HERE - rawContentType is NULL

// related code

String rawContentType = conn.getContentType();

public String getContentType() {
    return getHeaderField("content-type");
}

HttpURLConnection conn = (HttpURLConnection) u.openConnection();

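getContentType() returns null when the response has no "content-type"
header, so the split() call above throws the NPE. Can you check that the
server behind your web page sets its response headers properly and that
they include the key "content-type"?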
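
If it helps, here is a minimal sketch of how you could check the header
from Java, together with a null-safe version of the parsing step. The
class name ContentTypeCheck and its methods are made up for illustration;
they are not part of SimplePostTool, and this is not a patch for it.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Solr: reports the content-type header of a URL.
class ContentTypeCheck {

    // Returns the media type without parameters (e.g. "text/html"),
    // or null when the server sent no content-type header at all.
    static String baseType(String rawContentType) {
        if (rawContentType == null) {
            return null;
        }
        int semi = rawContentType.indexOf(';');
        String type = semi >= 0 ? rawContentType.substring(0, semi) : rawContentType;
        return type.trim();
    }

    // Opens the connection the same way as the snippet above and reads the raw header.
    static String fetchContentType(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        try {
            return conn.getContentType(); // null if the header is missing
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContentTypeCheck <url>");
            return;
        }
        String type = baseType(fetchContentType(args[0]));
        if (type == null) {
            System.out.println("No content-type header in the response");
        } else {
            System.out.println("content-type: " + type);
        }
    }
}
```

Running it against http://quadra.franz.com:9091/index.md should tell you
whether the server sends the header at all.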


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Wed, Oct 11, 2017 at 9:08 PM, Kevin Layer <layer@franz.com> wrote:

> I want to use solr to index a markdown website.  The files
> are in native markdown, but they are served in HTML (by markserv).
>
> Here's what I did:
>
> docker run --name solr -d -p 8983:8983 -t solr
> docker exec -it --user=solr solr bin/solr create_core -c handbook
>
> Then, to crawl the site:
>
> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra.franz.com:9091/index.md -recursive 10 -delay 0 -filetypes md
> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra.franz.com:9091/index.md
> SimplePostTool version 5.0.0
> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
> Entering auto mode. Indexing pages with content-types corresponding to file endings md
> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
> Entering recursive mode, depth=10, delay=0s
> Entering crawl at level 0 (1 links total, 1 new)
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.solr.util.SimplePostTool$PageFetcher.readPageFromUrl(SimplePostTool.java:1138)
>         at org.apache.solr.util.SimplePostTool.webCrawl(SimplePostTool.java:603)
>         at org.apache.solr.util.SimplePostTool.postWebPages(SimplePostTool.java:563)
>         at org.apache.solr.util.SimplePostTool.doWebMode(SimplePostTool.java:365)
>         at org.apache.solr.util.SimplePostTool.execute(SimplePostTool.java:187)
>         at org.apache.solr.util.SimplePostTool.main(SimplePostTool.java:172)
> quadra[git:master]$
>
>
> Any ideas on what I did wrong?
>
> Thanks.
>
> Kevin
>
