lucene-solr-user mailing list archives

From Amrit Sarkar <sarkaramr...@gmail.com>
Subject Re: solr 7.0.1: exception running post to crawl simple website
Date Fri, 13 Oct 2017 13:34:29 GMT
Kevin,

Just add "html" to the -filetypes list as well and give it a shot. These are the types it expects:

mimeMap = new HashMap<>();
mimeMap.put("xml", "application/xml");
mimeMap.put("csv", "text/csv");
mimeMap.put("json", "application/json");
mimeMap.put("jsonl", "application/json");
mimeMap.put("pdf", "application/pdf");
mimeMap.put("rtf", "text/rtf");
mimeMap.put("html", "text/html");
mimeMap.put("htm", "text/html");
mimeMap.put("doc", "application/msword");
mimeMap.put("docx",
"application/vnd.openxmlformats-officedocument.wordprocessingml.document");
mimeMap.put("ppt", "application/vnd.ms-powerpoint");
mimeMap.put("pptx",
"application/vnd.openxmlformats-officedocument.presentationml.presentation");
mimeMap.put("xls", "application/vnd.ms-excel");
mimeMap.put("xlsx",
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
mimeMap.put("txt", "text/plain");
mimeMap.put("log", "text/plain");

The keys are the file types it supports.
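
To make the mismatch concrete, here is a minimal sketch of how I understand the check. It assumes the tool strips any parameters from the Content-Type header and then accepts the type only if one of the requested -filetypes keys maps to it. The class name MimeCheckSketch, the helper baseType, and the two-argument typeSupported are made up for illustration and are not the actual SimplePostTool code. The comma-separated handling of -filetypes is also an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the real SimplePostTool implementation.
class MimeCheckSketch {
    // A small subset of the extension -> MIME type map quoted above.
    static final Map<String, String> MIME_MAP = new HashMap<>();
    static {
        MIME_MAP.put("html", "text/html");
        MIME_MAP.put("htm", "text/html");
        MIME_MAP.put("txt", "text/plain");
        MIME_MAP.put("pdf", "application/pdf");
    }

    // Drop parameters such as ";charset=utf-8" from a raw Content-Type header.
    static String baseType(String rawContentType) {
        return rawContentType.split(";")[0].trim();
    }

    // Assumed rule: a MIME type is accepted only when some key in the map
    // has that MIME type as its value AND that key was requested.
    static boolean typeSupported(String type, String fileTypes) {
        List<String> requested = Arrays.asList(fileTypes.split(","));
        for (Map.Entry<String, String> e : MIME_MAP.entrySet()) {
            if (e.getValue().equals(type) && requested.contains(e.getKey())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String type = baseType("text/html;charset=utf-8");
        System.out.println(typeSupported(type, "md"));      // prints "false"
        System.out.println(typeSupported(type, "md,html")); // prints "true"
    }
}
```

Under that assumption, "md" alone can never match because it is not a key in the map, which would explain the "unsupported type text/html" warning.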


Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramrit2@gmail.com>
wrote:

> Ah!
>
> Only supported type is: text/html; encoding=utf-8
>
> I am not confident of this either :) but this should work.
>
> See the code-snippet below:
>
> ......
>
> if(res.httpStatus == 200) {
>   // Raw content type of form "text/html; encoding=utf-8"
>   String rawContentType = conn.getContentType();
>   String type = rawContentType.split(";")[0];
>   if(typeSupported(type) || "*".equals(fileTypes)) {
>     String encoding = conn.getContentEncoding();
>
> ....
>
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <layer@franz.com> wrote:
>
>> Amrit Sarkar wrote:
>>
>> >> Strange,
>> >>
>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>> >> Content-Type. Let's see what it says now.
>>
>> Same thing.  Verified Content-Type:
>>
>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |&
>> grep Content-Type
>>   Content-Type: text/html;charset=utf-8
>> quadra[git:master]$ ]
>>
>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook
>> http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
>> -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>> org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> SimplePostTool version 5.0.0
>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> Entering auto mode. Indexing pages with content-types corresponding to
>> file endings md
>> SimplePostTool: WARNING: Never crawl an external web site faster than
>> every 10 seconds, your IP will probably be blocked
>> Entering recursive mode, depth=10, delay=0s
>> Entering crawl at level 0 (1 links total, 1 new)
>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>> HTTP result status of 415
>> 0 web pages indexed.
>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> Time spent: 0:00:00.531
>> quadra[git:master]$
>>
>> Kevin
>>
>> >>
>> >> Amrit Sarkar
>> >> Search Engineer
>> >> Lucidworks, Inc.
>> >> 415-589-9269
>> >> www.lucidworks.com
>> >> Twitter http://twitter.com/lucidworks
>> >> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>> >>
>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <layer@franz.com> wrote:
>> >>
>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>> >> >
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> >
>> >> > What is it expecting?
>> >> >
>> >> > $ docker exec -it --user=solr solr bin/post -c handbook
>> >> > http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar
>> >> > -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web
>> >> > org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>> >> > SimplePostTool version 5.0.0
>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>> >> > Entering auto mode. Indexing pages with content-types corresponding to
>> >> > file endings md
>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than
>> >> > every 10 seconds, your IP will probably be blocked
>> >> > Entering recursive mode, depth=10, delay=0s
>> >> > Entering crawl at level 0 (1 links total, 1 new)
>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a
>> >> > HTTP result status of 415
>> >> > 0 web pages indexed.
>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>> >> > Time spent: 0:00:03.882
>> >> > $
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Kevin
>> >> >
>>
>
>
