lucene-solr-user mailing list archives

From Amrit Sarkar <sarkaramr...@gmail.com>
Subject Re: solr 7.0.1: exception running post to crawl simple website
Date Fri, 13 Oct 2017 13:40:29 GMT
Reference to the code:

.....

String rawContentType = conn.getContentType();
String type = rawContentType.split(";")[0];
if(typeSupported(type) || "*".equals(fileTypes)) {
  String encoding = conn.getContentEncoding();

.....

protected boolean typeSupported(String type) {
  for(String key : mimeMap.keySet()) {
    if(mimeMap.get(key).equals(type)) {
      if(fileTypes.contains(key))
        return true;
    }
  }
  return false;
}

.....

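typeSupported() has a second check, against fileTypes. The type text/html
maps to the keys "html" and "htm", but you passed -filetypes md: your page
ends with .md (which is what you are indexing), not .html, so neither key
matches. With "html" added, hopefully this is no longer the issue.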
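
To make that concrete, here is a minimal standalone sketch of the two steps, not the actual SimplePostTool source. It assumes fileTypes is the comma-separated string given to -filetypes and uses only a subset of mimeMap. The class name FileTypeCheck, the baseType() helper, and the explicit fileTypes parameter are hypothetical and only there for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class FileTypeCheck {
  // Subset of the extension -> MIME type map quoted in this thread.
  static final Map<String, String> MIME_MAP = new HashMap<>();
  static {
    MIME_MAP.put("html", "text/html");
    MIME_MAP.put("htm", "text/html");
    MIME_MAP.put("txt", "text/plain");
    MIME_MAP.put("json", "application/json");
  }

  // Drop parameters such as ";charset=utf-8" from a Content-Type value.
  public static String baseType(String rawContentType) {
    return rawContentType.split(";")[0];
  }

  // Mirrors the quoted typeSupported(): the MIME type must belong to an
  // extension key, and that key must appear in the fileTypes string.
  // fileTypes is passed in here (assumption: it is a plain String).
  public static boolean typeSupported(String type, String fileTypes) {
    for (Map.Entry<String, String> e : MIME_MAP.entrySet()) {
      if (e.getValue().equals(type) && fileTypes.contains(e.getKey())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String type = baseType("text/html;charset=utf-8");
    System.out.println(type + " with md: " + typeSupported(type, "md"));
    System.out.println(type + " with md,html: " + typeSupported(type, "md,html"));
  }
}
```

In this sketch the first call returns false and the second returns true, which would match the "Skipping URL with unsupported type text/html" warning you get with -filetypes md.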

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2

On Fri, Oct 13, 2017 at 7:04 PM, Amrit Sarkar <sarkaramrit2@gmail.com>
wrote:

> Kevin,
>
> Just put "html" too and give it a shot. These are the types it is
> expecting:
>
> mimeMap = new HashMap<>();
> mimeMap.put("xml", "application/xml");
> mimeMap.put("csv", "text/csv");
> mimeMap.put("json", "application/json");
> mimeMap.put("jsonl", "application/json");
> mimeMap.put("pdf", "application/pdf");
> mimeMap.put("rtf", "text/rtf");
> mimeMap.put("html", "text/html");
> mimeMap.put("htm", "text/html");
> mimeMap.put("doc", "application/msword");
> mimeMap.put("docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
> mimeMap.put("ppt", "application/vnd.ms-powerpoint");
> mimeMap.put("pptx", "application/vnd.openxmlformats-officedocument.presentationml.presentation");
> mimeMap.put("xls", "application/vnd.ms-excel");
> mimeMap.put("xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
> mimeMap.put("odt", "application/vnd.oasis.opendocument.text");
> mimeMap.put("ott", "application/vnd.oasis.opendocument.text");
> mimeMap.put("odp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("otp", "application/vnd.oasis.opendocument.presentation");
> mimeMap.put("ods", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("ots", "application/vnd.oasis.opendocument.spreadsheet");
> mimeMap.put("txt", "text/plain");
> mimeMap.put("log", "text/plain");
>
> The keys are the types supported.
>
> On Fri, Oct 13, 2017 at 6:56 PM, Amrit Sarkar <sarkaramrit2@gmail.com>
> wrote:
>
>> Ah!
>>
>> Only supported type is: text/html; encoding=utf-8
>>
>> I am not confident of this either :) but this should work.
>>
>> See the code-snippet below:
>>
>> ......
>>
>> if(res.httpStatus == 200) {
>>   // Raw content type of form "text/html; encoding=utf-8"
>>   String rawContentType = conn.getContentType();
>>   String type = rawContentType.split(";")[0];
>>   if(typeSupported(type) || "*".equals(fileTypes)) {
>>     String encoding = conn.getContentEncoding();
>>
>> ....
>>
>> On Fri, Oct 13, 2017 at 6:51 PM, Kevin Layer <layer@franz.com> wrote:
>>
>>> Amrit Sarkar wrote:
>>>
>>> >> Strange,
>>> >>
>>> >> Can you add: "text/html;charset=utf-8". This is wiki.apache.org page's
>>> >> Content-Type. Let's see what it says now.
>>>
>>> Same thing.  Verified Content-Type:
>>>
>>> quadra[git:master]$ wget -S -O /dev/null http://quadra:9091/index.md |& grep Content-Type
>>>   Content-Type: text/html;charset=utf-8
>>> quadra[git:master]$ ]
>>>
>>> quadra[git:master]$ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>>> /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>>> SimplePostTool version 5.0.0
>>> Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>>> Entering auto mode. Indexing pages with content-types corresponding to file endings md
>>> SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>>> Entering recursive mode, depth=10, delay=0s
>>> Entering crawl at level 0 (1 links total, 1 new)
>>> SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>>> 0 web pages indexed.
>>> COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>>> Time spent: 0:00:00.531
>>> quadra[git:master]$
>>>
>>> Kevin
>>>
>>> >>
>>> >> On Fri, Oct 13, 2017 at 6:44 PM, Kevin Layer <layer@franz.com> wrote:
>>> >>
>>> >> > OK, so I hacked markserv to add Content-Type text/html, but now I get
>>> >> >
>>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> >> >
>>> >> > What is it expecting?
>>> >> >
>>> >> > $ docker exec -it --user=solr solr bin/post -c handbook http://quadra:9091/index.md -recursive 10 -delay 0 -filetypes md
>>> >> > /docker-java-home/jre/bin/java -classpath /opt/solr/dist/solr-core-7.0.1.jar -Dauto=yes -Drecursive=10 -Ddelay=0 -Dfiletypes=md -Dc=handbook -Ddata=web org.apache.solr.util.SimplePostTool http://quadra:9091/index.md
>>> >> > SimplePostTool version 5.0.0
>>> >> > Posting web pages to Solr url http://localhost:8983/solr/handbook/update/extract
>>> >> > Entering auto mode. Indexing pages with content-types corresponding to file endings md
>>> >> > SimplePostTool: WARNING: Never crawl an external web site faster than every 10 seconds, your IP will probably be blocked
>>> >> > Entering recursive mode, depth=10, delay=0s
>>> >> > Entering crawl at level 0 (1 links total, 1 new)
>>> >> > SimplePostTool: WARNING: Skipping URL with unsupported type text/html
>>> >> > SimplePostTool: WARNING: The URL http://quadra:9091/index.md returned a HTTP result status of 415
>>> >> > 0 web pages indexed.
>>> >> > COMMITting Solr index changes to http://localhost:8983/solr/handbook/update/extract...
>>> >> > Time spent: 0:00:03.882
>>> >> > $
>>> >> >
>>> >> > Thanks.
>>> >> >
>>> >> > Kevin
>>> >> >
>>>
>>
>>
>
