lucene-dev mailing list archives

From "Jason Gerlowski (JIRA)" <>
Subject [jira] [Commented] (SOLR-11640) QuickStart Tutorial indexes post.jar, other unexpected files
Date Mon, 13 Nov 2017 18:01:00 GMT


Jason Gerlowski commented on SOLR-11640:

It's possible this is a "feature" and not a "bug".  If that's the case, maybe we should clarify
the "File endings considered are...." message output by SimplePostTool, as it implies a whitelist
of file extensions.

> QuickStart Tutorial indexes post.jar, other unexpected files
> ------------------------------------------------------------
>                 Key: SOLR-11640
>                 URL:
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: documentation, scripts and tools
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Priority: Trivial
> Currently, the QuickStart tutorial included in the ref guide involves running the following command to index some example documents: {{bin/post -c techproducts example/exampledocs/*}}
> This ends up attempting to index _all_ the files in that directory, which includes the expected example files, but also a bash script called {{}} and the {{post.jar}} JAR file itself.
> The subsequent tutorial step involves searching results, which can bring up ugly results like this:
> {code}
>       {
>         "id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
>         "resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
>         "content_type":["application/java-archive"],
>         "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n   \n  META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant 1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle Corporati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n \n\n \n  \n  org/apache/solr/util/RTimer$1.class \n  package  org.apache.solr.util;\n synchronized   class  RTimer$1 {\n}\n \n\n \n  \n  org/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n  package  org.apache.solr.util;\n synchronized   class  RTimer$NanoTimeTimerImpl  implements  RTimer$TimerImpl {\n     private  long  start ;\n     private  void RTimer$NanoTimeTimerImpl();\n     public  void  start ();\n     public  double  elapsed ();\n}\n \n\n \n  \n  org/apache/solr/util/RTimer$TimerImpl.class \n  package  org.apache.solr.util;\n public   abstract   interface  RTimer$TimerImpl {\n     public   abstract  void  start ();\n     public   abstract  double  elapsed ();\n}\n \n\n \n  \n  org/apache/solr/util/RTimer.class \n  package  org.apache.solr.util;\n public   synchronized   class  RTimer {\n     public   static   final  int  STARTED  = 0;\n     public   static   final  int  STOPPED  = 1;\n     public   static   final  int  PAUSED  = 2;\n     protected  int  s  ......[remaining code skipped for brevity]........"],
>         "_version_":1583971861929132032},
> {code}
> It's honestly pretty cool that Tika can extract code from our post.jar file.  It makes sense, but I didn't expect it.  But it's probably not what we intended to show to new users, especially considering that the bin/post invocation in the quick-start tutorial claims to be choosy about which file types it will index:
> {code}
> Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> {code}
> From a quick glance at things, it looks like {{bin/post}} does pass a list of permissible filetypes to the underlying {{SimplePostTool}}, but SimplePostTool doesn't follow this extension whitelist in the particular mode being invoked by the quickstart tutorial.  So this is probably a wider bug that the quickstart tutorial just happens to expose.
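
For illustration only, here is a minimal sketch of what honoring an extension whitelist could look like. It is not SimplePostTool's actual code: the class and method names are hypothetical, the file-type list is copied from the "File endings considered are" message quoted above, and the file names in main (including some_script.sh) are placeholders rather than a listing of the real exampledocs directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names come from SimplePostTool.
class ExtensionFilterSketch {

    // File-type list as printed by bin/post in auto mode (quoted in the issue).
    static final String DEFAULT_FILE_TYPES =
        "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,"
        + "odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log";

    // Turn a comma-separated list of extensions into a lowercase set.
    static Set<String> parseFileTypes(String csv) {
        return Arrays.stream(csv.split(","))
            .map(s -> s.trim().toLowerCase(Locale.ROOT))
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toCollection(HashSet::new));
    }

    // True only if the name has an extension and that extension is whitelisted.
    static boolean isAllowed(String fileName, Set<String> allowed) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension at all, or a trailing dot
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return allowed.contains(ext);
    }

    // Keep only the whitelisted names, preserving input order.
    static List<String> filter(List<String> fileNames, Set<String> allowed) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (isAllowed(name, allowed)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> allowed = parseFileTypes(DEFAULT_FILE_TYPES);
        // Placeholder file names for illustration.
        List<String> candidates =
            Arrays.asList("books.json", "monitor.xml", "post.jar", "some_script.sh");
        System.out.println(filter(candidates, allowed));
    }
}
```

With a check like this applied to every file matched by the shell glob, post.jar and the shell script would be skipped rather than handed to Tika. Whether the right fix is to filter in this mode or to reword the "File endings considered are" message remains the open question in this issue.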
