lucene-solr-user mailing list archives

From Kamuela Lau <kamuela....@gmail.com>
Subject Re: DIH for TikaEntityProcessor
Date Fri, 12 Oct 2018 13:25:44 GMT
Glad to help :)

On Fri, Oct 12, 2018 at 9:10 PM Martin Frank Hansen (MHQ) <MHQ@kmd.dk> wrote:

> You sir just made my day!!!
>
> It worked!!! Thanks a million!
>
>
> Martin Frank Hansen,
>
> -----Original message-----
> From: Kamuela Lau <kamuela.lau@gmail.com>
> Sent: 12 October 2018 11:41
> To: solr-user@lucene.apache.org
> Subject: Re: DIH for TikaEntityProcessor
>
> Also, just wondering, have you tried to specify dataSource="bin" for
> read_file?
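>
> A minimal sketch of that suggestion, reusing the entity names from the
> data-config quoted below. The only change is the added dataSource
> attribute on read_file. Whether this resolves the ClassCastException in a
> given setup is an assumption to verify; the reasoning is that
> TikaEntityProcessor expects a binary (InputStream) source rather than a
> Reader-based one:
>
> ```xml
> <!-- Sketch only: the inner entity names the binary data source explicitly -->
> <entity name="read_file" processor="TikaEntityProcessor"
>         dataSource="bin" url="${files.fileAbsolutePath}">
>   <field column="text" name="text"/>
> </entity>
> ```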
>
> On Fri, Oct 12, 2018 at 6:38 PM Kamuela Lau <kamuela.lau@gmail.com> wrote:
>
> > Hi,
> >
> > I was unable to reproduce the error that you got with the information
> > provided.
> > Below are the data-config.xml and managed-schema fields I used; the
> > data-config is mostly the same (I think that FileListEntityProcessor
> > doesn't actually require a dataSource, so I think it's safe to put
> > dataSource="null" on the files entity):
> >
> > <dataConfig>
> >   <dataSource name="bin" type="BinFileDataSource"/>
> >   <document>
> >       <entity name="files" processor="FileListEntityProcessor"
> >               baseDir="/path/to/sampleData" fileName=".*doc" recursive="true"
> >               rootEntity="false" dataSource="bin" onError="skip">
> >         <field column="fileAbsolutePath" name="id"/>
> >         <entity name="read_file" processor="TikaEntityProcessor"
> >                 url="${files.fileAbsolutePath}">
> >           <field column="text" name="text"/>
> >         </entity>
> >       </entity>
> >   </document>
> > </dataConfig>
> >
> > And from the managed schema:
> >     <field name="id" type="string" indexed="true" stored="true"
> >            required="true" multiValued="false" />
> >     <!-- docValues are enabled by default for long type so we don't
> >          need to index the version field -->
> >     <field name="_version_" type="plong" indexed="false" stored="false"/>
> >     <field name="_root_" type="string" indexed="true" stored="false"
> >            docValues="false" />
> >     <field name="text" type="text_general" indexed="true" stored="true"
> >            multiValued="true"/>
> >
> > When I had field column="text" name="content", the documents were
> > still indexed, but the text/content was not (as I had no content field
> > in the schema).
> > I used the default config, and Solr version 7.5.0; I was able to
> > import the data just fine (I also tested with .*DOC). Is there any
> > other information you can provide that can help me reproduce this error?
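> >
> > To illustrate the point about the content mapping: a minimal sketch of
> > a schema field that would give field column="text" name="content" a
> > place to land. The type and the indexed, stored, and multiValued
> > settings are assumptions that mirror the text field above; nothing in
> > this thread confirms them for the original setup.
> >
> > ```xml
> > <!-- Hypothetical managed-schema addition: target for name="content" -->
> > <field name="content" type="text_general" indexed="true" stored="true"
> >        multiValued="true"/>
> > ```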
> >
> >
> >
> >
> > On Fri, Oct 12, 2018 at 4:11 PM Martin Frank Hansen (MHQ) <MHQ@kmd.dk>
> > wrote:
> >
> >> Hi again,
> >>
> >>
> >>
> >> Can anybody help me? Any suggestions as to why I am getting the error
> >> below?
> >>
> >>
> >>
> >>
> >>
> >> *Martin Frank Hansen*, Senior Data Analyst
> >>
> >> Data, IM & Analytics
> >>
> >> Lautrupparken 40-42, DK-2750 Ballerup E-mail mhq@kmd.dk  Web
> >> www.kmd.dk Mobile +4525571418
> >>
> >>
> >>
> >> *From:* Martin Frank Hansen (MHQ)
> >> *Sent:* 10 October 2018 10:15
> >> *To:* solr-user <solr-user@lucene.apache.org>
> >> *Subject:* DIH for TikaEntityProcessor
> >>
> >>
> >>
> >> Hi,
> >>
> >>
> >>
> >> I am trying to read documents from a file system into Solr using the
> >> DataImportHandler, but I keep getting the following errors:
> >>
> >>
> >>
> >> Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
> >>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
> >>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
> >>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> >>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> >>     at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
> >>     at java.lang.Thread.run(Thread.java:748)
> >> Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
> >>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
> >>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
> >>     ... 9 more
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
> >>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
> >>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> >>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> >>     at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
> >>     at java.lang.Thread.run(Thread.java:748)
> >> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
> >>     ... 4 more
> >> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
> >>     at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
> >>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:270)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
> >>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
> >>     ... 6 more
> >> Caused by: java.lang.ClassCastException: java.io.InputStreamReader cannot be cast to java.io.InputStream
> >>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:132)
> >>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
> >>     ... 9 more
> >>
> >>
> >>
> >>
> >>
> >> My data-config file looks as follows:
> >>
> >>
> >>
> >> <dataConfig>
> >>   <dataSource name="bin" type="BinFileDataSource" />
> >>   <document>
> >>       <entity name="files" processor="FileListEntityProcessor"
> >>               baseDir="D:/CAPTIA/docs/19107" fileName=".*DOC" recursive="true"
> >>               rootEntity="false " dataSource="bin" onError="skip">
> >>         <field column="fileAbsolutePath" name="id" />
> >>         <entity name="read_file" processor="TikaEntityProcessor"
> >>                 url="${files.fileAbsolutePath}">
> >>           <field column="text" name="content" />
> >>         </entity>
> >>       </entity>
> >>   </document>
> >> </dataConfig>
> >>
> >>
> >>
> >> And in the Schema I basically have two fields:
> >>
> >>
> >>
> >> <field name="Id" type="string" indexed="true" stored="true"
> >>        required="true" multiValued="false"/>
> >> <field name="text" type="text_general" indexed="true" stored="false"
> >>        multiValued="true"/>
> >>
> >>
> >>
> >> Any help is appreciated.
> >>
> >>
> >>
> >>
> >>
> >> *Martin Frank Hansen*
> >>
> >>
> >>
> >>
> >
>
