lucene-solr-user mailing list archives

From Zheng Lin Edwin Yeo <edwinye...@gmail.com>
Subject Re: Recursively scan documents for indexing in a folder in SolrJ
Date Mon, 19 Oct 2015 09:56:16 GMT
Yes, I've managed to "steal" some code from post.jar so that only
rich-text document formats are sent to /update/extract.
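
For anyone following along, here is a minimal sketch of the idea. It is not the actual post.jar code: the class name, the method names and the extension list are my own placeholders, and the list should be adjusted to whatever formats you really want to send to /update/extract. It only finds the files; the upload itself (for example with SolrJ's ContentStreamUpdateRequest) is left out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RichTextFileScanner {

    // Hypothetical extension list, for illustration only; not copied from post.jar.
    static final Set<String> RICH_TEXT_EXTENSIONS = new HashSet<>(Arrays.asList(
            "pdf", "doc", "docx", "ppt", "pptx", "xls", "xlsx", "rtf", "odt"));

    // Returns the lower-cased text after the last dot, or "" if there is no dot.
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True if the file's extension is in the rich-text list above.
    static boolean isRichText(Path file) {
        return RICH_TEXT_EXTENSIONS.contains(extensionOf(file));
    }

    // Walks the folder recursively and keeps only regular files with a rich-text extension.
    static List<Path> findRichTextFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(RichTextFileScanner::isRichText)
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : findRichTextFiles(root)) {
            // The SolrJ call that posts the file would go here.
            System.out.println("Would send to /update/extract: " + file);
        }
    }
}
```

Matching on the file extension is a simplification; it says nothing about the real content of the file, so a wrongly named file will still slip through.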

I've also changed the Eclipse setting at Window -> Preferences ->
General -> Workspace. Under "Text file encoding", select Other and choose
UTF-8. Eclipse is now able to read the Chinese characters successfully.
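
As a side note, one common way to end up with "???" is text passing through a charset that cannot represent it. The sketch below is only an illustration of that effect and of reading a file with an explicit UTF-8 charset instead of the platform default; I am not claiming it was the exact cause in my setup. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8ReadDemo {

    // Decodes the file's bytes as UTF-8 explicitly, instead of using the platform default charset.
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Encodes to US-ASCII and back; characters ASCII cannot represent are replaced with '?'.
    static String roundTripThroughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        // Two Chinese characters written as Unicode escapes, so the source file encoding does not matter.
        String chinese = "\u4e2d\u6587";
        Path file = Files.createTempFile("utf8-demo", ".txt");
        Files.write(file, chinese.getBytes(StandardCharsets.UTF_8));
        System.out.println("UTF-8 read matches original: " + readUtf8(file).equals(chinese));
        System.out.println("After ASCII round trip: " + roundTripThroughAscii(chinese));
        Files.delete(file);
    }
}
```

Passing the charset explicitly keeps the behaviour the same no matter what default encoding the IDE or the operating system happens to use.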

Thank you for your help.

Regards,
Edwin



On 19 October 2015 at 16:33, Duck Geraint (ext) GBJH <
geraint.duck@syngenta.com> wrote:

> "The problem with this is that it indexes all the files regardless of
> format, instead of just the formats handled by post.jar. So I guess I
> still have to "steal" some code from there to detect the file format?"
>
> If you've not worked it out yourself yet, try something like:
>
> http://docs.oracle.com/javase/7/docs/api/java/io/File.html#listFiles(java.io.FilenameFilter)
>
> http://stackoverflow.com/questions/5751335/using-file-listfiles-with-filenameextensionfilter
>
> Geraint
>
> Geraint Duck
> Data Scientist
> Toxicology and Health Sciences
> Syngenta UK
> Email: geraint.duck@syngenta.com
>
> -----Original Message-----
> From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com]
> Sent: 17 October 2015 00:55
> To: solr-user@lucene.apache.org
> Subject: Re: Recursively scan documents for indexing in a folder in SolrJ
>
> Thanks for your advice. I also found this method which so far has been
> able to traverse all the documents in the folder and index them in Solr.
>
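> // Requires: import java.io.File;
> public static void showFiles(File[] files) {
>     if (files == null) {
>         return; // listFiles() returns null if the directory cannot be read
>     }
>     for (File file : files) {
>         if (file.isDirectory()) {
>             System.out.println("Directory: " + file.getName());
>             showFiles(file.listFiles()); // Recurse into the subdirectory.
>         } else {
>             System.out.println("File: " + file.getName());
>         }
>     }
> }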
>
> The problem with this is that it indexes all the files regardless of
> format, instead of just the formats handled by post.jar. So I guess I
> still have to "steal" some code from there to detect the file format?
>
> As for files that contain non-English characters (e.g. Chinese
> characters), it is currently not able to read them, and they all come
> out as a series of "???". Any idea how to solve this problem?
>
> Thank you.
>
> Regards,
> Edwin
>
>
> On 16 October 2015 at 21:16, Duck Geraint (ext) GBJH <
> geraint.duck@syngenta.com> wrote:
>
> > Also, check this link for SolrJ example code (including the recursion):
> > https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
> >
> > Geraint
> >
> >
> > Geraint Duck
> > Data Scientist
> > Toxicology and Health Sciences
> > Syngenta UK
> > Email: geraint.duck@syngenta.com
> >
> > -----Original Message-----
> > From: Jan Høydahl [mailto:jan.asf@cominvent.com]
> > Sent: 16 October 2015 12:14
> > To: solr-user@lucene.apache.org
> > Subject: Re: Recursively scan documents for indexing in a folder in
> > SolrJ
> >
> > SolrJ does not have any file crawler built in.
> > But you are free to steal code from SimplePostTool.java related to
> > directory traversal, and then index each document found using SolrJ.
> >
> > Note that SimplePostTool.java tries to be smart about which endpoint to
> > post files to: XML, CSV and JSON content is posted to /update,
> > while office docs go to /update/extract.
> >
> > --
> > Jan Høydahl, search solution architect Cominvent AS -
> > www.cominvent.com
> >
> > > On 16 Oct 2015, at 05:22, Zheng Lin Edwin Yeo
> > > <edwinyeozl@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I understand that in SimplePostTool (post.jar), there is this
> > > command to automatically detect content types in a folder, and
> > > recursively scan it for documents for indexing into a collection:
> > > bin/post -c gettingstarted afolder/
> > >
> > > This has been useful for me for mass indexing of all the files
> > > in the folder. Now that I'm moving to production, I plan to use
> > > SolrJ to do the indexing, as it can do more things, like
> > > robustness checks and retries for documents that fail to index.
> > >
> > > However, I can't seem to find a way to do the same in SolrJ. Is it
> > > possible for this to be done in SolrJ? I'm using Solr 5.3.0.
> > >
> > > Thank you.
> > >
> > > Regards,
> > > Edwin
> >
> >
> > ________________________________
> >
> >
> > Syngenta Limited, Registered in England No 2710846;Registered Office :
> > Syngenta Limited, European Regional Centre, Priestley Road, Surrey
> > Research Park, Guildford, Surrey, GU2 7YH, United Kingdom
> > ________________________________
> > This message may contain confidential information. If you are not the
> > designated recipient, please notify the sender immediately, and delete
> > the original and any copies. Any use of the message by you is prohibited.
> >
