lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Indexing logs files of thousands of GBs
Date Wed, 23 Oct 2013 17:52:30 GMT
As a supplement to what Chris said, if you can
partition the walking amongst a number of clients
you can also parallelize the indexing. If you're using
SolrCloud 4.5+, there are also some nice optimizations
in SolrCloud to keep intra-shard routing to a minimum.

FWIW,
Erick
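
One way to partition the walk amongst several clients, as suggested above, is to have each client claim a disjoint set of top-level subdirectories. The sketch below is only an illustration: the class `WalkPartitioner`, its method names, and the hash-modulo scheme are assumptions and not part of Solr or SolrJ. It also assumes Java 8 or later (for `Math.floorMod`) and that the logs sit in subdirectories under a common base directory. Files directly under the base directory are not assigned to any client here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits top-level subdirectories among N indexing clients. */
final class WalkPartitioner {

    /** True if the client with the given 0-based index owns this directory name. */
    static boolean ownedBy(String dirName, int clientIndex, int clientCount) {
        // String.hashCode() is specified by the language, so every client
        // computes the same owner for the same name. floorMod avoids
        // negative results for negative hash codes.
        return Math.floorMod(dirName.hashCode(), clientCount) == clientIndex;
    }

    /** Lists the immediate subdirectories of root that this client should walk. */
    static List<Path> assignedSubdirs(Path root, int clientIndex, int clientCount)
            throws IOException {
        List<Path> mine = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(root)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)
                        && ownedBy(entry.getFileName().toString(), clientIndex, clientCount)) {
                    mine.add(entry);
                }
            }
        }
        return mine;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: WalkPartitioner <baseDir> <clientIndex> <clientCount>");
            return;
        }
        Path root = Paths.get(args[0]);
        int index = Integer.parseInt(args[1]);
        int count = Integer.parseInt(args[2]);
        for (Path dir : assignedSubdirs(root, index, count)) {
            System.out.println(dir); // each of these would then be walked and indexed
        }
    }
}
```

Each client would be started with the same base directory and client count but a different client index, so that no two clients walk the same subdirectory.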


On Wed, Oct 23, 2013 at 12:59 PM, Chris Geeringh <geeringh@gmail.com> wrote:

> Prerna,
>
> The FileListEntityProcessor has a terribly inefficient recursive method,
> which will be using up all your heap building a list of files.
>
> I would suggest writing a client application that traverses your filesystem
> with the NIO API available in Java 7: Files.walkFileTree() and a FileVisitor.
>
> As you "walk", post the documents up to the server with SolrJ.
>
> Cheers,
> Chris
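
A minimal sketch of the walk-and-post approach Chris describes, using `Files.walkFileTree()` with a `SimpleFileVisitor`. The class `BatchingLogWalker`, the method `walkInBatches`, and the batch size are hypothetical names chosen for illustration. The `sink` callback stands in for the SolrJ posting step, which is deliberately left out so the example depends only on the JDK. It assumes Java 8 or later for `java.util.function.Consumer`.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: walks a tree and hands files to a sink in batches. */
final class BatchingLogWalker {

    /**
     * Visits every regular file under root and passes them to sink in lists
     * of at most batchSize paths. Only one batch is held in memory at a
     * time, instead of a list of every file in the tree.
     */
    static void walkInBatches(Path root, final int batchSize,
                              final Consumer<List<Path>> sink) throws IOException {
        final List<Path> batch = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    batch.add(file);
                    if (batch.size() >= batchSize) {
                        sink.accept(new ArrayList<>(batch)); // hand off a copy
                        batch.clear();
                    }
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException e) {
                // Skip unreadable entries and keep walking.
                return FileVisitResult.CONTINUE;
            }
        });
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the final partial batch
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: BatchingLogWalker <baseDir>");
            return;
        }
        // A real client would build Solr documents from each batch and send
        // them with SolrJ here; this sketch only reports the batch size.
        walkInBatches(Paths.get(args[0]), 1000,
                batch -> System.out.println("would index " + batch.size() + " files"));
    }
}
```

Handing off fixed-size batches keeps memory use bounded regardless of how many files the tree contains, which is the point of walking and posting incrementally.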
>
>
> On 22 October 2013 18:58, keshari.prerna <keshari.prerna@gmail.com> wrote:
>
> > Hello,
> >
> > I am trying to index log files (all text data) stored in the file system.
> > The data can be as large as 1000 GB or more. I am working on Windows.
> >
> > A sample file can be found at
> > https://www.dropbox.com/s/mslwwnme6om38b5/batkid.glnxa64.66441
> >
> > I tried using FileListEntityProcessor with TikaEntityProcessor, which ended
> > in a Java heap exception that I couldn't get rid of no matter how much I
> > increased the RAM size.
> >
> > data-config.xml
> >
> > <dataConfig>
> >     <dataSource name="bin" type="FileDataSource" />
> >     <document>
> >         <entity name="f" dataSource="null" rootEntity="true"
> >             processor="FileListEntityProcessor"
> >             transformer="TemplateTransformer"
> >             baseDir="//mathworks/devel/bat/A/logs/66048/"
> >             fileName=".*\.*" onError="skip" recursive="true">
> >
> >             <field column="fileAbsolutePath" name="path" />
> >             <field column="fileSize" name="size"/>
> >             <field column="fileLastModified" name="lastmodified" />
> >
> >             <entity name="file" dataSource="bin"
> >                 processor="TikaEntityProcessor" url="${f.fileAbsolutePath}"
> >                 format="text" onError="skip" transformer="TemplateTransformer"
> >                 rootEntity="true">
> >                 <field column="text" name="text"/>
> >             </entity>
> >         </entity>
> >     </document>
> > </dataConfig>
> >
> > Then I used FileListEntityProcessor with LineEntityProcessor, which had
> > still not finished indexing after 40 hours or so.
> >
> > data-config.xml
> >
> > <dataConfig>
> >     <dataSource name="bin" type="FileDataSource" />
> >     <document>
> >         <entity name="f" dataSource="null" rootEntity="true"
> >             processor="FileListEntityProcessor"
> >             transformer="TemplateTransformer"
> >             baseDir="//mathworks/devel/bat/A/logs/"
> >             fileName=".*\.*" onError="skip" recursive="true">
> >
> >             <field column="fileAbsolutePath" name="path" />
> >             <field column="fileSize" name="size"/>
> >             <field column="fileLastModified" name="lastmodified" />
> >
> >             <entity name="file" dataSource="bin"
> >                 processor="LineEntityProcessor" url="${f.fileAbsolutePath}"
> >                 format="text" onError="skip"
> >                 rootEntity="true">
> >                 <field column="content" name="rawLine"/>
> >             </entity>
> >         </entity>
> >     </document>
> > </dataConfig>
> >
> > Is there any way I can use post.jar to index text files recursively? Or is
> > there any other way that works without a Java heap exception and doesn't
> > take days to index?
> >
> > I am completely stuck here. Any help would be greatly appreciated.
> >
> > Thanks,
> > Prerna
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Indexing-logs-files-of-thousands-of-GBs-tp4097073.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
