lucene-solr-user mailing list archives

From Rahul Singh <rahul.xavier.si...@gmail.com>
Subject Re: DIH with huge data
Date Thu, 12 Apr 2018 17:23:18 GMT

CSV -> Spark -> Solr

https://github.com/lucidworks/spark-solr/blob/master/docs/examples/csv.adoc

If speed is not the top priority, there are other options. Spring Batch / Spring Data may
have all the tools you need to get reasonable throughput without Spark.
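
As an illustration of the non-Spark route: the zipper-style join asked about in the
original question is a streaming merge of key-sorted inputs, so only one parent row and
its children are held in memory at a time. Below is a minimal sketch in Python, not DIH
itself. It assumes both CSV files are already sorted ascending by an integer join key;
the column names (`id`, `parent_id`, `children`) and the function name `zipper_join`
are hypothetical.

```python
import csv
import io


def zipper_join(parents, children, parent_key="id", child_key="parent_id",
                child_field="children", cast=int):
    """Merge-join two row streams that are both sorted ascending by key.

    Yields one dict per parent row with its matching child rows attached.
    Neither input is loaded fully into memory.
    """
    child_iter = iter(children)
    child = next(child_iter, None)
    for parent in parents:
        key = cast(parent[parent_key])
        doc = dict(parent)
        doc[child_field] = []
        # Drop orphan children whose key sorts before the current parent.
        while child is not None and cast(child[child_key]) < key:
            child = next(child_iter, None)
        # Attach every child that shares the current parent's key.
        while child is not None and cast(child[child_key]) == key:
            doc[child_field].append(child)
            child = next(child_iter, None)
        yield doc


# Small in-memory stand-ins for two exported, key-sorted CSV files.
parents_csv = io.StringIO("id,name\n1,apple\n2,banana\n10,cherry\n")
children_csv = io.StringIO("parent_id,tag\n1,red\n1,green\n10,dark\n")

for doc in zipper_join(csv.DictReader(parents_csv), csv.DictReader(children_csv)):
    print(doc["id"], doc["name"], [c["tag"] for c in doc["children"]])
```

With real exports, the two `io.StringIO` objects would be replaced by open file handles,
and each yielded dict would be sent to Solr in batches. The sort order of the files has
to match the comparison used by `cast`, otherwise matches are silently missed.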

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar <sujaybawaskar@gmail.com>, wrote:
> Thanks Rahul. The data source is JdbcDataSource with a MySQL database. The data size
> is around 100GB.
> I am not very familiar with Spark, but are you suggesting that we should
> create documents by merging distinct RDBMS tables using RDDs?
>
> On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <rahul.xavier.singh@gmail.com>
> wrote:
>
> > How much data and what is the database source? Spark is probably the
> > fastest way.
> >
> > --
> > Rahul Singh
> > rahul.singh@anant.us
> >
> > Anant Corporation
> >
> > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <sujaybawaskar@gmail.com>,
> > wrote:
> > > Hi,
> > >
> > > We are using DIH with SortedMapBackedCache, but as the data size increases we
> > > need to give the Solr JVM more heap memory.
> > > Can we use multiple CSV files instead of database queries, and later join the
> > > data in the CSV files using zipper? The bottom line is to create a CSV file
> > > for each entity in data-config.xml and join these CSV files using
> > > zipper.
> > > We also tried the EHCache-based DIH cache, but since EHCache uses MMap IO it is
> > > not a good fit with MMapDirectoryFactory and exhausts the physical
> > > memory on the machine.
> > > Please suggest how we can handle the use case of importing a huge amount of
> > > data into Solr.
> > >
> > > --
> > > Thanks,
> > > Sujay P Bawaskar
> > > M:+91-77091 53669
> >
>
>
>
> --
> Thanks,
> Sujay P Bawaskar
> M:+91-77091 53669
