lucene-solr-user mailing list archives

From Sujay Bawaskar <sujaybawas...@gmail.com>
Subject Re: DIH with huge data
Date Thu, 12 Apr 2018 17:24:54 GMT
That sounds like a good option. So the Spark job will connect to MySQL and create
Solr documents, which are pushed into Solr using SolrJ, probably in batches.
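A minimal sketch of that batching logic (Python here just to show the shape; a
real job would run this per Spark partition, with a SolrJ or pysolr client in
place of the `send` callable, and the batch size of 1000 is only an illustrative
default):

```python
from itertools import islice

def batched(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def index_rows(rows, send, batch_size=1000):
    """Push rows to Solr in bounded batches; `send` stands in for the
    client's add() call. Returns the number of batches sent."""
    count = 0
    for batch in batched(rows, batch_size):
        send(batch)  # real job: solr_client.add(batch); commit once at the end
        count += 1
    return count
```

For ~100GB of source data, keeping each add bounded like this avoids building
one giant request and keeps memory flat on both the job and the Solr side.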

On Thu, Apr 12, 2018 at 10:48 PM, Rahul Singh <rahul.xavier.singh@gmail.com>
wrote:

> If you want speed, Spark is the fastest and easiest way. You can connect to
> relational tables directly and import, or export to CSV / JSON and import
> from a distributed filesystem like S3 or HDFS.
>
> Combining a DFS with Spark and a highly available Solr, you are
> maximizing all threads.
>
> --
> Rahul Singh
> rahul.singh@anant.us
>
> Anant Corporation
>
> On Apr 12, 2018, 1:10 PM -0400, Sujay Bawaskar <sujaybawaskar@gmail.com>,
> wrote:
> > Thanks Rahul. The data source is JdbcDataSource with a MySQL database. The
> > data size is around 100GB.
> > I am not very familiar with Spark, but are you suggesting that we should
> > create documents by merging distinct RDBMS tables using RDDs?
> >
> > On Thu, Apr 12, 2018 at 10:06 PM, Rahul Singh <
> rahul.xavier.singh@gmail.com
> > wrote:
> >
> > > How much data and what is the database source? Spark is probably the
> > > fastest way.
> > >
> > > --
> > > Rahul Singh
> > > rahul.singh@anant.us
> > >
> > > Anant Corporation
> > >
> > > On Apr 12, 2018, 7:28 AM -0400, Sujay Bawaskar <
> sujaybawaskar@gmail.com>,
> > > wrote:
> > > > Hi,
> > > >
> > > > We are using DIH with SortedMapBackedCache, but as the data size
> > > > increases we need to provide more heap memory to the Solr JVM.
> > > > Can we use multiple CSV files instead of database queries, and later
> > > > join the data in the CSV files using a zipper? So the bottom line is
> > > > to create a CSV file for each entity in data-config.xml and join
> > > > these CSV files using a zipper.
> > > > We also tried the EHCache-based DIH cache, but since EHCache uses
> > > > MMap IO it is not a good fit alongside MMapDirectoryFactory and ends
> > > > up exhausting physical memory on the machine.
> > > > Please suggest how we can handle the use case of importing a huge
> > > > amount of data into Solr.
> > > >
> > > > --
> > > > Thanks,
> > > > Sujay P Bawaskar
> > > > M:+91-77091 53669
> > >
> >
> >
> >
> > --
> > Thanks,
> > Sujay P Bawaskar
> > M:+91-77091 53669
>
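For reference, the zipper join asked about earlier looks roughly like the
sketch below in data-config.xml. Entity, table, and column names are
illustrative, and the key constraint is that both sides must emit rows sorted
on the join key, since the zipper streams the two result sets in lock step
instead of caching one in the heap:

```xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/db" user="user" password="pass"/>
  <document>
    <!-- parent entity: must be ordered by the join key -->
    <entity name="parent" query="SELECT id, name FROM parent ORDER BY id">
      <!-- child entity: join="zipper" merges the sorted streams without
           a SortedMapBackedCache, so heap stays flat -->
      <entity name="child" join="zipper" where="parent_id=parent.id"
              query="SELECT parent_id, detail FROM child ORDER BY parent_id"/>
    </entity>
  </document>
</dataConfig>
```

Whether the same join works over CSV files (e.g. via FileDataSource plus
LineEntityProcessor) would need to be verified; the zipper itself only
requires that both entities produce rows sorted on the join key.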



-- 
Thanks,
Sujay P Bawaskar
M:+91-77091 53669
