lucene-solr-user mailing list archives

From Kay Kay <kaykay.uni...@yahoo.com>
Subject Re: Solr - DataImportHandler - Large Dataset results ?
Date Fri, 12 Dec 2008 22:15:57 GMT
I am using MySQL. I believe MySQL (since version 5) supports streaming.

On more about streaming: when the database driver supports streaming, can we
assume that the resultset iterator is forward-only?

If, say, the streaming batch size is 10K records and we are trying to retrieve a total
of 100K records, what exactly happens when the threshold is reached (say, once the
first 10K records have been retrieved)?

Is the previous batch of records thrown away and replaced in memory by the new batch?
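One quick way to see the forward-only, batch-at-a-time behavior is with Python's
sqlite3 and fetchmany() as a stand-in for a streaming JDBC driver (the batch size
is scaled down to 10 rows; this is only an illustration of the iteration pattern,
not of MySQL Connector/J itself):

```python
import sqlite3

# sqlite3 cursors are forward-only, and fetchmany() hands back one batch at
# a time. The cursor does not retain earlier batches -- only the rows your
# own code holds references to stay in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(i, f"name-{i}") for i in range(1, 101)],
)

cur = conn.execute("SELECT id, name FROM item ORDER BY id")
batches = 0
while True:
    batch = cur.fetchmany(10)  # analogous to one 10K streaming batch
    if not batch:
        break
    batches += 1  # index this batch, then drop the reference to it
print(batches)  # 100 rows in batches of 10 -> 10 batches
```

Each iteration replaces the local `batch` reference, so the previous batch becomes
garbage-collectable as soon as you stop pointing at it.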




--- On Fri, 12/12/08, Shalin Shekhar Mangar <shalinmangar@gmail.com> wrote:
From: Shalin Shekhar Mangar <shalinmangar@gmail.com>
Subject: Re: Solr - DataImportHandler - Large Dataset results ?
To: solr-user@lucene.apache.org
Date: Friday, December 12, 2008, 9:41 PM

DataImportHandler is designed to stream rows one by one to create Solr
documents. As long as your database driver supports streaming, you should be
fine. Which database are you using?

On Sat, Dec 13, 2008 at 2:20 AM, Kay Kay <kaykay.unique@yahoo.com> wrote:

> As per the example in the wiki -
> http://wiki.apache.org/solr/DataImportHandler  - I am seeing the following
> fragment.
>
> <dataConfig>
>   <dataSource driver="org.hsqldb.jdbcDriver"
>               url="jdbc:hsqldb:/temp/example/ex" user="sa" />
>   <document name="products">
>     <entity name="item" query="select * from item">
>       <field column="ID" name="id" />
>       <field column="NAME" name="name" />
>         ......................
>     </entity>
>   </document>
> </dataConfig>
>
> My scaled-down application looks very similar along these lines, but
> my resultset is so big that it cannot fit within main memory by any chance.
>
> So I was planning to split this single query into multiple subqueries,
> with an additional conditional based on the id (id >= 0 and id < 100, say).
>
> I am curious if there is any way to specify another conditional clause
> (<splitData column="id" batch="10000" />, where the column is supposed to
> be an integer value) - and internally, the implementation could actually
> generate the subqueries:
>
> i) get the min and max of the numeric column, and send queries to the
> database based on the batch size
>
> ii) add documents for each batch and close the resultset
>
> This might end up putting more load on the database (but at least each
> batch would fit in main memory).
>
> Let me know if anyone else has run into similar issues and how this was
> addressed.
>
>
>
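For what it's worth, the windowed sub-query generation proposed in (i) above can be
sketched roughly like this (a hypothetical helper, not DataImportHandler API; the
table and column names are placeholders):

```python
def window_queries(table, column, lo, hi, batch):
    """Emit one sub-query per window of `batch` ids between lo and hi.

    Hypothetical sketch of the proposed <splitData> behavior: the caller
    would first fetch min/max of the numeric column, then run each window
    query on its own, keeping only one window's rows in memory at a time.
    """
    queries = []
    start = lo
    while start <= hi:
        end = min(start + batch - 1, hi)
        queries.append(
            f"SELECT * FROM {table} WHERE {column} BETWEEN {start} AND {end}"
        )
        start = end + 1
    return queries

for q in window_queries("item", "id", 1, 100000, 10000):
    print(q)  # ten windows: 1-10000, 10001-20000, ..., 90001-100000
```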




-- 
Regards,
Shalin Shekhar Mangar.



      