incubator-blur-dev mailing list archives

From Aaron McCurry <amccu...@gmail.com>
Subject Re: Max records
Date Fri, 03 May 2013 19:22:22 GMT
Well, it only uses the temp index approach when the limit is hit.
Otherwise it buffers the records in memory and then indexes them per row.

Aaron
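
To illustrate the two paths described above, here is a minimal sketch in plain Java. It is not the BlurReducer code: `RowIndexingSketch`, `FakeIndex`, `maxBufferedRecords`, and `spilledRows` are hypothetical names, and `FakeIndex` is an in-memory stand-in for what would be a Lucene writer and index in a real build. Records for a row are buffered in memory; only when the buffer exceeds the limit does the row switch to a temporary index, which is merged into the main index once the row is complete.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "buffer in memory, spill to a temp index at the limit". */
class RowIndexingSketch {

    /** In-memory stand-in for an index; a real build would use a Lucene writer. */
    static class FakeIndex {
        final List<String> docs = new ArrayList<>();

        void add(String record) {
            docs.add(record);
        }

        void addAll(List<String> records) {
            docs.addAll(records);
        }

        /** Stand-in for merging a small finished index into this one. */
        void merge(FakeIndex other) {
            docs.addAll(other.docs);
        }
    }

    final int maxBufferedRecords;   // assumed analogue of a per-row record limit
    final FakeIndex mainIndex;
    final List<String> buffer = new ArrayList<>();
    FakeIndex tempIndex;            // non-null only after the limit has been hit
    int spilledRows = 0;            // counts rows that took the temp index path

    RowIndexingSketch(int maxBufferedRecords, FakeIndex mainIndex) {
        this.maxBufferedRecords = maxBufferedRecords;
        this.mainIndex = mainIndex;
    }

    /** Adds one record of the current row. */
    void addRecord(String record) {
        if (tempIndex != null) {
            // The row already spilled: write straight to the temp index.
            tempIndex.add(record);
            return;
        }
        buffer.add(record);
        if (buffer.size() > maxBufferedRecords) {
            // Limit hit: move the buffered records into a new temp index
            // so the row no longer grows in memory.
            tempIndex = new FakeIndex();
            tempIndex.addAll(buffer);
            buffer.clear();
            spilledRows++;
        }
    }

    /** Called once the row is complete. */
    void finishRow() {
        if (tempIndex != null) {
            // Large row: merge the finished temp index into the main index.
            mainIndex.merge(tempIndex);
            tempIndex = null;
        } else {
            // Normal row: index the buffered records in one step.
            mainIndex.addAll(buffer);
            buffer.clear();
        }
    }
}
```

In this sketch a row of two records with a limit of two stays in the buffer, while a row of three records spills to the temp index; in both cases every record ends up in the main index.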


On Fri, May 3, 2013 at 3:14 PM, Tim Williams <williamstw@gmail.com> wrote:

> On Fri, May 3, 2013 at 11:05 AM, Aaron McCurry <amccurry@gmail.com> wrote:
> > Ok, so the better approach is to create a second new index and index the
> > entire row into that new small index.  Then once the row is complete, close
> > that new writer and index and merge it into the main index.  This allows us
> > to index everything and not run the reducer out of memory.
>
> So move to the temporary index approach as the way to do all the M/R
> builds vs just an exception for large rows?
>
> --tim
>
> > On Fri, May 3, 2013 at 10:59 AM, Tim Williams <williamstw@gmail.com> wrote:
> >
> >> Thanks, this helps.  I'm looking into patching the BlurReducer so that
> >> when a Row hits maxRecordsPerRow, it indexes what it can of a row - as
> >> opposed to dropping it completely.  What's a better approach? :)
> >>
> >> --tim
> >>
> >> On Fri, May 3, 2013 at 10:44 AM, Aaron McCurry <amccurry@gmail.com> wrote:
> >> > BlurTask._maxRecordCount
> >> >
> >> > This is used for testing, so that you can exit a mapper after N
> >> > records.
> >> >
> >> > BlurTask._maxRecordsPerRow
> >> >
> >> > Increasing this raises the number of records allowed in a single row.
> >> > Be careful with this option because it may run the reducer out of
> >> > memory. I have a patch that I can apply that removes this limit, but
> >> > for now it's still risky to increase this too much.
> >> >
> >> > BlurTask._ramBufferSizeMB
> >> >
> >> > This is the Lucene writer buffer; larger values normally increase
> >> > indexing throughput.
> >> >
> >> > Aaron
> >> >
> >> >
> >> > On Fri, May 3, 2013 at 10:30 AM, Tim Williams <williamstw@gmail.com> wrote:
> >> >
> >> >> I have an instance where I need to increase max records per row, but
> >> >> before I do I want to understand the relationship (if there is one)
> >> >> between:
> >> >>
> >> >> BlurTask._maxRecordCount
> >> >> BlurTask._maxRecordsPerRow
> >> >> BlurTask._ramBufferSizeMB
> >> >>
> >> >> I understand maxRecordsPerRow, but in looking into this I found I
> >> >> don't understand _maxRecordCount or what interplay might exist with
> >> >> the buffer size.
> >> >>
> >> >> Thanks,
> >> >> --tim
> >> >>
> >>
>
