hbase-user mailing list archives

From tim robertson <timrobertson...@gmail.com>
Subject Re: optimising loading of tab file
Date Wed, 22 Jul 2009 15:21:14 GMT
Below is a sample row (\N values are ignored in the Map), so I will try
the default of 2MB, which should buffer a good number of rows before flushing.

Thanks for your tips,

Tim

199798861	293	8107	8436	MNHNL	Recorder database	LUXNATFUND404573t
Pilophorus cinnamopterus (KIRSCHBAUM,1856)
\N	\N	\N	\N	\N	\N	\N	\N	\N	\N	49.61	6.13
\N	\N	\N	\N	\N	\N	\N	\N	\N	\N	\N
L. Reichling	Parc (Luxembourg)	1979	7	10	\N	\N	\N	\N
2009-02-20 04:19:51	2009-02-20 08:40:21	\N
199798861	293	8107	29773	1519409	11922838	1	21560621	9917520
\N	\N	\N	\N	\N	\N	\N	\N	\N
49.61	6.13	50226	61	186	1979	7	1979-07-10	0	0	0	2	\N	\N	\N	\N
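As a rough, illustrative sketch (not code from this thread) of the sizing arithmetic J-D suggests, the snippet below sums the bytes of the non-\N fields in a truncated version of the sample row above, then divides the 2MB default write buffer by that per-row payload to see roughly how many Puts would be buffered per flush. The field values are abbreviated from the sample, and the real per-Put overhead (row key, family, qualifier, timestamps) is ignored, so the true count per flush would be lower.

```java
public class BufferEstimate {
    public static void main(String[] args) {
        // Truncated version of the sample tab-delimited row above;
        // \N marks a null field, which the Map skips.
        String row = "199798861\t293\t8107\t8436\tMNHNL\t\\N\t49.61\t6.13";

        int bytes = 0;
        for (String field : row.split("\t", -1)) {
            if (!field.equals("\\N")) {      // \N fields are ignored
                bytes += field.getBytes().length;
            }
        }

        int bufferSize = 2 * 1024 * 1024;    // HBase default write buffer (2MB)
        System.out.println("payload bytes per row: " + bytes);                     // 34
        System.out.println("rows buffered per flush (approx): "
                + bufferSize / Math.max(bytes, 1));                                // 61680
    }
}
```

With HTable.setAutoFlush(false) and the default setWriteBufferSize, this estimate suggests the client would batch on the order of tens of thousands of these small Puts before each flush, which is why the buffer helps so much compared to flushing every Put.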


On Wed, Jul 22, 2009 at 5:13 PM, Jean-Daniel Cryans<jdcryans@apache.org> wrote:
> It really depends on the size of each Put. If 1 Put = 1MB, then a 2MB
> buffer (the default) won't be useful. A 1GB buffer (what you wrote)
> will likely OOME your client and, if not, your region servers in no
> time.
>
> So try with the default and then if it goes well you can try setting
> it higher. Do you know the size of each row?
>
> J-D
>
> On Wed, Jul 22, 2009 at 11:04 AM, tim
> robertson<timrobertson100@gmail.com> wrote:
>> Could you suggest a sensible write buffer size please?
>>
>> 1024x1024x1024 bytes?
>>
>> Cheers
>>
>>
>>
>>
>>
>> On Wed, Jul 22, 2009 at 4:41 PM, tim robertson<timrobertson100@gmail.com> wrote:
>>> Thanks J-D
>>>
>>> I will try this now.
>>>
>>> On Wed, Jul 22, 2009 at 3:44 PM, Jean-Daniel Cryans<jdcryans@apache.org>
wrote:
>>>> Tim,
>>>>
>>>> Are you using the write buffer? See HTable.setAutoFlush and
>>>> HTable.setWriteBufferSize if not. This will help a lot.
>>>>
>>>> Also since you have only 4 machines, try setting the HDFS replication
>>>> factor lower than 3.
>>>>
>>>> J-D
>>>>
>>>> On Wed, Jul 22, 2009 at 8:26 AM, tim robertson<timrobertson100@gmail.com>
wrote:
>>>>> Hi all,
>>>>>
>>>>> I have a 70G sparsely populated tab file (74 columns) to load into 2
>>>>> column families in a single HBase table.
>>>>>
>>>>> I am running on my tiny dev cluster (4 Mac minis, 4G RAM, each running
>>>>> all Hadoop daemons and RegionServers) just to familiarise myself, while
>>>>> the proper rack is being set up.
>>>>>
>>>>> I wrote a MapReduce job where I load into HBase during the Map:
>>>>>  String rowID = UUID.randomUUID().toString();
>>>>>  Put row = new Put(rowID.getBytes());
>>>>>  // uses a properties file to map tab columns to column families
>>>>>  int fields = reader.readAllInto(splits, row);
>>>>>  context.setStatus("Map updating cell for row[" + rowID + "] with "
>>>>>      + fields + " fields");
>>>>>  table.put(row);
>>>>>
>>>>> Is this the preferred way to do this kind of loading or is a
>>>>> TableOutputFormat likely to outperform the Map version?
>>>>>
>>>>> [Knowing performance estimates are pointless on this cluster - I see
>>>>> 500 records per sec input, which is a bit disappointing.  I have
>>>>> default Hadoop and HBase config and had to put a ZK quorum on each to
>>>>> get HBase to start]
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Tim
>>>>>
>>>>
>>>
>>
>
