hbase-user mailing list archives

From Prakash Kadel <prakash.ka...@gmail.com>
Subject Re: coprocessor enabled put very slow, help please~~~
Date Wed, 20 Feb 2013 15:10:00 GMT
Sorry for all these unclear queries.

I turned off the WAL on both the doc and index tables.

In my system every document has a UUID (assigned before it comes into the system), and I just
use this UUID as the rowkey. So a duplicate basically means a document with the same ID; the
contents are not compared.
For a poem like Mary had a little lamb, the whole poem would probably be counted as a single
document. If such a document comes in, the count of each word in the poem would be incremented
by the number of times it occurs in the poem.
If multiple docs have the same content but different IDs, I just treat them as different docs
and do the increments.
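
A minimal sketch of the rule described above, in plain Java with in-memory collections standing in for the doc table and the word-count index. It does not use any HBase API; the class and method names (WordCountIndex, addDocument, count) are hypothetical, and splitting words on non-word characters is an assumption about tokenization:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the doc table and the word-count index table.
class WordCountIndex {
    // Plays the role of the doc table's rowkeys (one UUID per document).
    private final Set<String> seenDocIds = new HashSet<>();
    // Plays the role of the index table: word -> total count.
    private final Map<String, Long> wordCounts = new HashMap<>();

    // Returns true if the doc ID was new and its words were counted,
    // false if a doc with the same ID was already stored (no increments).
    boolean addDocument(String docId, String content) {
        if (!seenDocIds.add(docId)) {
            return false; // duplicate ID: skip, regardless of content
        }
        // Assumed tokenization: lowercase, split on runs of non-word characters.
        for (String word : content.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                wordCounts.merge(word, 1L, Long::sum);
            }
        }
        return true;
    }

    long count(String word) {
        return wordCounts.getOrDefault(word.toLowerCase(), 0L);
    }
}
```

Adding the same ID twice leaves the counts unchanged, while the same text under a second ID increments them again, which is the behavior described above.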


Sincerely,
Prakash Kadel

On Feb 20, 2013, at 11:14 PM, Michel Segel <michael_segel@hotmail.com> wrote:

> 
> What happens when you have a poem like Mary had a little lamb?
> 
> Did you turn off the WAL on both table inserts, or just the index?
> 
> If you want to avoid processing duplicate docs... You could do this a couple of ways. The simplest way is to record the doc ID and a checksum for the doc. If the doc you are processing matches... You can simply do a NOOP for the lines in the doc. (This isn't the fastest, but it's easy.)
> The other is to run a preprocess which removes duplicate docs from your directory, and you then process the docs...
> 
> Third thing... Do a code review. Sloppy code will kill performance...
> 
> Sent from a remote device. Please excuse any typos...
> 
> Mike Segel
> 
> On Feb 20, 2013, at 5:26 AM, Prakash Kadel <prakash.kadel@gmail.com> wrote:
> 
>> Michael,
>> In fact, I don't care about latency between the doc write and the index write.
>> Today I did some tests.
>> It turns out turning off the WAL does speed up the writes by about a factor of 2.
>> Interestingly, enabling the bloom filter did little to improve the checkAndPut.
>> 
>> earlier you mentioned
>>>>>> The OP doesn't really get in to the use case, so we don't know why the Check and Put in the M/R job.
>>>>>> He should just be using put() and then a postPut().
>> 
>> 
>> The main reason I use checkAndPut is to make sure the word count index doesn't get duplicate increments when duplicate documents come in. Additionally, I also need to dump duplicate-free docs to HDFS for a legacy system that we have in place.
>> Is there some way to avoid checkAndPut?
>> 
>> 
>> Sincerely,
>> Prakash 
>> 
>> On Feb 20, 2013, at 10:00 PM, Michel Segel <michael_segel@hotmail.com> wrote:
>> 
>>> I was suggesting removing the write to WAL on your write to the index table only.
>>> 
>>> The thing you have to realize is that true low latency systems use databases as a sink. It's the end of the line, so to speak.
>>> 
>>> So if you're worried about a small latency between the write to your doc table and then the write of your index, you are designing the wrong system.
>>> 
>>> Consider that it takes some time t to write the base record and then to write the indexes.
>>> For that period, you have a Schrödinger's cat problem as to whether the row exists or not. Since HBase lacks transactions and ACID, trying to write a solution where you require the low latency... You are using the wrong tool.
>> 
