hadoop-common-user mailing list archives

From Per Steffensen <st...@designware.dk>
Subject Re: Routing and region deletes
Date Fri, 09 Dec 2011 07:07:19 GMT
Ahhh stupid me. I probably just want to use different tables for 
different days/months. I believe tables can fairly quickly be deleted on 

Regards, Per Steffensen
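
The table-per-period idea above can be sketched roughly as follows. This is only an illustration in plain Python with no HBase calls: the `records_` prefix, the `YYYYMM` naming scheme and the helper names are assumptions, not something taken from the thread. The 24-month retention comes from the "2 years of data" requirement in the original question.

```python
from datetime import date
from typing import Iterable, List

TABLE_PREFIX = "records_"   # hypothetical naming scheme
RETENTION_MONTHS = 24       # two years of history, as stated in the thread


def month_index(year: int, month: int) -> int:
    """Map a (year, month) pair to a single increasing integer."""
    return year * 12 + (month - 1)


def table_for(day: date) -> str:
    """Name of the monthly table a record timestamped `day` would be written to."""
    return f"{TABLE_PREFIX}{day.year:04d}{day.month:02d}"


def expired_tables(existing: Iterable[str], today: date) -> List[str]:
    """Tables whose month lies more than RETENTION_MONTHS before today's month."""
    cutoff = month_index(today.year, today.month) - RETENTION_MONTHS
    expired = []
    for name in existing:
        stamp = name[len(TABLE_PREFIX):]          # e.g. "200911"
        idx = month_index(int(stamp[:4]), int(stamp[4:6]))
        if idx < cutoff:
            expired.append(name)
    return expired
```

Each name returned by `expired_tables` would then be removed as a whole table (in the HBase shell, `disable` followed by `drop`), instead of issuing one delete per record.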

Per Steffensen wrote:
> Thanks for your reply!
> Michel Segel wrote:
>> Per Steffensen,
>> I would urge you to step away from the keyboard and rethink your design.
> Will do :-) But I would still like answers to my 
> questions - just pretend that my ideas are not so stupid and let me 
> know if it can be done.
>> It sounds like you want to replicate a date partition model similar 
>> to what you would do if you were attempting this with HBase.
>> HBase is not a relational database and you have a different way of 
>> doing things.
> I know
>> You could put the date/time stamp in the key such that your data is 
>> sorted by date.
> But I guess that would not guarantee that records with timestamps from 
> a specific day or month all exist in the same set of regions and that 
> records with timestamps from other days or months all exist outside 
> those regions, so that I can delete records from that day or month, 
> just by deleting the regions.
>> However, this would cause hot spots.  Think about how you access the 
>> data. It sounds like you access the more recent data more frequently 
>> than historical data.
> Not necessarily wrt reading, but certainly I (almost) only write new 
> records with timestamps from the current day/month.
>>   This is a bad idea in HBase.
>> (note: it may still make sense to do this ... You have to think more 
>> about the data and consider alternatives.)
>> I personally would hash the key for even distribution, again 
>> depending on the data access pattern.  (hashed data means you can't 
>> do range queries but again, it depends on what you are doing...)
>> You also have to think about how you purge the data. You don't just 
>> drop a region.
> I know that this is not the "default" way of deleting data, but is it 
> possible? I believe a region is basically just a folder with a set of 
> files, and deleting those would be a matter of a few ms. So if I can 
> route all records with timestamps from a certain day or month to a 
> designated set of regions, deleting all those records will be a matter 
> of deleting #regions-in-that-set folders on disk - very quick. The 
> alternative is to do 50 million+ single delete operations every day (or 
> 1.5 billion operations every month), and that will not even free up 
> space immediately, since the records will actually just be marked 
> deleted (in a new file) - space will not be freed before the next 
> compaction of the involved regions (see e.g. 
> http://outerthought.org/blog/465-ot.html).
>>  Doing a full table scan once a month to delete may not be a bad thing.
> But I don't believe one full table scan will be enough. For that to be 
> possible, I would at least have to be able to provide HBase with all 
> 1.5 billion records to delete in one "delete" call - that's probably 
> not possible :-)
>>  Again it depends on what you are doing...
>> Just my opinion. Others will have their own... Now I'm stepping away 
>> from the keyboard to get my morning coffee...
> Enjoy. Then I will consider leaving work (it's late afternoon in Europe)
>> :-)
>> Sent from a remote device. Please excuse any typos...
>> Mike Segel
>> On Dec 8, 2011, at 7:13 AM, Per Steffensen <steff@designware.dk> wrote:
>>> Hi
>>> The system we are going to work on will receive 50 million+ new 
>>> data records every day. We need to keep a history of 2 years of data 
>>> (that's 35+ billion data records in the storage all in all), and that 
>>> basically means that we also need to delete 50 million+ data records 
>>> every day, or e.g. 1.5 billion every month. We plan to store the 
>>> data records in HBase.
>>> Is it somehow possible to tell HBase to put (route) all data records 
>>> belonging to a specific date or month to a designated set of regions 
>>> (and route nothing else there), so that deleting all data belonging 
>>> to that day/month is basically deleting those regions entirely? And 
>>> is explicit deletion of entire regions possible at all?
>>> The reason I want to do this is that I expect it to be much faster 
>>> than doing explicit deletion record by record of 50 million+ records 
>>> every day.
>>> Regards, Per Steffensen
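
The two row-key layouts discussed in the thread (date in the key versus a hashed key) can be compared with a small sketch. It is plain Python and does not touch HBase. The `|` separator, the bucket count, the use of MD5 and the function names are all assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of hash buckets


def date_prefixed_key(day: str, record_id: str) -> bytes:
    """Date-first key: all records of one day form one contiguous sorted range."""
    return f"{day}|{record_id}".encode()


def salted_key(day: str, record_id: str) -> bytes:
    """Hash-bucket-first key: writes for the same day spread over NUM_BUCKETS ranges."""
    digest = hashlib.md5(record_id.encode()).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{day}|{record_id}".encode()
```

With the date-first layout, a day's records sit next to each other in sort order, which is what makes range scans by date possible but also concentrates current-day writes at one end of the key space. With the bucket-first layout, writes are spread out, at the cost that reading one day means one range scan per bucket.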
