hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Timestamp as a key good practice?
Date Fri, 22 Jun 2012 19:43:54 GMT
Ok. So if I understand correctly, I need:
PC1 => HMaster (HBase), JobTracker (Hadoop), Name Node (Hadoop), and
ZooKeeper (ZK)
PC2 => Secondary Name Node (Hadoop)
PC3 to x => Data Node (Hadoop), Task Tracker (Hadoop), Region Server (HBase)

For PC2, should I run ZooKeeper, JobTracker and a master too? Can I have
2 masters? Or should I just run the secondary name node?

2012/6/21, Michael Segel <michael_segel@hotmail.com>:
> If you have a really small cluster...
> You can put your HMaster, JobTracker, Name Node, and ZooKeeper all on a
> single node. (Secondary too)
> Then you have Data Nodes that run DN, TT, and RS.
>
> That would solve any ZK RS problems.
>
> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>
>> Hi Mike, Hi Rob,
>>
>> Thanks for your replies and advice. It seems I'm now due for some
>> implementation. I'm reading Lars' book first, and when I'm done I will
>> start with the coding.
>>
>> I already have my ZooKeeper/Hadoop/HBase running and, based on the
>> first pages I read, I already know it's not set up well since I have put
>> a DataNode and a ZooKeeper server on ALL the servers ;) So, more
>> reading for me over the next few days, and then I will start.
>>
>> Thanks again!
>>
>> JM
>>
>> 2012/6/16, Rob Verkuylen <rob@verkuylen.net>:
>>> Just to add from my experiences:
>>>
>>> Yes, hotspotting is bad, but so are devops headaches. A reasonable
>>> machine can handle 3,000-4,000 puts a second with ease, and a simple
>>> timerange scan can give you the records you need. I have my doubts you
>>> will be hitting those numbers anytime soon. A simple setup will get your
>>> PoC going, and then you can scale when you need to scale.
>>>
>>> Rob
>>>
>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
>>> <michael_segel@hotmail.com> wrote:
>>>
>>>> Jean-Marc,
>>>>
>>>> You indicated that you didn't want to do full table scans when you want
>>>> to find out which files hadn't been touched since X time has passed.
>>>> (X could be months, weeks, days, hours, etc ...)
>>>>
>>>> So here's the thing.
>>>> First, I am not convinced that you will have hot spotting.
>>>> Second, you end up having to do 26 scans instead of one, and then you
>>>> need to join the result sets.
>>>>
>>>> Not really a good solution if you think about it.
>>>>
>>>> Oh, and I don't believe that you will be hitting a single region,
>>>> although you may hit a region hard.
>>>> (Your second table's key is the timestamp of the last update to the
>>>> file. If the file hadn't been touched in a week, there's a good chance
>>>> that at scale it won't be in the same region as a file that had recently
>>>> been touched.)
>>>>
>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only
>>>> be applied to a subset of problems.
>>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>>
>>>>
>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>
>>>>> Let's imagine the timestamp is "123456789".
>>>>>
>>>>> If I salt it with a letter from 'a' to 'z', then it will always be split
>>>>> across a few RegionServers; I will have keys like "t123456789". The issue
>>>>> is that I will have to do 26 queries to be able to find all the entries:
>>>>> I will need to query from a000000000 to axxxxxxxxx, then the same for 'b',
>>>>> and so on.
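>>>>>
>>>>> Just so I'm clear on my own idea, the 26 queries would look roughly like
>>>>> this (an untested sketch against the 0.94 client API; the "file_index"
>>>>> table name and the 9-digit zero-padded timestamp format are only
>>>>> assumptions for the example):
>>>>>
>>>>> import java.io.IOException;
>>>>> import java.util.ArrayList;
>>>>> import java.util.List;
>>>>> import org.apache.hadoop.conf.Configuration;
>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>> import org.apache.hadoop.hbase.client.ResultScanner;
>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>
>>>>> public class SaltedIndexScan {
>>>>>   // Scan every salt bucket ('a'..'z') for index rows with a timestamp
>>>>>   // older than 'cutoff' and merge the results client side.
>>>>>   public static List<Result> filesOlderThan(Configuration conf, long cutoff)
>>>>>       throws IOException {
>>>>>     List<Result> merged = new ArrayList<Result>();
>>>>>     HTable index = new HTable(conf, "file_index"); // hypothetical name
>>>>>     try {
>>>>>       for (char salt = 'a'; salt <= 'z'; salt++) {
>>>>>         // Row keys look like "t123456789": salt char + padded timestamp.
>>>>>         byte[] start = Bytes.toBytes(salt + "000000000");
>>>>>         byte[] stop = Bytes.toBytes(salt + String.format("%09d", cutoff));
>>>>>         Scan scan = new Scan(start, stop); // stop row is exclusive
>>>>>         ResultScanner scanner = index.getScanner(scan);
>>>>>         try {
>>>>>           for (Result r : scanner) {
>>>>>             merged.add(r);
>>>>>           }
>>>>>         } finally {
>>>>>           scanner.close();
>>>>>         }
>>>>>       }
>>>>>     } finally {
>>>>>       index.close();
>>>>>     }
>>>>>     return merged;
>>>>>   }
>>>>> }
>>>>>
>>>>> (Each bucket comes back sorted by timestamp, but the 26 lists would still
>>>>> need a client-side merge if I want one global order.)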
>>>>>
>>>>> So what's worse? Am I better off dealing with the hotspotting? Salting
>>>>> the key myself? Or what about using something like HBaseWD?
>>>>>
>>>>> JM
>>>>>
>>>>> 2012/6/16, Michel Segel <michael_segel@hotmail.com>:
>>>>>> You can't salt the key in the second table.
>>>>>> By salting the key, you lose the ability to do range scans, which is
>>>>>> what you want to do.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari
>>>>>> <jean-marc@spaggiari.org> wrote:
>>>>>>
>>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>>> hotspotting, I will try to salt the key in the 2nd table and see the
>>>>>>> results.
>>>>>>>
>>>>>>> Yesterday I finished installing my 4-server cluster with old machines.
>>>>>>> It's slow, but it's working. So I will do some testing.
>>>>>>>
>>>>>>> You are recommending to modify the timestamp to be to the second or
>>>>>>> minute and have more entries per row. Is that because it's better to
>>>>>>> have more columns than rows? Or is it more because that will allow a
>>>>>>> more "squared" pattern (lots of rows, lots of columns), which is more
>>>>>>> efficient?
>>>>>>>
>>>>>>> JM
>>>>>>>
>>>>>>> 2012/6/15, Michael Segel <michael_segel@hotmail.com>:
>>>>>>>> Thought about this a little bit more...
>>>>>>>>
>>>>>>>> You will want two tables for a solution.
>>>>>>>>
>>>>>>>> Table 1:  Key: Unique ID
>>>>>>>>           Column: File Path           Value: full path to the file
>>>>>>>>           Column: Last Update Time    Value: timestamp
>>>>>>>>
>>>>>>>> Table 2:  Key: Last Update Time (the timestamp)
>>>>>>>>           Column 1-N: Unique ID       Value: full path to the file
>>>>>>>>
>>>>>>>> Now if you want to get fancy, in Table 1 you could use the timestamp
>>>>>>>> on the File Path column to hold the last update time.
>>>>>>>> But it's probably easier for you to start by keeping the data as a
>>>>>>>> separate column and ignoring the timestamps on the columns for now.
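>>>>>>>>
>>>>>>>> Sketched as code, the two tables could be created like this (from
>>>>>>>> memory against the 0.94 admin API, untested; the table and column
>>>>>>>> family names are just placeholders):
>>>>>>>>
>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>>>>> import org.apache.hadoop.hbase.HColumnDescriptor;
>>>>>>>> import org.apache.hadoop.hbase.HTableDescriptor;
>>>>>>>> import org.apache.hadoop.hbase.client.HBaseAdmin;
>>>>>>>>
>>>>>>>> public class CreateFileTables {
>>>>>>>>   public static void main(String[] args) throws Exception {
>>>>>>>>     Configuration conf = HBaseConfiguration.create();
>>>>>>>>     HBaseAdmin admin = new HBaseAdmin(conf);
>>>>>>>>
>>>>>>>>     // Table 1: one row per file, keyed by the unique file ID.
>>>>>>>>     HTableDescriptor files = new HTableDescriptor("files");
>>>>>>>>     files.addFamily(new HColumnDescriptor("info")); // path, last_update
>>>>>>>>     admin.createTable(files);
>>>>>>>>
>>>>>>>>     // Table 2: one row per last-update timestamp, one column per file.
>>>>>>>>     HTableDescriptor index = new HTableDescriptor("files_by_time");
>>>>>>>>     index.addFamily(new HColumnDescriptor("id"));
>>>>>>>>     admin.createTable(index);
>>>>>>>>
>>>>>>>>     admin.close();
>>>>>>>>   }
>>>>>>>> }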
>>>>>>>>
>>>>>>>> Note the following:
>>>>>>>>
>>>>>>>> 1) I used the notation Column 1-N to reflect that for a given
>>>>>>>> timestamp you may or may not have multiple files that were updated.
>>>>>>>> (You weren't specific as to the scale.)
>>>>>>>> This is a good example of HBase's column-oriented approach, where you
>>>>>>>> may or may not have a column. It doesn't matter. :-) You could also
>>>>>>>> modify the timestamp to be to the second or minute and have more
>>>>>>>> entries per row. It doesn't matter. You insert based on
>>>>>>>> timestamp:columnName, value, so you will add a column to this table.
>>>>>>>>
>>>>>>>> 2) First prove that the logic works. You insert/update Table 1 to
>>>>>>>> capture the ID of the file and its last update time. You then delete
>>>>>>>> the old timestamp entry in Table 2, then insert the new entry in
>>>>>>>> Table 2.
>>>>>>>>
>>>>>>>> 3) You store Table 2 in ascending order. Then, when you want to find
>>>>>>>> your last 500 entries, you start a scan at 0x000 and limit the scan to
>>>>>>>> 500 rows. Note that you may or may not have multiple entries per row,
>>>>>>>> so as you walk through the result set, you count the number of columns
>>>>>>>> and stop when you have 500 columns, regardless of the number of rows
>>>>>>>> you've processed.
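>>>>>>>>
>>>>>>>> The "count columns until 500" walk, again as a rough, untested sketch
>>>>>>>> with the same placeholder names:
>>>>>>>>
>>>>>>>> import java.io.IOException;
>>>>>>>> import java.util.ArrayList;
>>>>>>>> import java.util.List;
>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>> import org.apache.hadoop.hbase.KeyValue;
>>>>>>>> import org.apache.hadoop.hbase.client.HTable;
>>>>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>>>>> import org.apache.hadoop.hbase.client.ResultScanner;
>>>>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>>>>>
>>>>>>>> public class OldestFiles {
>>>>>>>>   // Scan Table 2 from the lowest timestamp and collect file IDs (the
>>>>>>>>   // column qualifiers) until we have 500 of them.
>>>>>>>>   public static List<byte[]> oldest500(Configuration conf)
>>>>>>>>       throws IOException {
>>>>>>>>     List<byte[]> fileIds = new ArrayList<byte[]>();
>>>>>>>>     HTable index = new HTable(conf, "files_by_time");
>>>>>>>>     try {
>>>>>>>>       Scan scan = new Scan();   // starts at the first row of the table
>>>>>>>>       scan.setCaching(100);     // fetch rows in batches
>>>>>>>>       ResultScanner scanner = index.getScanner(scan);
>>>>>>>>       try {
>>>>>>>>         for (Result row : scanner) {
>>>>>>>>           for (KeyValue kv : row.raw()) {
>>>>>>>>             fileIds.add(kv.getQualifier());
>>>>>>>>             if (fileIds.size() >= 500) {
>>>>>>>>               return fileIds;
>>>>>>>>             }
>>>>>>>>           }
>>>>>>>>         }
>>>>>>>>       } finally {
>>>>>>>>         scanner.close();
>>>>>>>>       }
>>>>>>>>     } finally {
>>>>>>>>       index.close();
>>>>>>>>     }
>>>>>>>>     return fileIds;
>>>>>>>>   }
>>>>>>>> }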
>>>>>>>>
>>>>>>>> This should solve your problem and be pretty efficient.
>>>>>>>> You can then work out the coprocessors and add them to the solution
>>>>>>>> to be even more efficient.
>>>>>>>>
>>>>>>>>
>>>>>>>> With respect to 'hot-spotting', it can't be helped. You could hash
>>>>>>>> your unique ID in Table 1; this will reduce the potential of a hotspot
>>>>>>>> as the table splits.
>>>>>>>> On Table 2, because you have temporal data and you want to efficiently
>>>>>>>> scan a small portion of the table based on size, you will always scan
>>>>>>>> the first block. However, as data rolls off and compaction occurs, you
>>>>>>>> will probably have to do some cleanup. I'm not sure how HBase handles
>>>>>>>> regions that no longer contain data. When you compact an empty region,
>>>>>>>> does it go away?
>>>>>>>>
>>>>>>>> By switching to coprocessors, you now limit the update access to the
>>>>>>>> second table, so you should still have pretty good performance.
>>>>>>>>
>>>>>>>> You may also want to look at Asynchronous HBase; however, I don't know
>>>>>>>> how well it will work with coprocessors, or whether you want to
>>>>>>>> perform async operations in this specific use case.
>>>>>>>>
>>>>>>>> Good luck, HTH...
>>>>>>>>
>>>>>>>> -Mike
>>>>>>>>
>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>
>>>>>>>>> Hi Michael,
>>>>>>>>>
>>>>>>>>> For now this is more a proof of concept than a production
>>>>>>>>> application. And if it works, it should grow a lot, and the database
>>>>>>>>> will easily end up over 1B rows. Each individual server will have to
>>>>>>>>> send its own information to one centralized server, which will insert
>>>>>>>>> it into a database. That's why it needs to be very quick, and that's
>>>>>>>>> why I'm looking in HBase's direction. I tried with some relational
>>>>>>>>> databases with 4M rows in the table, but the insert time is too slow
>>>>>>>>> when I have to introduce entries in bulk. Also, HBase's ability to
>>>>>>>>> keep only the cells with values will allow me to save a lot of disk
>>>>>>>>> space (future projects).
>>>>>>>>>
>>>>>>>>> I'm not yet used to HBase and there are still many things I need to
>>>>>>>>> understand, but until I'm able to create a solution and test it, I
>>>>>>>>> will continue to read, learn and try that way. Then at the end I will
>>>>>>>>> be able to compare the 2 options I have (HBase or relational) and
>>>>>>>>> decide based on the results.
>>>>>>>>>
>>>>>>>>> So yes, your reply helped, because it gives me a way to achieve this
>>>>>>>>> goal (using co-processors). I don't know yet how this part works, so
>>>>>>>>> I will dig into the documentation for it.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> JM
>>>>>>>>>
>>>>>>>>> 2012/6/14, Michael Segel <michael_segel@hotmail.com>:
>>>>>>>>>> Jean-Marc,
>>>>>>>>>>
>>>>>>>>>> You do realize that this really isn't a good use case for HBase,
>>>>>>>>>> assuming that what you are describing is a standalone system.
>>>>>>>>>> It would be easier and better if you just used a simple relational
>>>>>>>>>> database.
>>>>>>>>>>
>>>>>>>>>> Then you would have your table with an ID, and a secondary index on
>>>>>>>>>> the timestamp.
>>>>>>>>>> Retrieve the data in ascending order by timestamp and take the top
>>>>>>>>>> 500 off the list.
>>>>>>>>>>
>>>>>>>>>> If you insist on using HBase, yes, you will have to have a
>>>>>>>>>> secondary table.
>>>>>>>>>> Then, using co-processors...
>>>>>>>>>> When you update the row in your base table, you then get() the row
>>>>>>>>>> in your index by timestamp, removing the column for that rowid.
>>>>>>>>>> Add the new column to the timestamp row.
>>>>>>>>>>
>>>>>>>>>> As you put it.
>>>>>>>>>>
>>>>>>>>>> Now you can just do a partial scan on your index. Because your index
>>>>>>>>>> table is so small... you shouldn't worry about hotspots.
>>>>>>>>>> You may just want to rebuild your index every so often...
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> -Mike
>>>>>>>>>>
>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your feedback. Here are more details to describe what
>>>>>>>>>>> I'm trying to achieve.
>>>>>>>>>>>
>>>>>>>>>>> My goal is to store information about files in the database. I
>>>>>>>>>>> need to check the oldest files in the database to refresh their
>>>>>>>>>>> information.
>>>>>>>>>>>
>>>>>>>>>>> The key is an 8-byte ID of the server in the network hosting the
>>>>>>>>>>> file + the MD5 of the file path. The total is a 24-byte key.
>>>>>>>>>>>
>>>>>>>>>>> So each time I look at a file and gather the information, I
>>>>>>>>>>> update its row in the database based on the key, including a
>>>>>>>>>>> "last_update" field. I can calculate this key for any file on the
>>>>>>>>>>> drives.
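>>>>>>>>>>>
>>>>>>>>>>> Roughly, the 24-byte key is built like this (a simplified sketch,
>>>>>>>>>>> not my exact code):
>>>>>>>>>>>
>>>>>>>>>>> import java.security.MessageDigest;
>>>>>>>>>>> import java.security.NoSuchAlgorithmException;
>>>>>>>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>>>>>>>>
>>>>>>>>>>> public class FileKey {
>>>>>>>>>>>   // 8-byte server ID + 16-byte MD5 of the file path = 24-byte key.
>>>>>>>>>>>   public static byte[] rowKey(long serverId, String filePath)
>>>>>>>>>>>       throws NoSuchAlgorithmException {
>>>>>>>>>>>     byte[] server = Bytes.toBytes(serverId);               // 8 bytes
>>>>>>>>>>>     MessageDigest md5 = MessageDigest.getInstance("MD5");
>>>>>>>>>>>     byte[] pathHash = md5.digest(Bytes.toBytes(filePath)); // 16 bytes
>>>>>>>>>>>     return Bytes.add(server, pathHash);                    // 24 bytes
>>>>>>>>>>>   }
>>>>>>>>>>> }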
>>>>>>>>>>>
>>>>>>>>>>> In order to know which files I need to check in the network, I
>>>>>>>>>>> need to scan the table by the "last_update" field. So the idea is
>>>>>>>>>>> to build another table which contains the last_update as the key
>>>>>>>>>>> and the file IDs as columns. (Here is the hotspotting.)
>>>>>>>>>>>
>>>>>>>>>>> Each time I work on a file, I will have to update the main table
>>>>>>>>>>> by ID, remove the cell from the second table (the index), and put
>>>>>>>>>>> it back with the new "last_update" key.
>>>>>>>>>>>
>>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>>>>>>>> 2) I update the information for those 500 files (bulk update).
>>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>>
>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key.
>>>>>>>>>>> The distribution is almost perfect because I'm using a hash. The
>>>>>>>>>>> prefix is the server ID, but it's not always going to the same
>>>>>>>>>>> server since the order is driven by last_update. But this allows
>>>>>>>>>>> quick access to the list of files from one server.
>>>>>>>>>>> For 1, I expected to build this second table with "last_update" as
>>>>>>>>>>> the key.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the frequency, it really depends on the activity on the
>>>>>>>>>>> network, but it should be "often". The faster the database updates
>>>>>>>>>>> are, the more up to date I will be able to keep it.
>>>>>>>>>>>
>>>>>>>>>>> JM
>>>>>>>>>>>
>>>>>>>>>>> 2012/6/14, Michael Segel <michael_segel@hotmail.com>:
>>>>>>>>>>>> Actually I think you should revisit your key design...
>>>>>>>>>>>>
>>>>>>>>>>>> Look at your access path to the data for each of the types of
>>>>>>>>>>>> queries you are going to run.
>>>>>>>>>>>> From your post:
>>>>>>>>>>>> "I have a table with a uniq key, a file path
and a "last
>>>>>>>>>>>> update"
>>>>>>>>>>>> field.
>>>>>>>>>>>>>>> I can easily find back the file
with the ID and find when it
>>>> has
>>>>>>>>>>>>>>> been
>>>>>>>>>>>>>>> updated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But what I need too is to find
the files not updated for
>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> certain period of time.
>>>>>>>>>>>> "
>>>>>>>>>>>> So your primary query is going to be against the key.
>>>>>>>>>>>> Not sure if you meant to say that your key was a composite key or
>>>>>>>>>>>> not... It sounds like your key is just the unique key and the rest
>>>>>>>>>>>> are columns in the table.
>>>>>>>>>>>>
>>>>>>>>>>>> The secondary query, or path to the data, is to find the files
>>>>>>>>>>>> that were not updated for more than a period of time.
>>>>>>>>>>>>
>>>>>>>>>>>> If you make your key temporal, that is, adding time as a
>>>>>>>>>>>> component of your key, you will end up creating new rows of data
>>>>>>>>>>>> while the old row still exists.
>>>>>>>>>>>> Not a good side effect.
>>>>>>>>>>>>
>>>>>>>>>>>> The other nasty side effect of using time as your key is that
>>>>>>>>>>>> you not only have the potential for hot spotting, but you also end
>>>>>>>>>>>> up creating splits that will never grow.
>>>>>>>>>>>>
>>>>>>>>>>>> How often are you going to ask to see the files that were not
>>>>>>>>>>>> updated in the last couple of days/minutes? If it's infrequent,
>>>>>>>>>>>> then you really shouldn't care if you have to do a complete table
>>>>>>>>>>>> scan.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read all
>>>>>>>>>>>>> of that now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Need to read here at the bottom:
>>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>> and here:
>>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> JM
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <otis_gospodnetic@yahoo.com>:
>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this
>>>>>>>>>>>>>> comes up often... Doug, maybe you could add it to the Ref
>>>>>>>>>>>>>> Guide?)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>>>>>>>> http://sematext.com/spm
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <jean-marc@spaggiari.org>
>>>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the
>>>>>>>>>>>>>>> documentation, and they say that it's not a good idea to have
>>>>>>>>>>>>>>> the timestamp as a key, because that will always load the same
>>>>>>>>>>>>>>> region until the timestamp reaches a certain value and moves to
>>>>>>>>>>>>>>> the next region (hotspotting).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a table with a unique key, a file path and a
>>>>>>>>>>>>>>> "last update" field.
>>>>>>>>>>>>>>> I can easily find the file with the ID and find when it has
>>>>>>>>>>>>>>> been updated.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But what I also need is to find the files not updated for
>>>>>>>>>>>>>>> more than a certain period of time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will
>>>>>>>>>>>>>>> have to do a full scan of the table, which might take a while.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind of
>>>>>>>>>>>>>>> secondary index). The key is the "last update", with one column
>>>>>>>>>>>>>>> family, and each column has the ID of the file with dummy
>>>>>>>>>>>>>>> content.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table and
>>>>>>>>>>>>>>> introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly,
>>>>>>>>>>>>>>> and I can find the files which need to be updated pretty
>>>>>>>>>>>>>>> quickly too. But it's hotspotting one region.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 situations:
>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>> 3) Key field swap/promotion.
>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other
>>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can do salting, like prefixing the timestamp with a number
>>>>>>>>>>>>>>> between 0 and 9, so that will distribute the load over 10
>>>>>>>>>>>>>>> servers. To find all the files with a timestamp below a
>>>>>>>>>>>>>>> specific value, I will need to run 10 requests instead of one.
>>>>>>>>>>>>>>> But when the load becomes too big for 10 servers, will I have
>>>>>>>>>>>>>>> to prefix with a number between 0 and 99, which means 100
>>>>>>>>>>>>>>> requests? And the more regions I have, the more requests I will
>>>>>>>>>>>>>>> have to do. Is that really a good approach?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few
>>>>>>>>>>>>>>> bytes from the path before the timestamp, but the issue will
>>>>>>>>>>>>>>> remain the same.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I looked at randomization, and I can't do that; otherwise I
>>>>>>>>>>>>>>> will have no way to retrieve the information I'm looking for.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So the question is: is there a good way to store the data so
>>>>>>>>>>>>>>> I can retrieve it based on the date?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
