hadoop-mapreduce-user mailing list archives

From Mark <static.void....@gmail.com>
Subject Help me improve this InputFormat/Loader
Date Thu, 11 Apr 2013 15:56:47 GMT
We have logs stored in HDFS under the layout /YEAR/MONTH/DAY. We are not guaranteed to have
every single day, though, so there will be gaps. We have some jobs that need to retrieve the
last X days of data, counting only days that actually contain data.

We have something like the following: https://gist.github.com/anonymous/5364554 (The naming
is a little off since it's technically not an InputFormat. Any ideas on a proper name?) Basically,
it retrieves all directories under a given path, sorts them in descending order, and limits the
result to the last X. It then delegates to FileInputFormat's setInputPaths. In case you are
wondering how we use it, here is an example of a custom PigStorage class we use here:

Although this works, I am thinking there may be a better/easier way to accomplish the same
thing. Any ideas?
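
To make the selection step concrete, here is a rough sketch of just the "sort descending, keep the last X" part, written against plain strings so it runs without Hadoop. It is not the code from the gist. The class name LatestDayDirs and the methods dayKey and latest are made up for illustration, and it assumes the last three path segments are numeric YEAR/MONTH/DAY and that the list of existing day directories has already been fetched from HDFS.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Pig.
final class LatestDayDirs {

    // Turns ".../2013/4/9" or ".../2013/04/09" into a sortable key (20130409).
    // Assumption: the last three path segments are numeric YEAR/MONTH/DAY.
    static int dayKey(String dir) {
        String[] parts = dir.split("/");
        int n = parts.length;
        int year = Integer.parseInt(parts[n - 3]);
        int month = Integer.parseInt(parts[n - 2]);
        int day = Integer.parseInt(parts[n - 1]);
        return year * 10000 + month * 100 + day;
    }

    // Returns at most `limit` directories, newest day first.
    // Only directories that were passed in are considered, so missing
    // days (gaps) never count against the limit.
    static List<String> latest(Collection<String> dayDirs, int limit) {
        List<String> sorted = new ArrayList<>(dayDirs);
        // Descending order: compare b against a.
        sorted.sort((a, b) -> Integer.compare(dayKey(b), dayKey(a)));
        int end = Math.min(limit, sorted.size());
        return new ArrayList<>(sorted.subList(0, end));
    }
}
```

Given the directories /2013/04/09, /2013/03/28, /2013/04/11 and /2012/12/31 and a limit of 2, latest returns /2013/04/11 followed by /2013/04/09. The resulting paths would then be handed to FileInputFormat's setInputPaths as described above.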


- M
