OK, thanks. But can I also update data instead of inserting it?
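For example, would rewriting the affected partition with INSERT OVERWRITE be the way to do it? A rough sketch of what I have in mind (the table, column, and partition names are made up):

// Sketch: replace one partition's contents with corrected rows from a
// staging table. Assumes a Hive-enabled SQLContext (HiveContext in
// Spark 1.x); the tables `events` (partitioned by `day`) and
// `events_staging` are hypothetical.
sqlContext.sql(
  """INSERT OVERWRITE TABLE events PARTITION (day = '2015-10-02')
    |SELECT messageId, payload
    |FROM events_staging
    |WHERE day = '2015-10-02'""".stripMargin)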
----- Original Message -----
From: "Brett Antonides" <bantonid@gmail.com>
To: user@spark.apache.org
Sent: Friday, October 2, 2015 18:18:18
Subject: Re: HDFS small file generation problem
I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext.
* Create a table in Hive stored as an ORC file (I recommend using partitioning too)
* Use SQLContext.sql to insert data into the table
* Use SQLContext.sql to periodically run ALTER TABLE ... CONCATENATE to merge your many small files into larger files optimized for your HDFS block size (see the sketch below)
* Since the CONCATENATE command operates on the files in place, it is transparent to any downstream processing
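A minimal sketch of those steps (the table, column, and partition names are hypothetical; I'm assuming a Hive-enabled SQLContext, i.e. HiveContext in Spark 1.x):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc) // sc: your existing SparkContext

// 1. Partitioned Hive table stored as ORC.
sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS events (messageId STRING, payload STRING)
    |PARTITIONED BY (day STRING)
    |STORED AS ORC""".stripMargin)

// 2. Insert incoming data (here from a registered temp table).
sqlContext.sql(
  """INSERT INTO TABLE events PARTITION (day = '2015-10-02')
    |SELECT messageId, payload FROM incoming_events""".stripMargin)

// 3. Periodically merge the partition's many small ORC files in place.
sqlContext.sql(
  "ALTER TABLE events PARTITION (day = '2015-10-02') CONCATENATE")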
Cheers,
Brett
On Fri, Oct 2, 2015 at 3:48 PM, <nibiau@free.fr> wrote:
Hello,
Yes, but:
- In the Java API I can't find an API to create an HDFS archive
- As soon as I receive a message (with a messageID) I need to replace the old existing file with the new one (the file name being the messageID); is that possible with an archive? (see the sketch below)
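For comparison, with plain files I simply overwrite by messageID, along these lines (a minimal sketch; the paths and values are hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val messageId = "msg-00042" // hypothetical
val payload = "event body"  // hypothetical
// create(path, overwrite = true) replaces any existing file of that name
val out = fs.create(new Path(s"/events/$messageId"), true)
out.write(payload.getBytes("UTF-8"))
out.close()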
Tks
Nicolas
----- Original Message -----
From: "Jörn Franke" <jornfranke@gmail.com>
To: nibiau@free.fr, "user" <user@spark.apache.org>
Sent: Monday, September 28, 2015 23:53:56
Subject: Re: HDFS small file generation problem
Use hadoop archive
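A typical invocation archives a whole directory into a single HAR file (the paths here are only illustrative):

hadoop archive -archiveName events.har -p /events /archives

The archived files can then be read through the har:// filesystem, so downstream jobs see the same directory layout.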
On Sun, Sep 27, 2015 at 15:36, <nibiau@free.fr> wrote:
Hello,
I'm still investigating the small-file generation problem caused by my Spark Streaming jobs.
My Spark Streaming jobs receive a lot of small events (about 10 KB on average), and I have to store them in HDFS so I can process them with Pig jobs on demand.
The problem is that I generate a lot of small files in HDFS (several million), and that can be problematic.
I looked into using HBase or archive files, but in the end I don't want to go that way.
So, what about this solution:
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into one big daily file (see the sketch below)
- I launch my Pig jobs on this big file?
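For the nightly merge I have something like this in mind (a sketch; the paths and date layout are just examples):

import org.apache.spark.{SparkConf, SparkContext}

object DailyMerge {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyMerge"))
    val day = "2015-10-02" // hypothetical date/path layout
    sc.textFile(s"/events/$day/*")           // the day's many small files
      .coalesce(1)                           // collapse to one partition
      .saveAsTextFile(s"/events-daily/$day") // one large output file
    sc.stop()
  }
}

(coalesce(1) produces a single part file; for very large days a higher partition count, sized to the HDFS block size, would avoid one huge task.)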
One other question I have:
- Is it possible to append to a big (daily) file by adding my events on the fly? (see the sketch below)
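Something like this is what I imagine (a sketch; the path is hypothetical, and HDFS allows only one writer per file at a time, so concurrent appends from many streaming tasks would have to be serialized):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val daily = new Path("/events-daily/2015-10-02/events.log")
// append() requires an HDFS version/configuration that supports it
val out = if (fs.exists(daily)) fs.append(daily) else fs.create(daily)
out.write("event-payload\n".getBytes("UTF-8"))
out.close()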
Tks a lot
Nicolas
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org