spark-user mailing list archives

From nib...@free.fr
Subject Re: RE: Re: HDFS small file generation problem
Date Sat, 03 Oct 2015 14:48:56 GMT
Thanks a lot. Why did you say "the most recent version"?

----- Original Message -----
From: "Jörn Franke" <jornfranke@gmail.com>
To: "nibiau" <nibiau@free.fr>
Cc: bantonid@gmail.com, user@spark.apache.org
Sent: Saturday, October 3, 2015 13:56:43
Subject: Re: RE: Re: HDFS small file generation problem



Yes, the most recent version, or you can use Phoenix on top of HBase. I recommend trying
out both and seeing which one is the most suitable.
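
For reference, the Phoenix route might look roughly like the sketch below. The table name
and columns are assumptions, not anything from this thread; the point is that Phoenix's
UPSERT writes or overwrites the row for a given key, which matches the "keep only the
latest message per product ID" requirement:

    CREATE TABLE product_events (
        product_id VARCHAR PRIMARY KEY,  -- one row per product
        payload    VARCHAR               -- latest message body
    );

    -- Re-running this for the same product_id overwrites the previous row
    UPSERT INTO product_events (product_id, payload)
    VALUES ('p-123', '<latest message body>');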



On Sat, Oct 3, 2015 at 13:13, nibiau <nibiau@free.fr> wrote:

Hello,
Thanks. If I understand correctly, Hive can be usable in my context?


Nicolas

Sent from my Samsung mobile device

Jörn Franke <jornfranke@gmail.com> wrote:



If you use transactional tables in Hive together with INSERT, UPDATE, and DELETE, then it does
the "concatenate" for you automatically at regular intervals. Currently this works only
with tables stored as ORC.
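
For reference, a transactional table along the lines Jörn describes might look like the
sketch below (table name and schema are assumptions, and the cluster must have Hive's
transaction manager and compactor enabled). The table has to be stored as ORC, bucketed,
and flagged as transactional; Hive's background compactor then periodically merges the
delta files that INSERT/UPDATE/DELETE produce:

    CREATE TABLE product_events (
        product_id STRING,
        payload    STRING
    )
    CLUSTERED BY (product_id) INTO 8 BUCKETS  -- ACID tables must be bucketed
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');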




On Sat, Oct 3, 2015 at 11:45, <nibiau@free.fr> wrote:

Hello,
So, is Hive a solution for my need?
- I receive small messages (10KB) identified by an ID (a product ID, for example)
- Each message I receive is the latest picture of its product ID, so I basically just want to
store the last picture of each product inside HDFS in order to run batch processing on it later.

If I use Hive, I suppose I have to INSERT and UPDATE records and periodically CONCATENATE.
After a CONCATENATE, I suppose the records are still updatable?

Thanks for confirming whether this can be a solution for my use case, or suggesting any other idea.

Thanks a lot ! 
Nicolas 


----- Original Message -----
From: "Jörn Franke" <jornfranke@gmail.com>
To: nibiau@free.fr, "Brett Antonides" <bantonid@gmail.com>
Cc: user@spark.apache.org
Sent: Saturday, October 3, 2015 11:17:51
Subject: Re: HDFS small file generation problem



You can update data in Hive if you use the ORC format.
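
Concretely, against a transactional ORC table (such as the one sketched earlier on this
page), the update is plain HiveQL; the names here are the same assumed ones:

    -- Replace the stored message for one product with the newly received one
    UPDATE product_events
    SET payload = '<new message body>'
    WHERE product_id = 'p-123';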



On Sat, Oct 3, 2015 at 10:42, <nibiau@free.fr> wrote:

Hello, 
Finally, Hive is not a solution, as I cannot update the data.
And for archive files I think it would be the same issue.
Any other solutions ? 

Nicolas 

----- Original Message -----
From: nibiau@free.fr
To: "Brett Antonides" <bantonid@gmail.com>
Cc: user@spark.apache.org
Sent: Friday, October 2, 2015 18:37:22
Subject: Re: HDFS small file generation problem

OK thanks, but can I also update data instead of inserting data?

----- Original Message -----
From: "Brett Antonides" <bantonid@gmail.com>
To: user@spark.apache.org
Sent: Friday, October 2, 2015 18:18:18
Subject: Re: HDFS small file generation problem

I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext;
a sketch follows below.
* Create a table in Hive stored as an ORC file (I recommend using partitioning too)
* Use SQLContext.sql to insert data into the table
* Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge your many small
files into larger files optimized for your HDFS block size
* Since the CONCATENATE command operates on the files in place, it is transparent to any
downstream processing
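
A minimal sketch of the statements behind those steps, each issued through SQLContext.sql
(these are Hive statements, so in practice this needs Spark's Hive support enabled). The
table name, schema, partitioning column, and staging source are assumptions; on a
partitioned table, CONCATENATE targets one partition at a time:

    CREATE TABLE events (id STRING, payload STRING)
    PARTITIONED BY (dt STRING)
    STORED AS ORC;

    INSERT INTO TABLE events PARTITION (dt = '2015-10-02')
    SELECT id, payload FROM incoming_batch;

    -- Run periodically: merges the partition's many small ORC files in place
    ALTER TABLE events PARTITION (dt = '2015-10-02') CONCATENATE;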

Cheers,
Brett

On Fri, Oct 2, 2015 at 3:48 PM, <nibiau@free.fr> wrote:


Hello, 
Yes, but:
- In the Java API I don't find an API to create an HDFS archive
- As soon as I receive a message (with a messageID) I need to replace the old existing file
with the new one (the file name being the messageID). Is that possible with an archive?

Thanks
Nicolas 

----- Original Message -----
From: "Jörn Franke" <jornfranke@gmail.com>
To: nibiau@free.fr, "user" <user@spark.apache.org>
Sent: Monday, September 28, 2015 23:53:56
Subject: Re: HDFS small file generation problem

Use Hadoop archive.
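
For reference, archiving is done with the hadoop archive command-line tool, roughly as
below (the archive name and paths are placeholders). Note that a HAR is immutable once
written, which matters for the replace-by-messageID requirement discussed above:

    hadoop archive -archiveName events.har -p /user/nicolas/events /user/nicolas/archived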



On Sun, Sep 27, 2015 at 15:36, <nibiau@free.fr> wrote:

Hello,
I'm still investigating the small file generation problem caused by my Spark Streaming jobs.

Indeed, my Spark Streaming jobs receive a lot of small events (avg 10KB), and I have
to store them inside HDFS in order to process them with Pig jobs on demand.
The problem is that I generate a lot of small files in HDFS (several million) and
that can be problematic.
I investigated using HBase or archive files, but in the end I don't want to use them.
So, what about this solution (a sketch follows the list):
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into a big daily file
- I launch my Pig jobs on this big file?
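
For what it's worth, the nightly merge step could be expressed in Hive rather than
hand-written code; a rough sketch with assumed names and layout: point an external table
at the day's directory of small files, then rewrite it into a consolidated ORC table
(INSERT OVERWRITE produces a small number of large files), assuming an events_daily table
partitioned by dt already exists:

    CREATE EXTERNAL TABLE events_raw (id STRING, payload STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/events/2015-09-27';

    -- Rewrites the day's many small files as a few large ORC files
    INSERT OVERWRITE TABLE events_daily PARTITION (dt = '2015-09-27')
    SELECT id, payload FROM events_raw;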

Another question I have:
- Is it possible to append to the big (daily) file by adding my events on the fly?

Thanks a lot
Nicolas 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

