hudi-users mailing list archives

From Vinoth Chandar <vin...@apache.org>
Subject Re: [Announce] Clustering feature available in beta
Date Fri, 22 Jan 2021 07:09:09 GMT
This is really really promising! I think the gains will be much higher if
clustered over a larger window of commits!
We can keep improving this over time.

I'll be sure to link the results to the doc updates.

On Wed, Jan 20, 2021 at 10:40 PM Satish Kotha <satishkotha@uber.com.invalid>
wrote:

> Hello everyone,
>
> We see a ~60% improvement in query runtime for some datasets. See an example
> documented here
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation>.
> Please try out this feature and share any feedback.
> I have included commands to run async clustering in the example section
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-PerformanceEvaluation>.
> You could also set up inline clustering using commands in this section
> <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+freshness+and+query+performance#RFC19Clusteringdataforfreshnessandqueryperformance-Commandstoscheduleandrunclustering>.
>
> Thanks
> Satish
>
> On Tue, Dec 22, 2020 at 10:32 PM Vinoth Chandar <vinoth@apache.org> wrote:
>
> > Please help us test this more, before RC is cut! :)
> >
> > On Tue, Dec 22, 2020 at 10:23 PM Satish Kotha <satishkotha@uber.com.invalid>
> > wrote:
> >
> > > Hello all,
> > >
> > > The clustering feature landed <https://github.com/apache/hudi/pull/2263> on the
> > > master branch and is available in beta. This feature can be used to do the
> > > following:
> > > 1) Stitch small files into larger files
> > > 2) Change data layout on disk by sorting data using different columns (for
> > > query/storage optimization)
> > >
> > > If you are interested in the above use cases, I would appreciate it if you
> > > could try out this feature. I have included commands to run clustering in
> > > this section
> > > <https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+Clustering+data+for+speed+and+query+performance#RFC19Clusteringdataforspeedandqueryperformance-Commandstoscheduleandrunclustering>
> > > (along with caveats, as this feature is still in beta).
> > >
> > > Any feedback is welcome. I'm also in the #general room on Slack. Please feel
> > > free to ping me if you have any questions/comments.
> > >
> > > Thanks
> > > Satish
> > >
> >
>
