pinot-commits mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [pinot] suddendust commented on issue #7229: Pinot Long Term Data Store
Date Wed, 04 Aug 2021 19:34:01 GMT

suddendust commented on issue #7229:
URL: https://github.com/apache/pinot/issues/7229#issuecomment-892919348


   Thanks @mcvsubbu for the review and the pointers. As for the FSM, this is the new state
I proposed:
   
   <img width="796" alt="Screenshot 2021-08-05 at 12 00 38 AM" src="https://user-images.githubusercontent.com/84911643/128235324-23760e70-861d-43ca-b6ab-a6e338ba20b6.png">
   
   I introduced a new state because it looked like a cleaner way for the broker to determine
that a segment has been purged and moved to the deep-store. The broker can certainly also
determine this by first checking that the segment is absent and then looking at its S3
location in the segment config; the code would just be a bit less clean in that case. With
the new state, I was thinking of defining an invariant that a segment moves to this state
_iff_ it was successfully uploaded AND its URL was successfully updated in its metadata. The
broker could then be sure that the segment was actually uploaded just by looking at the new
state, which can be helpful when the deep-store was [bypassed](https://cwiki.apache.org/confluence/display/PINOT/By-passing+deep-store+requirement+for+Realtime+segment+completion)
during commit for some reason. On second thought, though, this appears to add unnecessary complexity.
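
   To make the two options concrete, here is a rough sketch (in Python for brevity, not
actual Pinot code). The state name `PURGED_TO_DEEP_STORE`, the `SegmentMetadata` class and
all function names are hypothetical and only illustrate the invariant and the alternative
check described above:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SegmentState(Enum):
    # Illustrative subset of states; PURGED_TO_DEEP_STORE is a hypothetical
    # name for the proposed new state.
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"
    PURGED_TO_DEEP_STORE = "PURGED_TO_DEEP_STORE"


@dataclass
class SegmentMetadata:
    name: str
    download_url: Optional[str] = None  # deep-store location, e.g. an S3 URI


def next_state_after_purge(current, upload_ok, url_updated):
    """Invariant: enter the new state iff the upload succeeded AND the
    URL was successfully updated in the segment metadata."""
    if upload_ok and url_updated:
        return SegmentState.PURGED_TO_DEEP_STORE
    return current


def in_deep_store_by_state(state):
    # Option 1: the broker only looks at the new state.
    return state is SegmentState.PURGED_TO_DEEP_STORE


def in_deep_store_by_absence(present_on_servers, meta):
    # Option 2 (no new state): the segment is absent from the servers and
    # its metadata carries a deep-store location.
    return (not present_on_servers) and bool(meta.download_url)
```

   With option 1 the broker needs a single check; with option 2 it needs two lookups and
has to trust that a non-empty URL means the upload really happened.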
   
   We haven't really done a cost comparison of lazy-loading vs. `mmap` on local Pinot servers,
but I'll throw some numbers here. Our ingestion rate is increasing quite rapidly and we're
looking at around 4-5T/day of data in the next few months (these are conservative numbers).
With a retention period of 30 days (again, a minimum), we'll have to store around 150T worth
of segments on SSDs at any time. Storage costs can be prohibitive with this much data, not to
mention that all of this would serve only a tiny fraction of queries (< 10%). We'll try to do
a proper cost analysis of this today.
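
   For reference, the back-of-the-envelope arithmetic behind the storage figure, as a small
sketch. The function name and the `replicas` parameter are assumptions added for
illustration; the numbers above do not account for replication or compression:

```python
def retained_tb(daily_tb, retention_days, replicas=1):
    """Data held on local disks at steady state, in T.

    `replicas` is a hypothetical knob; the estimate above corresponds to 1.
    """
    return daily_tb * retention_days * replicas


low = retained_tb(4, 30)   # lower end of the 4-5T/day range
high = retained_tb(5, 30)  # upper end, matches the 150T figure
print(f"{low}T to {high}T on SSD for 30 days of retention")
```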
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org




