APSys 2021 Cloud Notes (Part 3)
Evaluating Reliability and Usage Characteristics of Flash-Based Storage in Production Systems
Speaker: Bianca Schroeder (Professor at University of Toronto, Canada)
With the rapid development of SSDs
Motivation
- Storage landscape has changed
- Stora
This talk focuses on another angle: storage (SSD) reliability.
This talk
- Take a look at flash reliability in the wild.
- How can we protect against flash failures?
Take a look at flash reliability
Issue 1: Annual replacement rate (ARR)
- [FAST '20] Drives in NetApp enterprise systems.
- The average ARR is 0.22%, but rates vary widely
- Much lower than for HDDs
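As a rough illustration of how an ARR figure is derived, the sketch below divides replacements by drive-years in the field. The function name and the fleet numbers are made up for illustration; they are chosen only so that the result matches the 0.22% average quoted above.

```python
def annual_replacement_rate(replacements: int, drive_days: float) -> float:
    """Return ARR as a percentage: replacements per drive-year of operation."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * replacements / drive_years


# Hypothetical fleet: 100,000 drives, each in service for a full year.
arr = annual_replacement_rate(replacements=220, drive_days=100_000 * 365)
print(f"ARR = {arr:.2f}%")
```

Counting drive-days rather than drives keeps the rate comparable across populations whose drives were deployed at different times.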
Issue 2: Rate of uncorrectable errors
- All SSDs experience bit errors.
- ECC corrects most of them, but some errors are uncorrectable
- Without data redundancy, the affected data is lost
Role of Type
SLC, MLC, eMLC and TLC
ARR differs by flash type; TLC shows the highest replacement rate.
Role of Age
- Usage affects the reliability of SSDs, due to wear-out of their cells.
- The hardware failure "bathtub curve" means SSDs have a higher failure rate in early life and again at wear-out.
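One common way to picture a bathtub curve is as the sum of a decreasing early-life hazard, a constant random-failure hazard, and an increasing wear-out hazard. The sketch below uses Weibull hazards with arbitrary, illustrative parameters; it is not fitted to any data from the talk.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1) for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Illustrative bathtub hazard; every parameter here is an assumption."""
    infant = weibull_hazard(t, shape=0.5, scale=2.0)    # shape < 1: decreasing
    random_failures = 0.02                               # constant floor
    wear_out = weibull_hazard(t, shape=5.0, scale=6.0)   # shape > 1: increasing
    return infant + random_failures + wear_out


for years in (0.1, 1.0, 3.0, 6.0):
    print(f"t = {years:>4} years  hazard = {bathtub_hazard(years):.3f}")
```

With these parameters the hazard is high at 0.1 years, lowest around 3 years, and rises again by 6 years.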
Role of Firmware Version
- Compare individual firmware versions within the same family
The firmware version has a tremendous impact on reliability (a factor of 3-10x)!
Data protection in enterprise systems
RAID
Single-parity RAID tolerates one drive failure; double-parity RAID tolerates up to two
- How common are double failures?
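To make the parity idea concrete, the sketch below uses byte-wise XOR parity, the textbook construction behind single-parity RAID; it is not NetApp's implementation. One lost block can be rebuilt from the survivors, which is why a second failure in the same group before the rebuild completes is the case that matters.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A toy stripe of three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] fails: rebuild its block from the survivors.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```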
RAID group size
Failure correlations within a RAID group
- How frequently do double failures occur?
- How quickly after the first failure does the second one follow?
- From the CDF (time difference in days vs. cumulative probability), 46% of successive failures occur on the same day!
- How are they related to RAID group size?
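A CDF of this kind can be computed from the replacement dates within each RAID group. The sketch below uses invented dates (not the study's data) and hypothetical helper names, and reports the fraction of successive failures that fall on the same day.

```python
from datetime import date


def successive_gaps(failure_dates: list[date]) -> list[int]:
    """Days between each failure and the next one in the same RAID group."""
    ordered = sorted(failure_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def empirical_cdf(gaps: list[int], threshold_days: int) -> float:
    """Fraction of gaps that are at most threshold_days long."""
    return sum(gap <= threshold_days for gap in gaps) / len(gaps)


# Invented failure dates for two RAID groups, for illustration only.
groups = {
    "group_a": [date(2020, 1, 5), date(2020, 1, 5), date(2020, 3, 1)],
    "group_b": [date(2020, 2, 10), date(2020, 2, 17), date(2020, 2, 17)],
}
gaps = [gap for dates in groups.values() for gap in successive_gaps(dates)]
print(sorted(gaps))
print(empirical_cdf(gaps, threshold_days=0))
```

Evaluating `empirical_cdf` at a threshold of 0 days gives the same-day share; sweeping the threshold traces out the full curve.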
How do file systems detect and recover from errors?
- ext4
- BtrFS
- ...
Using prediction to improve reliability
- Prediction: model (NN, Random Forest...)
- Usage: improve scrubbing
- Standard (fixed) schedule -> dynamically adjusted by a factor X based on the prediction
Interesting point: the simple model (Random Forest) performs better than the NN. This also came up as Question 3, from Zekai Sun at HKU.
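A minimal sketch of prediction-driven scrubbing, under stated assumptions: the predictor outputs a failure probability per drive, and the scrub interval shrinks as that probability grows. The scaling rule, the factor name `x`, and the numbers are hypothetical and not taken from the talk.

```python
def scrub_interval_days(base_days: float, failure_prob: float, x: float = 4.0) -> float:
    """Shorten the fixed scrub interval for drives predicted to be at risk.

    base_days    -- the standard (fixed) interval
    failure_prob -- model output in [0, 1]; the model itself is out of scope
    x            -- hypothetical tuning factor: how strongly risk shortens the interval
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be within [0, 1]")
    return base_days / (1.0 + x * failure_prob)


for prob in (0.0, 0.25, 1.0):
    print(prob, scrub_interval_days(base_days=14.0, failure_prob=prob))
```

A drive with no predicted risk keeps the standard interval, so the policy degrades to fixed scrubbing when the model has nothing to say.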
DiLOS: Adding Performance to Paging-based Memory Disaggregation
Q&A
- What is the major difference between DiLOS and LegoOS, which also uses memory disaggregation?
Unikernel?
Coalescent Computing
Speaker: Kyle C. Hale (Illinois Institute of Technology)
Composable Infrastructure
Disaggregation at the Edge ->
Cyber Foraging
- In Cyber Foraging, user devices would "live off the land"
- Applications would be modified to partition into disjoint components, which are offloaded, sometimes using VMs
- Cloud offload, but with mobility
This might be a "chicken or the egg" problem: ...
Ref: J. Flinn, "Cyber Foraging: Fifteen Years Later", IEEE Pervasive Computing
What's Changed?
- ...
- Wireless latency continues to drop
Coalescent Computing
Disaggregated resources at the edge -> ephemeral single-system image at the edge
Principles and Characteristics
- Transparency: the user is not aware of coalescence
- Performance: offload should only occur when it improves performance
- Resilience: nodes come and go often
- Customizability: types of resources, when, at what cost...
- Privacy and Security: same problems in IaaS
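The performance principle can be read as a simple cost comparison: offload only when the remote run time plus the transfer cost beats local execution. The cost model and all parameter names below are simplifying assumptions for illustration, not the system described in the talk.

```python
def should_offload(local_secs: float, remote_secs: float,
                   payload_bytes: int, bandwidth_bytes_per_sec: float,
                   rtt_secs: float) -> bool:
    """Return True only if offloading is expected to finish sooner than running locally."""
    transfer_secs = payload_bytes / bandwidth_bytes_per_sec
    offload_secs = rtt_secs + transfer_secs + remote_secs
    return offload_secs < local_secs


# Small payload and a much faster remote node: offloading wins.
print(should_offload(2.0, 0.2, 1_000_000, 50e6, 0.005))
# Large payload over the same link: transfer cost dominates, so stay local.
print(should_offload(2.0, 0.2, 200_000_000, 50e6, 0.005))
```

Because nodes come and go, a real decision would also have to account for the chance that the remote node disappears mid-task, which this sketch ignores.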
Lessons Learned from Migrating Complex Stateful Applications onto Serverless Platforms
Gap and Motivation
- Manually migrated 4 microservice applications
- complex and stateful
Sanity Test - Scalability
Increase the load until saturated
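A load ramp of this kind can be sketched as follows. The measurement function is a simulated stand-in with a made-up capacity; in a real test it would drive a load generator against the deployed application. The step size and the 5% flattening threshold are arbitrary choices.

```python
def simulated_throughput(offered_rps: float, capacity_rps: float = 500.0) -> float:
    """Stand-in for a real measurement: throughput is capped at capacity_rps."""
    return min(offered_rps, capacity_rps)


def find_saturation(measure, start_rps: float = 100.0, step_rps: float = 100.0,
                    min_gain: float = 0.05, max_rps: float = 10_000.0) -> float:
    """Raise the offered load until throughput stops improving by min_gain (relative)."""
    load = start_rps
    previous = measure(load)
    while load + step_rps <= max_rps:
        load += step_rps
        current = measure(load)
        if current < previous * (1.0 + min_gain):
            return previous  # throughput has flattened: the system is saturated
        previous = current
    return previous


print(find_saturation(simulated_throughput))
```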
Conclusions
- An automatic tool is helpful if it can
- locate the call sites of the caller
- locate the inner handlers
Future: automatic tools, instead of manual effort, to adapt microservices to serverless
Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!