APSys 2021 Cloud Notes (Part 3)

Evaluating Reliability and Usage Characteristics of Flash-Based Storage in Production Systems

Speaker: Bianca Schroeder (Professor at University of Toronto, Canada)

With the rapid development of SSDs

Motivation

  • Storage landscape has changed
  • Stora

This talk focuses on another angle: storage (SSD) reliability.

This talk

  • Take a look at flash reliability in the wild.
  • How can we protect against flash failures?

Take a look at flash reliability

Issue 1: Annual replacement rate (ARR)

  • [FAST '20] Drives in NetApp enterprise systems.
  • The average ARR is 0.22%, but rates vary widely
  • much lower than HDDs
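As a rough illustration of how an annual replacement rate is usually computed (replacements divided by drive-years in the field), here is a minimal sketch. The function name and the fleet numbers are made up for illustration; they are not taken from the talk or the paper.

```python
def annual_replacement_rate(replacements: int, drive_days: float) -> float:
    """Return the ARR in percent: replacements per drive-year of operation."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * replacements / drive_years


# Hypothetical fleet: 10,000 drives observed for one year, 22 of them replaced.
arr = annual_replacement_rate(replacements=22, drive_days=10_000 * 365)
print(f"ARR = {arr:.2f}%")
```

Counting drive-days rather than drives keeps the rate comparable across drives that were deployed for different lengths of time.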

Issue 2: Rate of uncorrectable errors

  • All SSDs experience bit errors.
  • ECC corrects most of them, but some errors are uncorrectable
    • without data redundancy, the data is lost

Role of Type

SLC, MLC, eMLC and TLC

ARR differs by flash type; TLC shows the highest replacement rate.

Role of Age

  • Usage affects the reliability of SSDs, due to wear-out of their cells.
  • The hardware failure "bathtub curve" means SSDs have a higher failure rate in early life and again at wear-out.
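The bathtub shape can be sketched as the sum of three hazard rates: a decreasing one for early-life failures, a constant one for random failures, and an increasing one for wear-out. The Weibull parameters below are arbitrary assumptions chosen only to produce the shape; they are not fitted to any data from the talk.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / s) * (t / s) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years: float) -> float:
    """Illustrative bathtub curve; all parameters are hypothetical."""
    infant = weibull_hazard(t_years, shape=0.5, scale=2.0)   # decreasing with age
    random_failures = 0.02                                   # constant background rate
    wear_out = weibull_hazard(t_years, shape=5.0, scale=6.0) # increasing with age
    return infant + random_failures + wear_out


for age in (0.1, 1.0, 3.0, 6.0):
    print(f"age {age:>4} years: hazard {bathtub_hazard(age):.3f}")
```

With these parameters the hazard is high for very young drives, dips in mid-life, and rises again toward the end.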

Role of Firmware Version

  • Compare individual firmware versions within the same family

The firmware version has a tremendous impact on reliability (by a factor of 3-10x)!

Data protection in enterprise systems

RAID

  • Parity RAID: single parity tolerates one drive failure, double parity up to two

  • How common are double failures?
  • RAID group size
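To see why double failures matter for parity RAID, here is a minimal sketch of single-parity (XOR) protection. It is a toy model with hypothetical in-memory "drives", not how any particular storage system implements RAID.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equally sized blocks byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data drives plus one parity drive.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Drive 1 fails: XOR of the surviving data blocks and the parity rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

With one parity block, only one missing block per stripe can be rebuilt. If a second drive in the same group fails before the rebuild finishes, XOR alone cannot recover the data, which is why the timing of successive failures and the group size matter.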

Failure correlations within a RAID group

  • How frequently do double failures occur?
  • How quickly after the first failure?
    • From the CDF (time difference in days vs. cumulative probability), 46% of successive failures occur on the same day!
  • How are they related to RAID group size?
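The kind of analysis behind such a CDF can be sketched as follows: sort the failure dates within one RAID group, take the gaps between successive failures, and read off the fraction of gaps at or below a given number of days. The dates below are invented for illustration and the helper names are hypothetical.

```python
from datetime import date


def successive_failure_gaps(failure_dates):
    """Return the gaps in days between successive failures in one RAID group."""
    ordered = sorted(failure_dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def empirical_cdf(gaps, max_days):
    """Fraction of gaps that are at most max_days long."""
    if not gaps:
        raise ValueError("need at least one gap")
    return sum(gap <= max_days for gap in gaps) / len(gaps)


# Hypothetical failure log for a single RAID group.
failures = [date(2021, 1, 5), date(2021, 1, 5), date(2021, 1, 9), date(2021, 3, 1)]
gaps = successive_failure_gaps(failures)
print(gaps, empirical_cdf(gaps, 0))
```

Evaluating the CDF at zero days gives the share of successive failures that happened on the same day.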

How do file systems detect and recover from errors?

  • ext4
  • BtrFS
  • ...

Using prediction to improve reliability

  • Prediction: model (NN, Random Forest...)
  • Usage: improve scrubbing
    • Standard (fixed rate) -> dynamically adjust by a factor X based on the prediction
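One way to read the scrubbing idea is sketched below: keep a fixed scrub interval by default and shorten it by a factor X for drives that the model flags as risky. Interpreting "factor X" as a scrub-rate multiplier is an assumption, and the threshold, interval, and function name are all hypothetical.

```python
def scrub_interval_days(
    failure_probability: float,
    base_interval_days: float = 14.0,  # hypothetical fixed (standard) interval
    threshold: float = 0.5,            # hypothetical risk cut-off
    speedup_factor: float = 4.0,       # the "factor X" from the notes (assumed meaning)
) -> float:
    """Return how often to scrub a drive given its predicted failure probability."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be in [0, 1]")
    if failure_probability >= threshold:
        # Predicted-risky drive: scrub X times more often.
        return base_interval_days / speedup_factor
    # Otherwise keep the standard fixed schedule.
    return base_interval_days


print(scrub_interval_days(0.9), scrub_interval_days(0.1))
```

The prediction itself would come from a separately trained model (for example a Random Forest); only its output probability is used here.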

Interestingly, the simpler model (Random Forest) performs better than the NN. This also came up in Question 3, asked by Zekai Sun from HKU.

DiLOS: Adding Performance to Paging-based Memory Disaggregation

Q&A

  • What is the major difference between DiLOS and LegoOS, which also uses memory disaggregation?

Unikernel?

Coalescent Computing

Speaker: Kyle C. Hale (Illinois Institute of Technology)

Composable Infrastructure

Disaggregation at the Edge ->

Cyber Foraging

  • In Cyber Foraging, user devices would "live off the land"
  • Applications would be modified to partition them into disjoint components, which are offloaded, sometimes using VMs
  • Cloud offload, but with mobility

This might be a "chicken or the egg" problem: ...

Ref: J. Flinn, "Cyber Foraging: Fifteen Years Later", IEEE Pervasive Computing

What's Changed?

  • ...
  • Wireless latency continues to drop

Coalescent Computing

Disaggregated resources at the edge -> ephemeral single-system image at the edge

Principles and Characteristics

  • Transparency: the user is not aware of coalescence
  • Performance: offload should only occur when it improves performance
  • Resilience: nodes come and go often
  • Customizability: types of resources, when, at what cost...
  • Privacy and Security: same problems in IaaS
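The performance principle above can be illustrated with a simple cost comparison: offload only when the round trip, the data transfer, and the remote execution together are faster than running locally. This is a generic back-of-the-envelope model under stated assumptions, not the mechanism proposed in the talk; every parameter name is hypothetical.

```python
def should_offload(
    local_seconds: float,
    remote_seconds: float,
    payload_bytes: float,
    bandwidth_bytes_per_s: float,
    rtt_seconds: float,
) -> bool:
    """Return True if offloading is expected to finish sooner than local execution."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_seconds = payload_bytes / bandwidth_bytes_per_s
    return rtt_seconds + transfer_seconds + remote_seconds < local_seconds


# Heavy task: 2 s locally, 0.5 s on a nearby node, 1 MB to ship at 10 MB/s.
print(should_offload(2.0, 0.5, 1_000_000, 10_000_000, 0.02))
# Light task: the transfer overhead outweighs the faster remote node.
print(should_offload(0.3, 0.25, 1_000_000, 10_000_000, 0.02))
```

Lower wireless latency shrinks the `rtt_seconds` term, which makes offloading worthwhile for a wider range of tasks.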

Lessons Learned from Migrating Complex Stateful Applications onto Serverless Platforms

Gap and Motivation

  • Manually migrated 4 microservice applications
    • complex and stateful

Sanity Test - Scalability

Increase the load until saturated
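A load ramp of this kind can be sketched as a loop that raises the offered load step by step and stops once throughput no longer grows meaningfully. The step size, the 5% gain threshold, and the stand-in `fake_system` are assumptions for illustration; the talk's actual test harness is not described in these notes.

```python
def find_saturation(measure_throughput, start_rps=100, step_rps=100,
                    max_rps=10_000, min_gain=0.05):
    """Return the last offered load at which throughput still grew by min_gain."""
    load = start_rps
    previous = measure_throughput(load)
    while load + step_rps <= max_rps:
        next_load = load + step_rps
        current = measure_throughput(next_load)
        if current < previous * (1 + min_gain):
            # Throughput has flattened out: the system is saturated.
            return load
        load, previous = next_load, current
    return load


def fake_system(offered_rps):
    """Hypothetical system that cannot serve more than 450 requests per second."""
    return min(offered_rps, 450)


print(find_saturation(fake_system))
```

In a real experiment `measure_throughput` would drive a load generator against the deployed application and report completed requests per second.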

Conclusions

  • An automatic tool is helpful if it can
    • locate the call sites of the caller
    • locate the inner handlers
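As a small illustration of what locating call sites automatically can look like, the sketch below uses Python's standard `ast` module to list every call to a given function name in a piece of source code. It is only an example of the idea: the applications in the talk are not necessarily written in Python, and `get_user` and `handler` are hypothetical names.

```python
import ast


def find_call_sites(source: str, function_name: str):
    """Return sorted (line, column) pairs for every call to function_name."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain calls (f(x)) are ast.Name; method-style calls (obj.f(x)) are ast.Attribute.
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if name == function_name:
                sites.append((node.lineno, node.col_offset))
    return sorted(sites)


example = '''
def handler(req):
    user = get_user(req)
    return client.get_user(user.id)
'''

print(find_call_sites(example, "get_user"))
```

Matching by name alone is a simplification; a practical tool would also need to resolve which service or module each call actually targets.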

Future work: automatic tools instead of manually adapting microservices to serverless