SOSP21: Understanding and Detecting Software Upgrade Failures in Distributed Systems

The First Step

Title, Abstract, Introduction

Title

Understanding and Detecting Software Upgrade Failures in Distributed Systems

Abstract

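Upgrade is one of the most disruptive yet unavoidable maintenance tasks affecting the availability of distributed systems. An upgrade failure prolongs the service disruption, and the growing adoption of CI increases both the frequency and the burden of upgrades. No prior work has studied the characteristics of upgrade failures. The paper also presents a testing framework, DUPTester.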

Introduction

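In this paper, software-upgrade failures are defined as failures that occur only during a software upgrade. They can be caused, for example, by the interaction between two code versions of the same software, or between an upgrade operation and a regular software operation. They do not appear during regular operation.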
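To make the definition concrete for myself, here is a minimal sketch of a failure that shows up only across versions. Everything in it is hypothetical: the two on-disk formats, the key names `timeout` and `timeout_ms`, and the helper `survives` are invented for illustration and do not come from the paper or from any real system.

```python
import json

# Hypothetical on-disk formats, invented for illustration only.

def write_v1(record):
    """Version 1 stores the timeout in seconds under the key 'timeout'."""
    return json.dumps({"timeout": record["timeout_s"]})

def read_v1(blob):
    """Version 1 reads back the key it wrote."""
    return {"timeout_s": json.loads(blob)["timeout"]}

def write_v2(record):
    """Version 2 renames the key and switches to milliseconds."""
    return json.dumps({"timeout_ms": record["timeout_s"] * 1000})

def read_v2(blob):
    """Version 2 expects only its own key; it has no fallback for v1 data."""
    return {"timeout_s": json.loads(blob)["timeout_ms"] // 1000}

def survives(writer, reader, record):
    """Return True if data produced by `writer` is read back intact by `reader`."""
    try:
        return reader(writer(record)) == record
    except KeyError:
        return False

record = {"timeout_s": 30}
same_v1 = survives(write_v1, read_v1, record)   # regular operation on v1
same_v2 = survives(write_v2, read_v2, record)   # regular operation on v2
upgrade = survives(write_v1, read_v2, record)   # v2 starts on data left by v1
print(same_v1, same_v2, upgrade)
```

Each version works with its own data, so ordinary single-version testing would pass. The problem only appears when the new version reads what the old version left behind, which is the situation that arises during an upgrade.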

Slides

  • upgrade failures are problematic
    • large-scale
    • persistent impact (can't easily roll back)

Traditional approach: safe upgrades are slow

fast and safe upgrade

Focus areas

  • Symptoms of upgrade failures
  • Root-cause study
    • Incompatible cross-version interaction (63%)
    • Broken upgrade operations (33%)
    • Misconfiguration (3%)
    • Broken library dependencies (2%)
  • Triggering-condition study
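
The largest root-cause category in the list above is incompatible cross-version interaction. One plausible way this can happen, sketched under my own assumptions, is a rolling upgrade in which an already-upgraded node sends a message type that a not-yet-upgraded node does not know. The message types, `old_node_handle`, and `rolling_upgrade_check` below are hypothetical names, not taken from the paper or its tools.

```python
from enum import Enum

class OldMsgType(Enum):
    """Message types understood by the old version (hypothetical)."""
    HEARTBEAT = 1
    WRITE = 2

class NewMsgType(Enum):
    """Message types of the new version; COMPACT is newly added (hypothetical)."""
    HEARTBEAT = 1
    WRITE = 2
    COMPACT = 3

def old_node_handle(wire_type: int) -> str:
    """Simulate an old-version node decoding a message type from the wire."""
    try:
        msg = OldMsgType(wire_type)
    except ValueError:
        # The old code has no entry for this value.
        return "error: unknown message type"
    return f"handled {msg.name}"

def rolling_upgrade_check() -> dict:
    """Send every new-version message type to an old-version node."""
    return {msg.name: old_node_handle(msg.value) for msg in NewMsgType}

results = rolling_upgrade_check()
for name, outcome in results.items():
    print(name, "->", outcome)
```

In a cluster where every node runs the same version, the mismatch never surfaces. It needs the mixed-version window of an upgrade, which is why a check like this has to pair an old receiver with a new sender.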

Question

Hi, I'm Chuannan Zhang from USTC. Thanks for the talk. In the last two pages of your slides, you mentioned that the DUP tool chain catches not only upgrade failures but also downgrade failures. In your work, are there any differences between the two, or do they have the same causes, such as mismatched library versions and broken operations?


Unless otherwise stated, all posts on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.