Step 1

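Title, abstract, and introduction

Overview of the underlying theory

Conclusions

Answer the basic questions

Category

Content

Correctness

Contributions

Clarity

Decide whether to keep reading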

Step 2

Close-reading notes

Open questions

Unread references that are worth reading

Step 3

Reproduce the line of thought

Reproduce the proofs and reasoning

Reproduce the experimental validation

ReadingGroup Meeting

Motivation

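Figure

PC2: PyTorch/Caffe2

MX: MXNet

GV: Gradient Compression Enabled

Q: On EC2 and Azure, can instances be "deployed" in the same rack? It seems this topology cannot be requested. A: That is exactly what the paper proposes: an algorithm that probes the topology.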

Inefficiencies in Existing Approaches

Topology awareness

Design and Implementation

Idea #1: Two-Level Hierarchical Aggregation
- Hierarchical aggregation (HA) does not reduce the total amount of data sent over the wire, but it makes traffic more local and avoids slow links.
- Why two levels: the choice follows a rule of thumb.
  - Steps: copy the data into a buffer, split it into chunks, and elect a local master.
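To make the first bullet concrete, here is a minimal sketch of two-level aggregation compared with a flat reduce to a single root. It is not PLink's implementation: the names `flat_aggregate` and `two_level_aggregate`, the choice of the first member of each group as local master, and the message counting are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def vec_sum(vectors: List[Vector]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]


def flat_aggregate(grads: Dict[str, Vector],
                   groups: List[List[str]]) -> Tuple[Vector, int, int]:
    """Every node sends directly to one root (assumed: first node of group 0).

    Returns (sum, local_messages, cross_group_messages).
    """
    total = vec_sum([grads[node] for group in groups for node in group])
    local = len(groups[0]) - 1                     # senders in the root's group
    cross = sum(len(group) for group in groups[1:])  # senders in other groups
    return total, local, cross


def two_level_aggregate(grads: Dict[str, Vector],
                        groups: List[List[str]]) -> Tuple[Vector, int, int]:
    """Level 1: each group reduces to its local master (assumed: first member).
    Level 2: the other local masters send their partial sums to the root master.
    """
    partials = [vec_sum([grads[node] for node in group]) for group in groups]
    local = sum(len(group) - 1 for group in groups)  # member -> local master
    cross = len(groups) - 1                          # local master -> root
    return vec_sum(partials), local, cross


grads = {
    "a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0],
    "d": [7.0, 8.0], "e": [9.0, 10.0], "f": [11.0, 12.0],
}
groups = [["a", "b", "c"], ["d", "e", "f"]]  # e.g. two racks

print(flat_aggregate(grads, groups))       # ([36.0, 42.0], 2, 3)
print(two_level_aggregate(grads, groups))  # ([36.0, 42.0], 4, 1)
```

Both variants send five messages in total and produce the same sum, but the two-level version sends only one message across groups instead of three, which is the "more localized traffic" effect noted above.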
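Idea #2: Capturing Network Locality with ProbeEmbed (embedding?)
- Nodes are embedded into a Euclidean space (by minimizing an objective).
- Grouping nodes
  - \(k+\frac{n}{k}\) is minimized at \(k = \sqrt{n}\).
  - K-means is used for grouping.
  - The question I asked: the parameter \(a\) is tunable (the larger \(a\) is, the more nodes are "pushed apart").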
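The grouping step can be sketched as follows. This is only an illustration under stated assumptions: the notes do not say what \(k+\frac{n}{k}\) measures, so reading \(k\) as the number of groups and \(\frac{n}{k}\) as the nodes per group is a guess. The 2-D coordinates stand in for the output of the embedding step (the embedding itself is not implemented here), and `best_group_count` and `kmeans` are hypothetical helpers with a deterministic initialization chosen for reproducibility.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def best_group_count(n: int) -> int:
    """Integer k in [1, n] that minimizes k + n / k (close to sqrt(n))."""
    return min(range(1, n + 1), key=lambda k: k + n / k)


def kmeans(points: List[Point], k: int, iters: int = 20) -> List[int]:
    """Plain Lloyd's algorithm; returns the cluster index of each point.

    Assumption: centers are initialized from the first k points.
    """
    centers = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


print(best_group_count(16))  # 4

# Hypothetical embedded coordinates: two well-separated "racks".
coords = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0),
          (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
print(kmeans(coords, k=2))  # [0, 1, 0, 0, 1, 1]
```

Nodes that are close in the embedded space end up in the same group, and each group can then pick its local master for the aggregation in Idea #1.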
Idea #3: Reacting to Network Changes with Autotune
- Core idea: move load away from the bottleneck node, based on blame (a penalty coefficient).
- The penalty coefficient is built from:
  - \(t(i), l(i), B(i)\)
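A rough sketch of the load-shifting idea, under heavy assumptions: the notes do not record how \(t(i)\), \(l(i)\), and \(B(i)\) combine into a blame value, so blame is treated here as an opaque per-node score supplied by the caller. The `rebalance` function and its one-chunk-per-step policy are hypothetical and are not taken from the paper.

```python
from typing import Dict


def rebalance(shares: Dict[str, int],
              blame: Dict[str, float],
              step: int = 1) -> Dict[str, int]:
    """Move up to `step` chunks from the most-blamed node to the least-blamed one.

    `shares` maps node -> number of chunks it aggregates.
    `blame` maps node -> penalty score (higher means more of a bottleneck).
    """
    worst = max(blame, key=blame.get)
    best = min(blame, key=blame.get)
    new_shares = dict(shares)
    if worst != best:
        moved = min(step, shares[worst])  # never go below zero chunks
        new_shares[worst] -= moved
        new_shares[best] += moved
    return new_shares


shares = {"a": 4, "b": 4, "c": 4}
blame = {"a": 0.1, "b": 0.9, "c": 0.3}  # hypothetical scores; "b" is the bottleneck
print(rebalance(shares, blame))  # {'a': 5, 'b': 3, 'c': 4}
```

The total number of chunks is preserved; only their assignment changes, so repeated calls with fresh blame scores gradually drain work from a persistent bottleneck.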

Summary

PLink's work includes:
- Topology-aware
- Hierarchical aggregation
- Autotune
Limitations:
- Cannot get enough benefit from finetune
- The complexity of the topology-aware part is O(n)

Research Paper

Three-pass method Paper

Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting!


USTCReadingGroup: Cloud-Based-Distributed-Training