Flops Profiler
In this tutorial, we introduce the DeepSpeed Flops Profiler and provide examples of its usage.
Overview
Effective use of hardware resources is critical to good performance, but performance inefficiencies in existing implementations of large-scale model training and inference are often hard to spot and hard to attribute to specific module components. The DeepSpeed Flops Profiler helps users easily measure both the training/inference speed (latency, throughput) and efficiency (floating-point operations per second, i.e., FLOPS) of a model and its submodules, with an eye towards eliminating inefficiencies in existing implementations.
Below is an example output for BERT-Large (NVIDIA) on an A100 GPU with batch size 80:
-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)
world size: 1
data parallel size: 1
model parallel size: 1
batch size per GPU: 80
params per gpu: 336.23 M
params of model = params per GPU * mp_size: 336.23 M
fwd MACs per GPU: 3139.93 G
fwd flops per GPU: 6279.86 G
fwd flops of model = fwd flops per GPU * mp_size: 6279.86 G
fwd latency: 76.67 ms
bwd latency: 108.02 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 81.9 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 116.27 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 102.0 TFLOPS
step latency: 34.09 us
iter latency: 184.73 ms
samples/second: 433.07
----------------------------- Aggregated Profile per GPU -----------------------------
Top modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
params - {'BertForPreTrainingPreLN': '336.23 M'}
MACs - {'BertForPreTrainingPreLN': '3139.93 GMACs'}
fwd latency - {'BertForPreTrainingPreLN': '76.39 ms'}
depth 1:
params - {'BertModel': '335.15 M', 'BertPreTrainingHeads': '32.34 M'}
MACs - {'BertModel': '3092.96 GMACs', 'BertPreTrainingHeads': '46.97 GMACs'}
fwd latency - {'BertModel': '34.29 ms', 'BertPreTrainingHeads': '3.23 ms'}
depth 2:
params - {'BertEncoder': '302.31 M', 'BertLMPredictionHead': '32.34 M'}
MACs - {'BertEncoder': '3092.88 GMACs', 'BertLMPredictionHead': '46.97 GMACs'}
fwd latency - {'BertEncoder': '33.45 ms', 'BertLMPredictionHead': '2.61 ms'}
depth 3:
params - {'ModuleList': '302.31 M', 'Embedding': '31.79 M', 'Linear': '31.26 M'}
MACs - {'ModuleList': '3092.88 GMACs', 'Linear': '36.23 GMACs'}
fwd latency - {'ModuleList': '33.11 ms', 'BertPredictionHeadTransform': '1.83 ms'}
depth 4:
params - {'BertLayer': '302.31 M', 'LinearActivation': '1.05 M'}
MACs - {'BertLayer': '3092.88 GMACs', 'LinearActivation': '10.74 GMACs'}
fwd latency - {'BertLayer': '33.11 ms', 'LinearActivation': '1.43 ms'}
depth 5:
params - {'BertAttention': '100.76 M', 'BertIntermediate': '100.76 M'}
MACs - {'BertAttention': '1031.3 GMACs', 'BertIntermediate': '1030.79 GMACs'}
fwd latency - {'BertAttention': '19.83 ms', 'BertOutput': '4.38 ms'}
depth 6:
params - {'LinearActivation': '100.76 M', 'Linear': '100.69 M'}
MACs - {'LinearActivation': '1030.79 GMACs', 'Linear': '1030.79 GMACs'}
fwd latency - {'BertSelfAttention': '16.29 ms', 'LinearActivation': '3.48 ms'}
------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
BertForPreTrainingPreLN(
336.23 M, 100.00% Params, 3139.93 GMACs, 100.00% MACs, 76.39 ms, 100.00% latency, 82.21 TFLOPS,
(bert): BertModel(
335.15 M, 99.68% Params, 3092.96 GMACs, 98.50% MACs, 34.29 ms, 44.89% latency, 180.4 TFLOPS,
(embeddings): BertEmbeddings(...)
(encoder): BertEncoder(
302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.45 ms, 43.79% latency, 184.93 TFLOPS,
(FinalLayerNorm): FusedLayerNorm(...)
(layer): ModuleList(
302.31 M, 89.91% Params, 3092.88 GMACs, 98.50% MACs, 33.11 ms, 43.35% latency, 186.8 TFLOPS,
(0): BertLayer(
12.6 M, 3.75% Params, 128.87 GMACs, 4.10% MACs, 1.29 ms, 1.69% latency, 199.49 TFLOPS,
(attention): BertAttention(
4.2 M, 1.25% Params, 42.97 GMACs, 1.37% MACs, 833.75 us, 1.09% latency, 103.08 TFLOPS,
(self): BertSelfAttention(
3.15 M, 0.94% Params, 32.23 GMACs, 1.03% MACs, 699.04 us, 0.92% latency, 92.22 TFLOPS,
(query): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 182.39 us, 0.24% latency, 117.74 TFLOPS,...)
(key): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 57.22 us, 0.07% latency, 375.3 TFLOPS,...)
(value): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 53.17 us, 0.07% latency, 403.91 TFLOPS,...)
(dropout): Dropout(...)
(softmax): Softmax(...)
)
(output): BertSelfOutput(
1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 114.68 us, 0.15% latency, 187.26 TFLOPS,
(dense): Linear(1.05 M, 0.31% Params, 10.74 GMACs, 0.34% MACs, 64.13 us, 0.08% latency, 334.84 TFLOPS, ...)
(dropout): Dropout(...)
)
)
(PreAttentionLayerNorm): FusedLayerNorm(...)
(PostAttentionLayerNorm): FusedLayerNorm(...)
(intermediate): BertIntermediate(
4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 186.68 us, 0.24% latency, 460.14 TFLOPS,
(dense_act): LinearActivation(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 175.0 us, 0.23% latency, 490.86 TFLOPS,...)
)
(output): BertOutput(
4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 116.83 us, 0.15% latency, 735.28 TFLOPS,
(dense): Linear(4.2 M, 1.25% Params, 42.95 GMACs, 1.37% MACs, 65.57 us, 0.09% latency, 1310.14 TFLOPS,...)
(dropout): Dropout(...)
)
)
...
(23): BertLayer(...)
)
)
(pooler): BertPooler(...)
)
(cls): BertPreTrainingHeads(...)
)
------------------------------------------------------------------------------
In the summary profile, the DeepSpeed Flops Profiler outputs the number of parameters, floating-point operations (flops), FLOPS, latency, and throughput in samples/second of the model. This profile shows how much performance gap (compared to peak hardware performance) the current model execution has and helps users tune the training or inference setup (e.g., hyperparameters, data parallelism, model parallelism, system configurations, etc.) for better performance.
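As a quick sanity check, the throughput figures in the summary follow directly from the definitions in the Notations block. A small Python sketch re-deriving them from the BERT-Large numbers above:

fwd_flops = 6279.86e9      # fwd flops per GPU
fwd_latency = 76.67e-3     # fwd latency in seconds
bwd_latency = 108.02e-3    # bwd latency in seconds
step_latency = 34.09e-6    # weight-update latency in seconds
batch_size = 80            # batch size per GPU

fwd_flops_per_s = fwd_flops / fwd_latency                           # ~81.9 TFLOPS
bwd_flops_per_s = 2 * fwd_flops / bwd_latency                       # ~116.3 TFLOPS
fwd_bwd_flops_per_s = 3 * fwd_flops / (fwd_latency + bwd_latency)   # ~102.0 TFLOPS
iter_latency = fwd_latency + bwd_latency + step_latency             # ~184.7 ms
samples_per_second = batch_size / iter_latency                      # ~433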
The DeepSpeed Flops Profiler also measures significant modules at different model depths (aggregated profile) and module-specific profiles in the model architecture (detailed profile). Using these profiles, DeepSpeed users can understand how each layer or submodule contributes to the overall model complexity/performance, and then adjust or refactor the model design to improve performance. For example, using the profiler, DeepSpeed users can quantitatively tell whether stacking smaller layers is lighter or more performant than having bigger ones. The aggregated and detailed profiles also allow users to quickly identify bottleneck modules. In the BERT-Large example above, using the DeepSpeed Flops Profiler, we find that BertLayer is the most significant layer and contains quite a few dropout, softmax, and layer norm modules along with linear modules. These modules are not heavy in flops but trigger many GPU kernel invocations and create excessive read/write requests to memory. The pattern shown in the detailed profile suggests this is a perfect match for kernel fusion, and we developed fused transformer kernels to reduce data movement (see DeepSpeedBert). After applying our optimizations, we see a 25% improvement in FLOPS per GPU and in overall training samples/second in the DeepSpeed Flops Profiler output.
The flops profiler can be used with the DeepSpeed runtime without any user code change, or independently from DeepSpeed as a standalone package. When using DeepSpeed for model training, the profiler can be enabled in the DeepSpeed configuration file. As a standalone package, the profiler API can be used in both training and inference code. The DeepSpeed profiler is still under active development and includes just initial features. Stay connected for more exciting features to be added soon.
Flops Measurement
Similar to existing flops calculation tools or methods, the DeepSpeed Flops Profiler measures the flops of the forward pass of a module, and the flops of the backward pass is estimated as 2 times that of the forward pass. Different from the PyTorch profiler, which calculates the flops of PyTorch operators, the DeepSpeed Flops Profiler measures the flops within modules of a model and provides more insight to users about the model execution. The flops estimation is partly inspired by ptflops, with the major difference being that the DeepSpeed Flops Profiler not only supports flops computation directly at the module level, but can also capture torch.nn.functional calls invoked within a module to estimate the flops. Thus the DeepSpeed Flops Profiler allows for customized modules in the model, e.g., ParallelTransformerLayer, ParallelSelfAttention, RowParallelLinear, etc. in Megatron-LM. This is in contrast to ptflops, which requires users to write customized flops calculation functions for each customized module.
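To illustrate, the sketch below profiles a hypothetical custom module (TinyAttention, not part of any library) that does its work through torch.nn.functional calls and raw tensor ops rather than nn.Linear/nn.Softmax submodules; the profiler can still attribute measured flops to the module without a hand-written flops function:

import torch
import torch.nn.functional as F
from deepspeed.profiling.flops_profiler import get_model_profile

class TinyAttention(torch.nn.Module):
    # Hypothetical custom module: no nn.* submodules, only functional calls.
    def __init__(self, dim=64):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(dim, dim))

    def forward(self, x):
        q = F.linear(x, self.w)                 # torch.nn.functional call, captured by the profiler
        scores = torch.matmul(q, x.transpose(-1, -2))
        return torch.matmul(F.softmax(scores, dim=-1), x)

flops, macs, params = get_model_profile(TinyAttention(),
                                        input_shape=(1, 16, 64),
                                        print_profile=False)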
Multi-GPU, Multi-node, Data Parallelism, and Model Parallelism
The DeepSpeed Flops Profiler outputs the per-GPU profile as well as the world size, data parallel size, and model parallel size.
For models running on multiple GPUs or nodes, only a change in the model parallelism (e.g., --model-parallel-size in Megatron-LM) affects the profiled number of flops and parameters, i.e., model_parallel_size * flops = total_flops and model_parallel_size * parameters = total_parameters. The data parallel size or world size (related to the number of GPUs or nodes) does not affect the per-GPU profile.
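As a small worked example with made-up numbers (not profiler output):

mp_size = 4                    # e.g. --model-parallel-size 4 in Megatron-LM
flops_per_gpu = 20.0e12        # hypothetical fwd flops per GPU from the profile
params_per_gpu = 2.5e9         # hypothetical params per GPU from the profile

total_flops = mp_size * flops_per_gpu      # flops of the full model: 80e12
total_params = mp_size * params_per_gpu    # params of the full model: 10e9
# Changing the data parallel size / world size leaves the per-GPU numbers unchanged.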
Usage
The flops profiler can be used with the DeepSpeed runtime or as a standalone package. When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file without any user code changes. To use the flops profiler outside the DeepSpeed runtime, install DeepSpeed and import the flops_profiler package to use the APIs directly. Examples of each usage are given below.
Usage With the DeepSpeed Runtime
When using DeepSpeed for model training, the profiler can be configured in the deepspeed configuration file. No explicit API calls are needed to use the profiler. The profiler can be enabled by adding the following field to DeepSpeed's configuration file. Refer to flops profiler for details.
{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
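No profiler-specific calls appear in the training code. As a minimal sketch (assuming the JSON above is saved as ds_config.json together with your usual training settings such as the optimizer and batch size), the profile is produced automatically at the configured profile_step once the engine is created as usual:

import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model for illustration

# The engine reads ds_config.json, including the "flops_profiler" section above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
# Train as usual; the profile is printed automatically at "profile_step".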
Example: Megatron-LM
For information on running Megatron-LM with DeepSpeed, please refer to our tutorial Megatron-LM.
An example output for a 12-layer Megatron-LM model (hidden_size = 8192, num_attention_heads = 32, batch_size = 1024, seq_length = 1024) is shown below.
-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 10:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)
world size: 1
data parallel size: 1
model parallel size: 1
batch size per GPU: 1024
params per gpu: 1.29 M
params of model = params per GPU * mp_size: 1.29 M
fwd MACs per GPU: 41271.95 G
fwd flops per GPU: 82543.9 G
fwd flops of model = fwd flops per GPU * mp_size: 82543.9 G
fwd latency: 1.89 s
bwd latency: 5.38 s
fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 43.68 TFLOPS
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency: 30.7 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency): 34.07 TFLOPS
step latency: 34.12 s
iter latency: 41.39 s
samples/second: 24.74
----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
params - {'GPT2Model': '1.29 M'}
MACs - {'GPT2Model': '41271.95 GMACs'}
fwd latency - {'GPT2Model': '1.84 s'}
depth 1:
params - {'TransformerLanguageModel': '1.29 M'}
MACs - {'TransformerLanguageModel': '39584.03 GMACs'}
fwd latency - {'TransformerLanguageModel': '1.83 s'}
depth 2:
params - {'ParallelTransformer': '1.29 M'}
MACs - {'ParallelTransformer': '39584.03 GMACs'}
fwd latency - {'ParallelTransformer': '1.81 s'}
depth 3:
params - {'ModuleList': '1.28 M'}
MACs - {'ModuleList': '39584.03 GMACs'}
fwd latency - {'ModuleList': '1.3 s'}
depth 4:
params - {'ParallelTransformerLayerPart2': '688.15 k'}
MACs - {'ParallelTransformerLayerPart2': '26388.28 GMACs'}
fwd latency - {'ParallelTransformerLayerPart2': '865.73 ms'}
depth 5:
params - {'ParallelMLP': '491.54 k'}
MACs - {'ParallelMLP': '26388.28 GMACs'}
fwd latency - {'ParallelMLP': '849.4 ms'}
------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs(or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
GPT2Model(
1.29 M, 100.00% Params, 41271.95 GMACs, 100.00% MACs, 1.84 s, 100.00% latency, 44.78 TFLOPS,
(language_model): TransformerLanguageModel(
1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.83 s, 99.11% latency, 43.34 TFLOPS,
(embedding): Embedding(
2, 0.00% Params, 0 MACs, 0.00% MACs, 18.1 ms, 0.98% latency, 0.0 FLOPS,
(word_embeddings): VocabParallelEmbedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 164.75 us, 0.01% latency, 0.0 FLOPS, )
(position_embeddings): Embedding(1, 0.00% Params, 0 MACs, 0.00% MACs, 489.23 us, 0.03% latency, 0.0 FLOPS, 1024, 8192)
(embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 93.94 us, 0.01% latency, 0.0 FLOPS, p=0.1, inplace=False)
)
(transformer): ParallelTransformer(
1.29 M, 100.00% Params, 39584.03 GMACs, 95.91% MACs, 1.81 s, 98.11% latency, 43.78 TFLOPS,
(layers): ModuleList(
1.28 M, 98.73% Params, 39584.03 GMACs, 95.91% MACs, 1.3 s, 70.66% latency, 60.79 TFLOPS,
(0): ParallelTransformerLayerPart1(
49.15 k, 3.80% Params, 1099.65 GMACs, 2.66% MACs, 23.5 ms, 1.27% latency, 93.6 TFLOPS,
(input_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 128.75 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
(attention): ParallelSelfAttention(
32.77 k, 2.53% Params, 1099.65 GMACs, 2.66% MACs, 22.8 ms, 1.24% latency, 96.46 TFLOPS,
(query_key_value): ColumnParallelLinear(24.58 k, 1.90% Params, 824.63 GMACs, 2.00% MACs, 8.93 ms, 0.48% latency, 184.7 TFLOPS, )
(scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 134.22 MMACs, 0.00% MACs, 151.16 us, 0.01% latency, 1.78 TFLOPS, )
(attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.63 us, 0.00% latency, 0.0 FLOPS, p=0.1, inplace=False)
(dense): RowParallelLinear(8.19 k, 0.63% Params, 274.88 GMACs, 0.67% MACs, 2.67 ms, 0.14% latency, 205.81 TFLOPS, )
)
)
(1): ParallelTransformerLayerPart2(
57.35 k, 4.43% Params, 2199.02 GMACs, 5.33% MACs, 77.53 ms, 4.21% latency, 56.73 TFLOPS,
(post_attention_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 116.11 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
(mlp): ParallelMLP(
40.96 k, 3.16% Params, 2199.02 GMACs, 5.33% MACs, 76.19 ms, 4.13% latency, 57.72 TFLOPS,
(dense_h_to_4h): ColumnParallelLinear(32.77 k, 2.53% Params, 1099.51 GMACs, 2.66% MACs, 10.79 ms, 0.59% latency, 203.81 TFLOPS, )
(dense_4h_to_h): RowParallelLinear(8.19 k, 0.63% Params, 1099.51 GMACs, 2.66% MACs, 14.38 ms, 0.78% latency, 152.95 TFLOPS, )
)
)
...
(23): ParallelTransformerLayerPart2(...)
)
(final_layernorm): FusedLayerNorm(16.38 k, 1.27% Params, 0 MACs, 0.00% MACs, 110.86 us, 0.01% latency, 0.0 FLOPS, torch.Size([8192]), eps=1e-05, elementwise_affine=True)
)
)
)
------------------------------------------------------------------------------
Usage Outside the DeepSpeed Runtime
The profiler can be used as a standalone package outside of the DeepSpeed runtime. Simply install DeepSpeed and import the flops_profiler package to use the APIs directly. Refer to the installation of DeepSpeed for installing DeepSpeed.
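For reference, both profiler entry points used in the examples below come from the same package:

from deepspeed.profiling.flops_profiler import get_model_profile, FlopsProfiler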
In Model Inference
To profile a trained model in inference, use the get_model_profile function. Examples are given below.
Example: AlexNet
The following example shows how to profile AlexNet with the DeepSpeed flops profiler.
import torchvision.models as models
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator

with get_accelerator().device(0):
    model = models.alexnet()
    batch_size = 256
    flops, macs, params = get_model_profile(model=model, # model
                                            input_shape=(batch_size, 3, 224, 224), # input shape to the model. If specified, the model takes a tensor with this shape as the only positional argument.
                                            args=None, # list of positional arguments to the model.
                                            kwargs=None, # dictionary of keyword arguments to the model.
                                            print_profile=True, # prints the model graph with the measured profile attached to each module
                                            detailed=True, # print the detailed profile
                                            module_depth=-1, # depth into the nested modules, with -1 being the inner most modules
                                            top_modules=1, # the number of top modules to print aggregated profile
                                            warm_up=10, # the number of warm-ups before measuring the time of each module
                                            as_string=True, # print raw numbers (e.g. 1000) or as human-readable strings (e.g. 1k)
                                            output_file=None, # path to the output file. If None, the profiler prints to stdout.
                                            ignore_modules=None) # the list of modules to ignore in the profiling
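The three return values are the same totals shown in the printed profile; with as_string=True they come back as human-readable strings (raw numbers otherwise), so they can be logged directly, for example:

print(f"flops:  {flops}")   # total forward flops for the profiled batch
print(f"macs:   {macs}")    # total multiply-accumulate operations
print(f"params: {params}")  # total number of parameters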
Example: Bert
import torch
from transformers import BertForSequenceClassification, BertTokenizer
from deepspeed.profiling.flops_profiler import get_model_profile
from deepspeed.accelerator import get_accelerator


def bert_input_constructor(batch_size, seq_len, tokenizer):
    fake_seq = ""
    for _ in range(seq_len - 2):  # ignore the two special tokens [CLS] and [SEP]
        fake_seq += tokenizer.pad_token
    inputs = tokenizer([fake_seq] * batch_size,
                       padding=True,
                       truncation=True,
                       return_tensors="pt")
    labels = torch.tensor([1] * batch_size)
    inputs = dict(inputs)
    inputs.update({"labels": labels})
    return inputs


with get_accelerator().device(0):
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
    batch_size = 4
    seq_len = 128
    enable_profile = True
    if enable_profile:
        flops, macs, params = get_model_profile(
            model,
            kwargs=bert_input_constructor(batch_size, seq_len, tokenizer),
            print_profile=True,
            detailed=True,
        )
    else:
        inputs = bert_input_constructor(batch_size, seq_len, tokenizer)
        outputs = model(**inputs)
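Unlike the AlexNet example, the inputs here are supplied through kwargs rather than input_shape, because BertForSequenceClassification expects keyword arguments (input_ids, attention_mask, labels, etc.) rather than a single positional tensor; bert_input_constructor builds a batch of padded fake sequences purely so the profiler has something representative to run.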
In Model Training Workflow
To profile the model forward pass in a training workflow, use the FlopsProfiler class. The FlopsProfiler class provides the following methods:
- start_profile() - starts profiling
- get_total_flops(as_string=False) - returns the total number of floating-point operations in the model
- get_total_macs(as_string=False) - returns the total number of MACs in the model
- get_total_params(as_string=False) - returns the total number of parameters in the model
- print_model_profile(profile_step=1, module_depth=-1, top_modules=3, detailed=True, output_file=None) - prints the model profile
- stop_profile() - stops profiling. This stops the flops counting in the model.
- end_profile() - cleans up. This removes the profiling attributes added to the model during profiling. It should be invoked at the end of profiling and after get_total_flops, get_total_params, or print_model_profile.
Example Training Workflow
Below is an example of this usage in a typical training workflow.
from deepspeed.profiling.flops_profiler import FlopsProfiler

model = Model()
prof = FlopsProfiler(model)

profile_step = 5
print_profile = True

for step, batch in enumerate(data_loader):
    # start profiling at training step "profile_step"
    if step == profile_step:
        prof.start_profile()

    # forward() method
    loss = model(batch)

    # end profiling and print output
    if step == profile_step:  # if using multi nodes, check global_rank == 0 as well
        prof.stop_profile()
        flops = prof.get_total_flops()
        macs = prof.get_total_macs()
        params = prof.get_total_params()
        if print_profile:
            prof.print_model_profile(profile_step=profile_step)
        prof.end_profile()

    # runs backpropagation
    loss.backward()

    # weight update
    optimizer.step()