uPIMulator: An Open-Source Cycle-Level UPMEM Instruction Simulator
Introduction
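uPIMulator is a cycle-level hardware simulator for the UPMEM DPU architecture, developed by Bongjoon Hyun et al. at KAIST. It was presented at HPCA 2024 on March 4, 2024, where it received the Best Paper Award. Project: github.com/VIA-Researc… Paper: arxiv.org/pdf/2308.00…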
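About uPIMulator
uPIMulator combines the LLVM-based compilation toolchain from the UPMEM SDK with a purpose-built cycle-level hardware performance simulator. The simulator bundled with the UPMEM SDK is tied to the structure of the real hardware; uPIMulator removes that constraint and makes it possible to explore extensions to the hardware architecture.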
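Software compilation toolchain
The open-source UPMEM SDK compiler (dpu-upmem-dpurte-clang) follows the same workflow as a standard C compiler. It takes the programmer's source code together with a glibc-style C library adapted to UPMEM-PIM (for example, mem_alloc, which allocates from DPU WRAM, replaces malloc), preprocesses, compiles, and assembles them into binary objects, and finally links these into a single UPMEM-PIM executable.
uPIMulator reuses the UPMEM SDK preprocessor and compiler to lower the DPU-side source code and the DPU-side glibc-style C library to assembly. The assembly is then passed to uPIMulator's own linker (the Linker in the paper and source code), which combines the two, performs lexical analysis, parsing, and liveness analysis, and produces the fully linked binary program, dumped separately for MRAM, WRAM, and IRAM.
The assembler (the Assembler in the paper and source code) generates random test data for the selected benchmark and dumps it according to where it will be stored in the DPU.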
Hardware performance simulator
Default parameters
DPU processor architecture | |
---|---|
Operating frequency | 350 MHz |
Number of pipeline stages | 14 |
Revolver scheduling cycles | 11 |
WRAM / IRAM size | 64 KB / 24 KB |
WRAM / IRAM access latency | 1 cycle |
WRAM / IRAM access granularity | 4 / 6 B per clock |
WRAM / IRAM access bandwidth | 1,400 / 2,100 MB/s |
Atomic memory size | 256 bits |

DRAM system | |
---|---|
MRAM size | 64 MB |
DDR specification | DDR4-2400 |
Memory scheduling policy | FR-FCFS |
Row buffer size | 1 KB |
tRCD, tRAS, tRP, tCL, tBL | 16, 39, 16, 16, 4 cycles |

Communication | |
---|---|
CPU→DPU bandwidth (per rank) | 0.296 GB/s per DPU |
CPU←DPU bandwidth (per rank) | 0.063 GB/s per DPU |

Software architecture | |
---|---|
Number of general-purpose registers | 24 |
Maximum number of threads | 24 |
Stack size (per thread) | 2 KB |
Heap size | 4 KB |
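As a quick sanity check on the table above, the WRAM and IRAM bandwidth figures appear to follow from the access granularity multiplied by the operating frequency. The small sketch below redoes that arithmetic; the variable names are illustrative and the relation itself is an inference from the listed numbers, not something stated by the authors.

```shell
#!/bin/bash
# Assumed relation: bandwidth (MB/s) = access granularity (B per clock) x frequency (MHz).
# All values are copied from the default-parameter table; names are illustrative.
FREQ_MHZ=350
WRAM_BYTES_PER_CLOCK=4
IRAM_BYTES_PER_CLOCK=6

wram_bw=$(( WRAM_BYTES_PER_CLOCK * FREQ_MHZ ))
iram_bw=$(( IRAM_BYTES_PER_CLOCK * FREQ_MHZ ))

echo "WRAM: ${wram_bw} MB/s"   # prints "WRAM: 1400 MB/s"
echo "IRAM: ${iram_bw} MB/s"   # prints "IRAM: 2100 MB/s"
```

Both results match the 1,400 / 2,100 MB/s entries in the table.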
Test results
#!/bin/bash
# Paths to the uPIMulator root directory and its binary output directory
ROOT_DIRPATH="/home/asong/桌面/uPIMulator/golang/uPIMulator"
BIN_DIRPATH="${ROOT_DIRPATH}/bin"
# Benchmark name and other parameters
VERBOSE=0
BENCHMARK="VA"
NUM_CHANNELS=1
NUM_RANKS_PER_CHANNEL=1
NUM_DPUS_PER_RANK=1
NUM_TASKLETS=1
DATA_PREP_PARAMS=2048
# Remove any existing bin directory and recreate it empty
rm -rf "${BIN_DIRPATH}"
mkdir -p "${BIN_DIRPATH}"
# Run uPIMulator (paths are quoted because they contain non-ASCII characters)
./build/uPIMulator --verbose "$VERBOSE" \
--root_dirpath "$ROOT_DIRPATH" \
--bin_dirpath "$BIN_DIRPATH" \
--benchmark "$BENCHMARK" \
--num_channels "$NUM_CHANNELS" \
--num_ranks_per_channel "$NUM_RANKS_PER_CHANNEL" \
--num_dpus_per_rank "$NUM_DPUS_PER_RANK" \
--num_tasklets "$NUM_TASKLETS" \
--data_prep_params "$DATA_PREP_PARAMS"
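With the shell script above, you can change the number of channels, ranks, DPUs, and tasklets per DPU, or the size of the data set, to analyze DPU runtime behavior and locate bottlenecks in execution. The output of a single test is shown below: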
NUM_CHANNELS=1
NUM_RANKS_PER_CHANNEL=1
NUM_DPUS_PER_RANK=1
NUM_TASKLETS=1
DATA_PREP_PARAMS=524288
ThreadScheduler[0_0_0]_breakdown_etc: 37233714
ThreadScheduler[0_0_0]_breakdown_run: 3723370
ThreadScheduler[0_0_0]_breakdown_dma: 4489211
Logic[0_0_0]_num_instructions: 3723370
Logic[0_0_0]_active_tasklets_0: 4556810
Logic[0_0_0]_active_tasklets_1: 40889485
Logic[0_0_0]_logic_cycle: 45446295
CycleRule[0_0_0]_cycle_rule: 20497
MemoryController[0_0_0]_memory_cycle: 272677770
MemoryScheduler[0_0_0]_num_fcfs: 743424
MemoryScheduler[0_0_0]_num_fr: 32768
RowBuffer[0_0_0]_num_activations: 10240
RowBuffer[0_0_0]_num_precharges: 10239
RowBuffer[0_0_0]_num_writes: 262144
RowBuffer[0_0_0]_write_bytes: 2097152
RowBuffer[0_0_0]_num_reads: 524288
RowBuffer[0_0_0]_read_bytes: 4194304
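The parameter sweep mentioned before this sample output could be scripted roughly as follows. This is only a sketch: the `run_one` helper, the `DRY_RUN` switch, and the `tasklets_N.log` file names are illustrative, and it assumes the simulator writes its statistics to standard output. The command-line flags are the ones used in the script earlier in this post.

```shell
#!/bin/bash
# Sketch of a tasklet-count sweep for the VA benchmark.
# DRY_RUN, run_one, and the log file names are illustrative assumptions.
ROOT_DIRPATH="${ROOT_DIRPATH:-$PWD}"
BIN_DIRPATH="${BIN_DIRPATH:-$ROOT_DIRPATH/bin}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run_one() {
  local tasklets="$1"
  local cmd=(./build/uPIMulator --verbose 0
    --root_dirpath "$ROOT_DIRPATH" --bin_dirpath "$BIN_DIRPATH"
    --benchmark VA --num_channels 1 --num_ranks_per_channel 1
    --num_dpus_per_rank 1 --num_tasklets "$tasklets"
    --data_prep_params 2048)
  if [ "$DRY_RUN" = "1" ]; then
    echo "${cmd[*]}"
  else
    # Same cleanup as the original script, then capture stdout per run
    # (assumes the statistics are printed to stdout).
    rm -rf "$BIN_DIRPATH" && mkdir -p "$BIN_DIRPATH"
    "${cmd[@]}" > "tasklets_${tasklets}.log"
  fi
}

# The default-parameter table lists a maximum of 24 threads per DPU.
for t in 1 2 4 8 16; do
  run_one "$t"
done
```

Running it with `DRY_RUN=0` from the directory that contains `build/` would leave one log per tasklet count, which can then be compared counter by counter.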
IPC (instructions per cycle)
Formula: IPC = (value of num_instructions) / (value of logic_cycle)
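To make the IPC formula concrete, here is a minimal sketch that extracts the two counters from the simulator output and divides them. The `compute_ipc` helper is hypothetical and assumes the `Name[c_r_d]_counter: value` line format of the sample output, with a single DPU (for several DPUs the last matching line would win).

```shell
#!/bin/bash
# Hypothetical helper: compute IPC from uPIMulator statistics read on stdin.
# Assumes one DPU and the "<Component>[c_r_d]_<counter>: <value>" line format.
compute_ipc() {
  awk -F': ' '
    /_num_instructions:/ { instr = $2 }
    /_logic_cycle:/      { cycles = $2 }
    END {
      if (cycles > 0) printf "%.4f\n", instr / cycles
      else { print "logic_cycle missing or zero" > "/dev/stderr"; exit 1 }
    }'
}

# Counters copied from the sample output in this post.
compute_ipc <<'EOF'
Logic[0_0_0]_num_instructions: 3723370
Logic[0_0_0]_logic_cycle: 45446295
EOF
# prints 0.0819
```

For the sample run, this gives an IPC of about 0.08 with a single tasklet.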
Breakdown of DPU’s runtime
Formulas:
- Issuable ratio = (value of breakdown_run) / (value of logic_cycle)
- Idle (Memory) ratio = (value of breakdown_dma) / (value of logic_cycle)
- Idle (Revolver) ratio = (value of breakdown_etc) / (value of logic_cycle)
- Idle (RF) ratio = (value of backpressure) / (value of logic_cycle)
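The four ratios above can be computed from the simulator output in one pass. The sketch below is a hypothetical helper, again assuming single-DPU output in the format of the sample run. The sample output in this post contains no backpressure counter, so the `_backpressure:` pattern is a guess at its name and the value is treated as 0 when absent.

```shell
#!/bin/bash
# Hypothetical helper: print the runtime-breakdown ratios from uPIMulator
# statistics read on stdin (single-DPU output assumed).
runtime_breakdown() {
  awk -F': ' '
    /_breakdown_run:/ { run = $2 }
    /_breakdown_dma:/ { dma = $2 }
    /_breakdown_etc:/ { etc = $2 }
    /_backpressure:/  { rf  = $2 }   # assumed counter name; 0 if not present
    /_logic_cycle:/   { cycles = $2 }
    END {
      if (cycles <= 0) { print "logic_cycle missing or zero" > "/dev/stderr"; exit 1 }
      printf "issuable=%.4f\n",      run / cycles
      printf "idle_memory=%.4f\n",   dma / cycles
      printf "idle_revolver=%.4f\n", etc / cycles
      printf "idle_rf=%.4f\n",       rf  / cycles
    }'
}

# Counters copied from the sample output in this post.
runtime_breakdown <<'EOF'
ThreadScheduler[0_0_0]_breakdown_etc: 37233714
ThreadScheduler[0_0_0]_breakdown_run: 3723370
ThreadScheduler[0_0_0]_breakdown_dma: 4489211
Logic[0_0_0]_logic_cycle: 45446295
EOF
```

For the sample run this comes out to roughly 8.2% issuable, 9.9% idle on memory, and 81.9% idle on the revolver pipeline. The three counters sum exactly to logic_cycle (37233714 + 3723370 + 4489211 = 45446295), which fits a single-tasklet run being dominated by revolver scheduling stalls.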
Reposted from: https://juejin.cn/post/7360903734853517364