uPIMulator: An Open-Source, Cycle-Level UPMEM Instruction Simulator

Site admin

April 25, 2024, 19:49

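Introduction

uPIMulator is a cycle-level hardware simulator for the UPMEM DPU architecture, developed by Bongjoon Hyun and colleagues at KAIST. It was presented at HPCA 2024 on March 4, 2024, where it received the Best Paper Award. Project: github.com/VIA-Researc… Paper: arxiv.org/pdf/2308.00…

About uPIMulator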

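uPIMulator combines the LLVM-based compilation toolchain from the UPMEM SDK with a purpose-built, cycle-level hardware performance simulator. The simulator bundled with the UPMEM SDK is tied to the structure of the real hardware; uPIMulator removes that restriction, so extensions to the hardware architecture can be explored.

Software compilation toolchain

The open-source UPMEM SDK compiler (dpu-upmem-dpurte-clang) follows the same workflow as a standard C compiler. It takes the programmer's source code together with a glibc-style C library adapted to UPMEM-PIM (for example, mem_alloc, which allocates from DPU WRAM, replaces malloc), then preprocesses, compiles, and assembles them into binary objects, and finally links these into a single UPMEM-PIM executable.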

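uPIMulator uses the UPMEM SDK's preprocessor and compiler to compile the DPU-side source code and the DPU-side glibc-style C library down to assembly. Both are then fed into uPIMulator's custom linker (the Linker in the paper and source code), which combines them, performs lexical analysis, parsing, and liveness analysis, and produces the fully linked binary program, dumped separately for MRAM, WRAM, and IRAM.

The assembler (the Assembler in the paper and source code) randomly generates test data for the selected benchmark and dumps it according to where it is stored in the DPU.

Hardware performance simulator

Default parameters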

DPU processor architecture
Operating frequency	350 MHz
Number of pipeline stages	14
Revolver scheduling cycles	11
WRAM / IRAM size	64 KB / 24 KB
WRAM / IRAM access latency	1 cycle
WRAM / IRAM access granularity	4 / 6 B per clock
WRAM / IRAM access bandwidth	1,400 / 2,100 MB/sec
Atomic memory size	256 Bits

DRAM system
MRAM size	64 MB
DDR specification	DDR4-2400
Memory scheduling policy	FR-FCFS
Row buffer size	1 KB
tRCD, tRAS, tRP, tCL, tBL	16, 39, 16, 16, 4 cycles

Communication
CPU→DPU bandwidth (per rank)	0.296 GB/s per DPU
CPU←DPU bandwidth (per rank)	0.063 GB/s per DPU

Software architecture
Number of general-purpose registers	24
Maximum number of threads	24
Stack size (per thread)	2 KB
Heap size	4 KB
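
As a quick sanity check on the table, the WRAM / IRAM access bandwidth row appears to follow directly from the access granularity and the operating frequency (bytes per clock times clocks per second). The sketch below reproduces that arithmetic; the helper name peak_bw_mb_s is made up for this illustration and is not part of uPIMulator.

```shell
#!/bin/bash
# Peak bandwidth in MB/s = bytes per clock * clock frequency in MHz.
# peak_bw_mb_s is a hypothetical helper name used only for this check.
peak_bw_mb_s() {
  bytes_per_clock=$1
  freq_mhz=$2
  echo $(( bytes_per_clock * freq_mhz ))
}

echo "WRAM: $(peak_bw_mb_s 4 350) MB/s"   # table lists 1,400 MB/sec
echo "IRAM: $(peak_bw_mb_s 6 350) MB/s"   # table lists 2,100 MB/sec
```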

Test results

#!/bin/bash

# Paths to the uPIMulator root directory and its binary directory
ROOT_DIRPATH="/home/asong/桌面/uPIMulator/golang/uPIMulator"
BIN_DIRPATH="/home/asong/桌面/uPIMulator/golang/uPIMulator/bin"

# Benchmark name and other parameters
VERBOSE=0
BENCHMARK="VA"
NUM_CHANNELS=1
NUM_RANKS_PER_CHANNEL=1
NUM_DPUS_PER_RANK=1
NUM_TASKLETS=1
DATA_PREP_PARAMS=2048

# Remove any existing bin directory and create a fresh, empty one
rm -rf "${BIN_DIRPATH}"
mkdir "${BIN_DIRPATH}"

# Run uPIMulator (no whitespace may follow a trailing backslash)
./build/uPIMulator --verbose "$VERBOSE" \
                   --root_dirpath "$ROOT_DIRPATH" \
                   --bin_dirpath "$BIN_DIRPATH" \
                   --benchmark "$BENCHMARK" \
                   --num_channels "$NUM_CHANNELS" \
                   --num_ranks_per_channel "$NUM_RANKS_PER_CHANNEL" \
                   --num_dpus_per_rank "$NUM_DPUS_PER_RANK" \
                   --num_tasklets "$NUM_TASKLETS" \
                   --data_prep_params "$DATA_PREP_PARAMS"

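With the shell script above, you can change the number of channels, ranks, DPUs, and tasklets per DPU, or the size of the data set, to analyze the DPU's runtime behavior and locate bottlenecks in execution. The output of a single run is shown below.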

NUM_CHANNELS=1
NUM_RANKS_PER_CHANNEL=1
NUM_DPUS_PER_RANK=1
NUM_TASKLETS=1
DATA_PREP_PARAMS=524288

ThreadScheduler[0_0_0]_breakdown_etc: 37233714
ThreadScheduler[0_0_0]_breakdown_run: 3723370
ThreadScheduler[0_0_0]_breakdown_dma: 4489211
Logic[0_0_0]_num_instructions: 3723370
Logic[0_0_0]_active_tasklets_0: 4556810
Logic[0_0_0]_active_tasklets_1: 40889485
Logic[0_0_0]_logic_cycle: 45446295
CycleRule[0_0_0]_cycle_rule: 20497
MemoryController[0_0_0]_memory_cycle: 272677770
MemoryScheduler[0_0_0]_num_fcfs: 743424
MemoryScheduler[0_0_0]_num_fr: 32768
RowBuffer[0_0_0]_num_activations: 10240
RowBuffer[0_0_0]_num_precharges: 10239
RowBuffer[0_0_0]_num_writes: 262144
RowBuffer[0_0_0]_write_bytes: 2097152
RowBuffer[0_0_0]_num_reads: 524288
RowBuffer[0_0_0]_read_bytes: 4194304
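
To compare several configurations rather than a single run, the launch command can be wrapped in a loop. The following is a minimal sketch that varies only the tasklet count. It reuses the flags from the script above; the placeholder root path, the DRY_RUN switch, the run_one helper, the per-run bin directories, and the list of tasklet counts are all assumptions made for this illustration, not uPIMulator features.

```shell
#!/bin/bash
# Sketch: sweep the tasklet count while holding the other parameters fixed.
# ROOT_DIRPATH is a placeholder path; DRY_RUN, run_one, the per-run bin
# directories, and the list of counts are illustrative assumptions.
ROOT_DIRPATH="${ROOT_DIRPATH:-/path/to/uPIMulator/golang/uPIMulator}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print each command, 0 = execute it

run_one() {
  tasklets=$1
  bin_dirpath="${ROOT_DIRPATH}/bin_t${tasklets}"
  # Same flags as the script above; only --num_tasklets and the bin path vary.
  set -- ./build/uPIMulator --verbose 0 \
         --root_dirpath "$ROOT_DIRPATH" \
         --bin_dirpath "$bin_dirpath" \
         --benchmark VA \
         --num_channels 1 \
         --num_ranks_per_channel 1 \
         --num_dpus_per_rank 1 \
         --num_tasklets "$tasklets" \
         --data_prep_params 2048
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    # Start each run from a fresh, empty bin directory.
    rm -rf "$bin_dirpath" && mkdir "$bin_dirpath" && "$@"
  fi
}

for t in 1 2 4 8 16; do
  run_one "$t"
done
```

By default the loop only prints the five commands, which makes it easy to review them before setting DRY_RUN=0. The counts stay within the maximum of 24 threads listed in the default parameters.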

IPC (instructions per cycle)

Formula: IPC = (value of num_instructions) / (value of logic_cycle)
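
Applied to the sample run, the formula can be evaluated as in the sketch below. The shell only does integer arithmetic, so awk performs the division; the function name ipc is chosen for this illustration.

```shell
#!/bin/bash
# IPC = num_instructions / logic_cycle.
# ipc is a hypothetical helper; awk is used because the shell itself
# cannot do floating-point division.
ipc() {
  awk -v n="$1" -v c="$2" 'BEGIN { printf "%.4f\n", n / c }'
}

# num_instructions and logic_cycle copied from the sample output
ipc 3723370 45446295   # prints 0.0819
```

An IPC of about 0.082 with one tasklet would be consistent with the 11 revolver scheduling cycles in the default parameters, if that spacing caps a single tasklet at roughly 1/11 ≈ 0.091 instructions per cycle.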

Breakdown of DPU’s runtime

Formulas:

Issuable ratio = (value of breakdown_run) / (value of logic_cycle)
Idle (Memory) ratio = (value of breakdown_dma) / (value of logic_cycle)
Idle (Revolver) ratio = (value of breakdown_etc) / (value of logic_cycle)
Idle (RF) ratio = (value of backpressure) / (value of logic_cycle)
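
The first three ratios can be derived from the statistics with a short awk filter. The sketch below assumes every line has the form "Name[ids]_key: value", as in the sample output, and that the input covers a single DPU (with several DPUs the last matching line would win). The function name breakdown is hypothetical. The sample output contains no backpressure counter, so the Idle (RF) ratio is left out.

```shell
#!/bin/bash
# Sketch: compute the runtime breakdown from statistics read on stdin.
# Assumes "Name[ids]_key: value" lines for one DPU; breakdown is a
# hypothetical helper name, not a uPIMulator tool.
breakdown() {
  awk -F': ' '
    /_breakdown_run:/ { run = $2 }
    /_breakdown_dma:/ { dma = $2 }
    /_breakdown_etc:/ { etc = $2 }
    /_logic_cycle:/   { cyc = $2 }
    END {
      if (cyc == 0) exit 1          # no logic_cycle line was found
      printf "issuable %.4f\n",      run / cyc
      printf "idle_memory %.4f\n",   dma / cyc
      printf "idle_revolver %.4f\n", etc / cyc
    }'
}

# Values copied from the single-DPU sample run.
breakdown <<'EOF'
ThreadScheduler[0_0_0]_breakdown_etc: 37233714
ThreadScheduler[0_0_0]_breakdown_run: 3723370
ThreadScheduler[0_0_0]_breakdown_dma: 4489211
Logic[0_0_0]_logic_cycle: 45446295
EOF
```

For the sample run this prints 0.0819, 0.0988, and 0.8193. The three counters add up to logic_cycle exactly, so the ratios sum to 1, and most cycles of this one-tasklet run fall into the Idle (Revolver) category.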

Reposted from: https://juejin.cn/post/7360903734853517364