Some time ago, benchmark results for the Baikal-S processor were published, so I decided to compare its performance with the Chinese Kunpeng 920 processor (920-4826), which I had gained access to a while earlier.
Specifications of the processors being compared:
| | Baikal-S | Kunpeng 920 |
|---|---|---|
| Architecture | aarch64 | aarch64 |
| ISA | ARMv8.2-A | ARMv8.2-A |
| Microarchitecture | Cortex-A75 | TaiShan v110 |
| Clock (MHz) | 2000 | 2600 |
| Cores / threads | 48 / 48 | 48 / 48 |
| Process node (nm) | 16 | 7 |
| TDP (W) | 100 | 150 |
| Memory type | DDR4-3200 | DDR4-2933 |
| Socket | FCLGA-3467 | BGA |
| Memory channels | 6 | 8 |
| Max memory (GB) | 768 | 1024 |
| GFLOPS (DP) | 384 | 500 |
| GFLOPS (SP) | 768 | 2000 |
| Year | 2021 | 2019 |
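The DP peak figures in the table can be reproduced as cores × clock × FLOPs per cycle per core. The sketch below assumes 4 double-precision FLOPs per cycle per core. That value is inferred from the table, because it reproduces both DP figures; it is not taken from vendor documentation. The same numbers give a rough HPL efficiency (measured GFLOPS divided by peak), using the 48-process HPL results from the logs further down: 294.62 GFLOPS on Baikal-S and 194.06 GFLOPS on Kunpeng 920.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


# Assumption: 4 double-precision FLOPs per cycle per core,
# inferred from the DP column of the table above (not vendor data).
DP_FLOPS_PER_CYCLE = 4

baikal_peak = peak_gflops(48, 2.0, DP_FLOPS_PER_CYCLE)   # 384 GFLOPS
kunpeng_peak = peak_gflops(48, 2.6, DP_FLOPS_PER_CYCLE)  # about 499.2 GFLOPS

# HPL efficiency = measured GFLOPS / theoretical peak (48 MPI processes in both logs)
baikal_eff = 294.62 / baikal_peak
kunpeng_eff = 194.06 / kunpeng_peak

print(f"Baikal-S:    peak {baikal_peak:.1f} GFLOPS, HPL efficiency {baikal_eff:.1%}")
print(f"Kunpeng 920: peak {kunpeng_peak:.1f} GFLOPS, HPL efficiency {kunpeng_eff:.1%}")
```

Under this assumption, Baikal-S reaches roughly 77% of peak in HPL and Kunpeng 920 roughly 39%. The two HPL runs used very different problem sizes (N = 98304 vs N = 40800), so this comparison is only indicative.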
The following benchmarks were run:
- 7-Zip built-in benchmark
- Dhrystone, Whetstone
- CoreMark
- SciMark 2
- HPL
- STREAM
- Stockfish
- Blender
- Geekbench 5
Let's go straight to the results; details of each test follow further below.
Results
| Test | Unit | Baikal-S, 1 thread | Baikal-S, 48 threads | Kunpeng 920, 1 thread | Kunpeng 920, 96 (48) threads |
|---|---|---|---|---|---|
| CoreMark | iterations/s | 13 540 | 647 991 | 18 398 | 1 291 450 (874 104) |
| Dhrystone | DMIPS | 13 410 | | 20 830 | |
| Whetstone | MWIPS | 4 702 | 225 078 | 3 857 | 362 558 |
| HPL 2.3 | GFLOPS | 7 | 294 | | 327 (194) |
| 7z | MIPS: Total; Compress; Decompress | 2 378; 2 400; 2 356 | 90 039; 71 542; 108 536 | 3 202; 3 214; 3 191 | 194 573; 150 105; 239 042 (119 705; 101 272; 138 139) |
| STREAM | MB/s: Copy; Scale; Add; Triad | | 81 787; 78 037; 76 156; 75 113 | | 93 271; 75 846; 78 009; 76 593 |
| Blender | seconds | | 16.96 | | 8.2 (13.56) |
| Stockfish | nodes/s | | 43 097 684 | | 77 469 854 |
| Geekbench 5 [2] | points | 405 | 13 549 | 599 | 28 112 (18 817) |

For Kunpeng 920, values in parentheses were obtained with 48 threads (the test machine reports 96 hardware threads). For Blender, lower is better.
A few words about the Baikal-S and Kunpeng 920 architectures
Baikal-S
Baikal-S is a processor based on the 64-bit ARM RISC architecture (ARMv8, AArch64). It runs at 2 GHz (2.5 GHz is possible) and has 48 cores implementing the Cortex-A75 microarchitecture.
Features of the Baikal-S cores, based on the Cortex-A75 microarchitecture:
- 64-bit ARMv8.2 architecture
- FP/SIMD extensions: VFPv4 and NEON
- Out-of-order execution
- Branch prediction
- Virtualization support
- Pipeline of 11 to 13 stages
- 8 micro-op execution ports:
  - 2 load/store
  - 2 integer ALUs (addition, shifts)
  - 1 branch unit
  - 1 ALU for multiplication and division
  - 2 SIMD/FPU units
- 3-wide instruction decoder (6 micro-ops per cycle)
- Caches:
  - L1 instruction cache: 64 KB (4-way set-associative, 64-byte lines)
  - L1 data cache: 64 KB (4-way set-associative, 64-byte lines)
  - L2: 512 KB per core (4 cores per cluster), 24 MB in total
  - L3: 24 MB (2 MB per cluster)
  - L4: 32 MB
Kunpeng 920
Kunpeng 920 is a processor developed by HiSilicon (which designs chips for Huawei). It is based on a heavily modified Arm Cortex-A72 microarchitecture (TaiShan V110). Models with 32, 48 and 64 cores exist (3226, 4826 and 6426).
Features of the TaiShan V110 cores (48-core model):
- 64-bit ARMv8.2 architecture
- FP/SIMD extensions: VFPv4 and NEON
- Out-of-order execution
- Improved branch predictor
- Virtualization support
- Pipeline of 11 to 13 stages
- 8 micro-op execution ports:
  - 2 load/store
  - 2 integer ALUs (addition, shifts)
  - 1 branch unit
  - 1 ALU for multiplication and division
  - 2 SIMD/FPU units
- 4-wide instruction decoder (3-wide in Cortex-A72)
- Caches:
  - L1 instruction cache: 64 KB (4-way set-associative, 64-byte lines)
  - L1 data cache: 64 KB (4-way set-associative, 64-byte lines)
  - L2: 512 KB per core (4 cores per cluster), 24 MB in total
  - L3: 48 MB (4 MB per cluster)
A closer look at the results
CoreMark
A modern benchmark intended as a replacement for Dhrystone, written in C. Its workload covers list processing (search and sorting), matrix operations, a state machine and CRC computation. It was designed to run on anything from microcontrollers to high-end processors.
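CoreMark's score is the number of completed iterations divided by the elapsed seconds. The sketch below recomputes it from the Baikal-S and Kunpeng 920 logs that follow. It also adds two derived numbers that are not part of CoreMark's own output: the score per MHz and the multithreaded scaling efficiency. The runs were built with different compilers (GCC 10.2.1 on Baikal-S, GCC 9.3.0 for the Kunpeng 920 single-threaded run), so the per-clock comparison is approximate.

```python
def iterations_per_sec(iterations: int, total_secs: float) -> float:
    """CoreMark score: completed iterations divided by elapsed seconds."""
    return iterations / total_secs


def per_mhz(score: float, mhz: int) -> float:
    """Normalize a single-threaded score by clock frequency (not a CoreMark output)."""
    return score / mhz


def scaling_efficiency(multi: float, single: float, threads: int) -> float:
    """1.0 means perfectly linear scaling with the thread count."""
    return multi / (single * threads)


# Iterations and total time taken from the logs below
baikal_1t = iterations_per_sec(200_000, 14.771)      # Baikal-S, 1 thread
baikal_48t = iterations_per_sec(9_600_000, 14.815)   # Baikal-S, 48 pthreads
kunpeng_1t = iterations_per_sec(300_000, 16.306)     # Kunpeng 920, 1 thread

print(f"Baikal-S:    {baikal_1t:.2f} it/s, {per_mhz(baikal_1t, 2000):.2f} per MHz")
print(f"Kunpeng 920: {kunpeng_1t:.2f} it/s, {per_mhz(kunpeng_1t, 2600):.2f} per MHz")
print(f"Baikal-S 48-thread scaling: {scaling_efficiency(baikal_48t, baikal_1t, 48):.3f}")
```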
Baikal-S single-threaded CoreMark output
./coremark.exe 2K performance run parameters for coremark. CoreMark Size : 666 Total ticks : 14771 Total time (secs): 14.771000 Iterations/Sec : 13540.044682 Iterations : 200000 Compiler version : GCC10.2.1 20210110 Compiler flags : -static -Ofast -funroll-all-loops -EL -falign-functions=16 -DPERFORMANCE_RUN=1 -lrt Memory location : Please put data memory location here (e.g. code in flash. data on heap etc) seedcrc : 0xe9f5 [0]crclist : 0xe714 [0]crcmatrix : 0x1fd7 [0]crcstate : 0x8e3a [0]crcfinal : 0x4983 Correct operation validated. See readme.txt for run and reporting rules. CoreMark 1.0 : 13540.044682 / GCC10.2.1 20210110 -static -Ofast -funroll-all-loops -EL -falign-functions=16 -DPERFORMANCE_RUN=1 -lrt / Heap
Baikal-S multithreaded CoreMark output
./coremarkMT.exe 2K performance run parameters for coremark. CoreMark Size : 666 Total ticks : 14815 Total time (secs): 14.815000 Iterations/Sec : 647991.900101 Iterations : 9600000 Compiler version : GCC10.2.1 20210110 Compiler flags : -static -Ofast -funroll-all-loops -EL -falign-functions=16 -DUSE_PHTREAD -DMULTITHREAD=48 -DPERFORMANCE_RUN=1 -lrt Parallel PThreads : 48 Memory location : Please put data memory location here (e.g. code in flash. data on heap etc) seedcrc : 0xe9f5 [0]crclist : 0xe714 [1]crclist : 0xe714 ..... [46]crclist : 0xe714 [47]crclist : 0xe714 [0]crcmatrix : 0x1fd7 [1]crcmatrix : 0x1fd7 ..... [46]crcmatrix : 0x1fd7 [47]crcmatrix : 0x1fd7 [0]crcstate : 0x8e3a [1]crcstate : 0x8e3a [2]crcstate : 0x8e3a ..... [46]crcstate : 0x8e3a [47]crcstate : 0x8e3a [0]crcfinal : 0x4983 [1]crcfinal : 0x4983 [2]crcfinal : 0x4983 ..... [46]crcfinal : 0x4983 [47]crcfinal : 0x4983 Correct operation validated. See readme.txt for run and reporting rules. CoreMark 1.0 : 647991.900101 / GCC10.2.1 20210110 -static -Ofast -funroll-all-loops -EL -falign-functions=16 -DUSE_PHTREAD -DMULTITHREAD=48 -DPERFORMANCE_RUN=1 -lrt / Heap / 48:PThreads
Kunpeng 920 single-threaded CoreMark output
2K performance run parameters for coremark. CoreMark Size : 666 Total ticks : 16306 Total time (secs): 16.306000 Iterations/Sec : 18398.135656 Iterations : 300000 Compiler version : GCC9.3.0 Compiler flags : -Ofast -march=native -DPERFORMANCE_RUN=1 -DUSE_FORK=1 -lrt Memory location : Please put data memory location here (e.g. code in flash, data on heap etc) seedcrc : 0xe9f5 [0]crclist : 0xe714 [0]crcmatrix : 0x1fd7 [0]crcstate : 0x8e3a [0]crcfinal : 0xcc42 Correct operation validated. See README.md for run and reporting rules. CoreMark 1.0 : 18398.135656 / GCC9.3.0 -Ofast -march=native -DPERFORMANCE_RUN=1 -DUSE_FORK=1 -lrt / Heap
Kunpeng 920 multithreaded CoreMark output
2K performance run parameters for coremark. CoreMark Size : 666 Total ticks : 14867 Total time (secs): 14.867000 Iterations/Sec : 1291450.864330 Iterations : 19200000 Compiler version : GCCUbuntu Clang 12.0.0 Compiler flags : -Ofast -march=armv8.2-a -DPERFORMANCE_RUN=1 -DUSE_FORK=1 -lrt Parallel Fork : 96 Memory location : Please put data memory location here (e.g. code in flash, data on heap etc) seedcrc : 0xe9f5 [0]crclist : 0xe714 [95]crcfinal : 0x4983 Correct operation validated. See README.md for run and reporting rules. CoreMark 1.0 : 1291450.864330 / GCCUbuntu Clang 12.0.0 -Ofast -march=armv8.2-a -DPERFORMANCE_RUN=1 -DUSE_FORK=1 -lrt / Heap / 96:Fork
Dhrystone
Dhrystone is a rather old benchmark from the 1980s; the version used here is written in C. It tests integer arithmetic and string handling. Results are reported in Dhrystones per second and in DMIPS (DMIPS = Dhrystones per second divided by 1757).
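The DMIPS conversion can be checked against the logs below. The constant 1757 is the Dhrystones-per-second score of the VAX 11/780, which is taken as 1 MIPS. The DMIPS/MHz figure is added here only for a per-clock comparison; it does not appear in the original logs.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # reference machine, defined as 1 MIPS


def dmips(dhrystones_per_sec: float) -> float:
    """Convert Dhrystones per second into DMIPS."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dhrystones_per_sec: float, mhz: int) -> float:
    """DMIPS normalized by clock frequency."""
    return dmips(dhrystones_per_sec) / mhz


# Dhrystones per second and clock (MHz) from the logs and the spec table
runs = [("Baikal-S", 23_561_181, 2000), ("Kunpeng 920", 36_598_414, 2600)]
for name, dps, mhz in runs:
    print(f"{name}: {dmips(dps):.2f} DMIPS, {dmips_per_mhz(dps, mhz):.2f} DMIPS/MHz")
```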
Baikal-S Dhrystone output
##################################################### Dhrystone Benchmark 2.1 arm armv8.1-a optimized, Thu Feb 4 19:41:16 2010 Nanoseconds one Dhrystone run: 42.44 Dhrystones per Second: 23561181 VAX MIPS rating = 13409.89 Numeric results were correct
Kunpeng 920 Dhrystone output
Dhrystone Benchmark, Version 2.1 (Language: C or C++) Optimisation aarch64 armv8.2-a optimized Register option not selected 10000 runs 0.00 seconds 100000 runs 0.00 seconds 1000000 runs 0.03 seconds 10000000 runs 0.28 seconds 20000000 runs 0.55 seconds 40000000 runs 1.10 seconds 80000000 runs 2.19 seconds Final values (* implementation-dependent): Int_Glob: O.K. 5 Bool_Glob: O.K. 1 Ch_1_Glob: O.K. A Ch_2_Glob: O.K. B Arr_1_Glob[8]: O.K. 7 Arr_2_Glob8/7: O.K. 80000010 Ptr_Glob-> Ptr_Comp: * 0xc0cf2a0 Discr: O.K. 0 Enum_Comp: O.K. 2 Int_Comp: O.K. 17 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING Next_Ptr_Glob-> Ptr_Comp: * 0xc0cf2a0 same as above Discr: O.K. 0 Enum_Comp: O.K. 1 Int_Comp: O.K. 18 Str_Comp: O.K. DHRYSTONE PROGRAM, SOME STRING Int_1_Loc: O.K. 5 Int_2_Loc: O.K. 13 Int_3_Loc: O.K. 7 Enum_Loc: O.K. 1 Str_1_Loc: O.K. DHRYSTONE PROGRAM, 1'ST STRING Str_2_Loc: O.K. DHRYSTONE PROGRAM, 2'ND STRING Nanoseconds one Dhrystone run: 27.32 Dhrystones per Second: 36598414 VAX MIPS rating = 20830.06
Whetstone
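Tests floating-point arithmetic, math functions (sin, cos, exp, sqrt and so on), conditional branches, function calls, assignments and fixed-point arithmetic. Results are reported in MWIPS (millions of Whetstone instructions per second). According to the logs, the Baikal-S figures come from the double-precision Whetstone in BYTE UNIX Benchmarks. The Kunpeng 920 single-threaded figure comes from a single-precision C Whetstone build.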
Baikal-S single-threaded Whetstone output
======================================================================= BYTE UNIX Benchmarks (Version 5.1.3) System: dbs: GNU/Linux OS: GNU/Linux -- 5.4.197-baikal-arm64 -- #72 SMP PREEMPT Mon Sep 06 20:59:55 MSK 2022 Machine: aarch64 (unknown) Language: en_US.utf8 (charmap="UTF-8". collate="UTF-8") 17:40:34 up 38 min. 2 users. load average: 8.44. 17.34. 9.11; runlevel 5 ------------------------------------------------------------------------ Benchmark Run: Sep 09 2022 17:40:34 - 17:45:20 0 CPUs in system; running 1 parallel copy of tests Dhrystone 2 using register variables 23561181.9 lps (10.0 s. 7 samples) Double-Precision Whetstone 4702.4 MWIPS (9.9 s. 7 samples)
Baikal-S multithreaded Whetstone output
BYTE UNIX Benchmarks (Version 5.1.3) System: dbs: GNU/Linux OS: GNU/Linux -- 5.4.197-baikal-arm64 -- #72 SMP PREEMPT Mon Jun 21 20:59:55 MSK 2022 Machine: aarch64 (unknown) Language: en_US.utf8 (charmap="UTF-8". collate="UTF-8") 17:34:20 up 32 min. 2 users. load average: 0.01. 0.01. 0.00; runlevel 5 ------------------------------------------------------------------------ Benchmark Run: Sep 09 2022 17:34:20 - 17:39:07 0 CPUs in system; running 48 parallel copies of tests Dhrystone 2 using register variables 1127397973.8 lps (10.0 s . 7 samples) Double-Precision Whetstone 225078.7 MWIPS (9.9 s. 7 samples)
Kunpeng 920 single-threaded Whetstone output
########################################## Single Precision C Whetstone Benchmark aarch64 armv8-a optimized, Tue Jan 18 14:59:09 2022 Calibrate 0.01 Seconds 1 Passes (x 100) 0.02 Seconds 5 Passes (x 100) 0.06 Seconds 25 Passes (x 100) 0.32 Seconds 125 Passes (x 100) 1.62 Seconds 625 Passes (x 100) 8.11 Seconds 3125 Passes (x 100) Use 3854 passes (x 100) Single Precision C/C++ Whetstone Benchmark Loop content Result MFLOPS MOPS Seconds N1 floating point -1.12475013732910156 831.272 0.089 N2 floating point -1.12274742126464844 804.396 0.644 N3 if then else 1.00000000000000000 3895.706 0.102 N4 fixed point 12.00000000000000000 62234.039 0.020 N5 sin,cos etc. 0.49911010265350342 115.110 2.786 N6 floating point 0.99999982118606567 519.520 4.001 N7 assignments 3.00000000000000000 2597.914 0.274 N8 exp,sqrt etc. 0.75110864639282227 69.101 2.075 MWIPS 3857.517 9.991
Kunpeng 920 multithreaded Whetstone output
MWIPS 362558 Based on time for last thread to finish
HPL
HPL is a portable high-performance benchmark used to rank supercomputers. It solves a dense system of linear equations using the BLAS and MPI libraries, and reports results in GFLOPS.
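HPL's Gflops column is derived from the problem size N and the wall time. The sketch below uses the operation count 2/3·N³ + 3/2·N², which as far as I know is the count HPL's test driver uses. It reproduces both results from the logs below. It also estimates the memory taken by the N×N double-precision matrix, since available RAM is what limits the choice of N. `matrix_gib` is a helper introduced here for illustration.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Rate as printed by HPL: (2/3 * N^3 + 3/2 * N^2) operations / wall time."""
    ops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return ops / seconds / 1e9


def matrix_gib(n: int) -> float:
    """Hypothetical helper: memory for an N x N matrix of 8-byte doubles, in GiB."""
    return n * n * 8 / 2**30


# N and wall time (seconds) taken from the HPL logs below
runs = [("Baikal-S", 98_304, 2149.68), ("Kunpeng 920", 40_800, 233.34)]
for name, n, secs in runs:
    print(f"{name}: N={n}, {hpl_gflops(n, secs):.2f} GFLOPS, matrix {matrix_gib(n):.1f} GiB")
```

The Baikal-S run therefore used a 72 GiB matrix, while the Kunpeng 920 run used only about 12.4 GiB. This is one reason the two results are not directly comparable.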
Baikal-S HPL output
mpirun.openmpi -np 48 --allow-run-as-root ./xhpl ============================================================================= HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018 Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK Modified by Julien Langou, University of Colorado Denver ============================================================================= An explanation of the input/output parameters follows: T/V : Wall time / encoded variant. N : The order of the coefficient matrix A. NB : The partitioning blocking factor. P : The number of process rows. Q : The number of process columns. Time : Time in seconds to solve the linear system. Gflops : Rate of execution for solving the linear system. The following parameter values will be used: N : 98304 NB : 96 PMAP : Row-major process mapping P : 6 Q : 8 PFACT : Left NBMIN : 4 NDIV : 2 RFACT : Crout BCAST : 1ring DEPTH : 0 SWAP : Mix (threshold = 64) L1 : transposed form U : transposed form EQUIL : yes ALIGN : 8 double precision words ----------------------------------------------------------------------------- - The matrix A is randomly generated for each test. - The following scaled residual check will be computed: ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N ) - The relative machine precision (eps) is taken to be 1.110223e-16 - Computational tests pass if scaled residuals are less than 16.0 ============================================================================= T/V N NB P Q Time Gflops ----------------------------------------------------------------------------- WR00C2L4 98304 96 6 8 2149.68 2.9462e+02 HPL_pdgesv() start time Wed Aug 17 18:56:05 2022 HPL_pdgesv() end time Wed Aug 17 19:31:55 2022 ----------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.19811356e-01 ...... 
PASSED =============================================================================
Kunpeng 920 HPL output
$ mpiexec -n 48 ./xhpl HPL ERROR from process # 0, on line 419 of function HPL_pdinfo: >>> Need at least 96 processes for these tests <<< HPL ERROR from process # 0, on line 621 of function HPL_pdinfo: >>> Illegal input in file HPL.dat. Exiting ... <<< user@iot:~/hpl/bin/linux$ nano HPL.dat user@iot:~/hpl/bin/linux$ mpiexec -n 48 ./xhpl ================================================================================ HPLinpack 2.3 -- High-Performance Linpack benchmark -- December 2, 2018 Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK Modified by Julien Langou, University of Colorado Denver ================================================================================ An explanation of the input/output parameters follows: T/V : Wall time / encoded variant. N : The order of the coefficient matrix A. NB : The partitioning blocking factor. P : The number of process rows. Q : The number of process columns. Time : Time in seconds to solve the linear system. Gflops : Rate of execution for solving the linear system. The following parameter values will be used: N : 18000 24000 36000 40800 NB : 240 PMAP : Row-major process mapping P : 1 Q : 48 PFACT : Left NBMIN : 4 NDIV : 4 RFACT : Left BCAST : 1ring DEPTH : 1 SWAP : Mix (threshold = 64) L1 : transposed form U : transposed form EQUIL : yes ALIGN : 16 double precision words -------------------------------------------------------------------------------- - The matrix A is randomly generated for each test. 
- The following scaled residual check will be computed: ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N ) - The relative machine precision (eps) is taken to be 1.110223e-16 - Computational tests pass if scaled residuals are less than 16.0 ================================================================================ T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR10L4L4 18000 240 1 48 29.67 1.3106e+02 HPL_pdgesv() start time Wed Jan 26 22:02:33 2022 HPL_pdgesv() end time Wed Jan 26 22:03:02 2022 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.42218782e-03 ...... PASSED ================================================================================ T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR10L4L4 24000 240 1 48 55.00 1.6757e+02 HPL_pdgesv() start time Wed Jan 26 22:03:06 2022 HPL_pdgesv() end time Wed Jan 26 22:04:01 2022 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 1.13864820e-03 ...... PASSED ================================================================================ T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR10L4L4 36000 240 1 48 164.06 1.8960e+02 HPL_pdgesv() start time Wed Jan 26 22:04:09 2022 HPL_pdgesv() end time Wed Jan 26 22:06:54 2022 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 9.12220668e-04 ...... 
PASSED ================================================================================ T/V N NB P Q Time Gflops -------------------------------------------------------------------------------- WR10L4L4 40800 240 1 48 233.34 1.9406e+02 HPL_pdgesv() start time Wed Jan 26 22:07:05 2022 HPL_pdgesv() end time Wed Jan 26 22:10:58 2022 -------------------------------------------------------------------------------- ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 9.95497593e-04 ...... PASSED ================================================================================ Finished 4 tests with the following results: 4 tests completed and passed residual checks, 0 tests completed and failed residual checks, 0 tests skipped because of illegal input values. -------------------------------------------------------------------------------- End of Tests. ================================================================================
7zip
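The archiver's built-in benchmark. It scales almost linearly with the number of threads: going from 1 to 48 threads raises the total rating about 38 times on Baikal-S and about 37 times on Kunpeng 920. Normalized to clock frequency (2.0 vs 2.6 GHz), the single-threaded results of the two processors are roughly equal.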
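In all five runs in the logs below, the Tot rating equals the mean of the average compression and decompression ratings, rounded down. This is an observation from these logs, not a statement about 7-Zip's source code. The sketch also divides the single-threaded rating by the clock, which shows how close the two cores are per GHz. `per_ghz` is a helper introduced here for illustration.

```python
def total_rating(compress: int, decompress: int) -> int:
    """Tot rating as observed in the logs: integer mean of the two average ratings."""
    return (compress + decompress) // 2


def per_ghz(rating: float, ghz: float) -> float:
    """Hypothetical helper: rating normalized by clock frequency in GHz."""
    return rating / ghz


# (compress, decompress) average ratings from the single-threaded runs (-mmt1)
baikal_1t = total_rating(2400, 2356)
kunpeng_1t = total_rating(3214, 3191)

print(f"Baikal-S:    Tot {baikal_1t}, {per_ghz(baikal_1t, 2.0):.0f} MIPS per GHz")
print(f"Kunpeng 920: Tot {kunpeng_1t}, {per_ghz(kunpeng_1t, 2.6):.0f} MIPS per GHz")
```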
Baikal-S 7z output
root@dbs:~# 7za b 7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,48 CPUs LE) LE CPU Freq: 64000000 - - - - - - - - RAM size: 96133 MB, # CPU hardware threads: 48 RAM usage: 10590 MB, # Benchmark threads: 48 Compressing | Decompressing Dict Speed Usage R/U Rating | Speed Usage R/U Rating KiB/s % MIPS MIPS | KiB/s % MIPS MIPS 22: 75097 4088 1787 73055 | 1295732 4780 2312 110496 23: 68987 3987 1763 70290 | 1263902 4784 2286 109363 24: 70152 4293 1757 75428 | 1244122 4783 2283 109200 25: 59027 4190 1610 67395 | 1180858 4659 2256 105084 ---------------------------------- | ------------------------------ Avr: 4139 1729 71542 | 4752 2284 108536 Tot: 4446 2007 90039 root@dbs:~# 7za b -mmt1 7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,48 CPUs LE) LE CPU Freq: - - - - - - - - - RAM size: 96133 MB, # CPU hardware threads: 48 RAM usage: 435 MB, # Benchmark threads: 1 Compressing | Decompressing Dict Speed Usage R/U Rating | Speed Usage R/U Rating KiB/s % MIPS MIPS | KiB/s % MIPS MIPS 22: 2664 100 2593 2592 | 27461 100 2345 2345 23: 2377 100 2423 2422 | 27216 100 2356 2356 24: 2153 100 2317 2316 | 26935 100 2365 2365 25: 1988 100 2271 2270 | 26524 100 2361 2361 ---------------------------------- | ------------------------------ Avr: 100 2401 2400 | 100 2357 2356 Tot: 100 2379 2378
Kunpeng 920 7z output
7zr b 7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,96 CPUs LE) LE CPU Freq: - - - - - - - - - RAM size: 321420 MB, # CPU hardware threads: 96 RAM usage: 21180 MB, # Benchmark threads: 96 Compressing | Decompressing Dict Speed Usage R/U Rating | Speed Usage R/U Rating KiB/s % MIPS MIPS | KiB/s % MIPS MIPS 22: 150159 7512 1945 146075 | 2459507 6909 3037 209731 23: 135991 7105 1950 138559 | 3118221 9048 2984 269812 24: 138858 6878 2171 149301 | 3013256 8945 2958 264461 25: 145816 7561 2203 166488 | 2384245 7438 2853 212162 ---------------------------------- | ------------------------------ Avr: 7264 2067 150105 | 8085 2958 239042 Tot: 7675 2513 194573 $ 7zr b -mmt48 7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,96 CPUs LE) LE CPU Freq: - 64000000 64000000 - 128000000 - 512000000 - - RAM size: 321420 MB, # CPU hardware threads: 96 RAM usage: 10590 MB, # Benchmark threads: 48 Compressing | Decompressing Dict Speed Usage R/U Rating | Speed Usage R/U Rating KiB/s % MIPS MIPS | KiB/s % MIPS MIPS 22: 104046 4168 2429 101216 | 1691792 4670 3089 144270 23: 93868 4477 2136 95641 | 1613868 4618 3024 139645 24: 97195 4096 2552 104505 | 1588574 4629 3012 139434 25: 90848 4125 2515 103728 | 1451915 4396 2939 129206 ---------------------------------- | ------------------------------ Avr: 4217 2408 101272 | 4578 3016 138139 Tot: 4397 2712 119705 $ 7zr b -mmt1 7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,96 CPUs LE) LE CPU Freq: - - - - - - - - - RAM size: 321420 MB, # CPU hardware threads: 96 RAM usage: 435 MB, # Benchmark threads: 1 Compressing | Decompressing Dict Speed Usage R/U Rating | Speed Usage R/U Rating KiB/s % MIPS MIPS | KiB/s % MIPS MIPS 22: 3620 100 3527 3522 | 37243 100 3183 3180 23: 
3226 100 3293 3288 | 36950 100 3201 3198 24: 2872 100 3092 3088 | 36444 100 3202 3199 25: 2591 100 2963 2959 | 35783 100 3188 3185 ---------------------------------- | ------------------------------ Avr: 100 3219 3214 | 100 3194 3191 Tot: 100 3206 3202
STREAM
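A RAM bandwidth benchmark. According to the logs, the two runs used very different settings: 12 threads and 36.9 GiB per array on Baikal-S, and 96 threads and 152.6 MiB per array on Kunpeng 920.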
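STREAM's Best Rate is the number of bytes a kernel moves divided by its best (minimum) time, with MB meaning 10^6 bytes. Copy and Scale touch two arrays, while Add and Triad touch three. The sketch below recomputes the Baikal-S figures from the array size and minimum times in the log that follows. It assumes the default 8-byte double element type.

```python
# Assumption: default STREAM element type is an 8-byte double
BYTES_PER_ELEMENT = 8

# Arrays read or written by each kernel:
# Copy: c = a, Scale: b = s*c, Add: c = a + b, Triad: a = b + s*c
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def best_rate_mb_s(kernel: str, n_elements: int, min_time_s: float) -> float:
    """Bandwidth as STREAM reports it: bytes moved / best time, MB = 10**6 bytes."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * n_elements
    return bytes_moved / min_time_s / 1e6


# Array size and minimum times from the Baikal-S log
N = 4_947_848_500
min_times = {"Copy": 0.967941, "Scale": 1.014459, "Add": 1.559267, "Triad": 1.580914}
for kernel, t in min_times.items():
    print(f"{kernel:6s} {best_rate_mb_s(kernel, N, t):10.1f} MB/s")
```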
Baikal-S STREAM output
Array size = 4947848500 (elements), Offset = 0 (elements) Memory per array = 37749.1 MiB (= 36.9 GiB). Total memory required = 113247.3 MiB (= 110.6 GiB). Each kernel will be executed 3 times. The *best* time for each kernel (excluding the first iteration) will be used to compute the reported bandwidth. ------------------------------------------------------------- Number of Threads requested = 12 Number of Threads counted = 12 ------------------------------------------------------------- Your clock granularity/precision appears to be 1 microseconds. Each test below will take on the order of 1196001 microseconds. (= 1196001 clock ticks) Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test. ------------------------------------------------------------- WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer. ------------------------------------------------------------- Function Best Rate MB/s Avg time Min time Max time Copy: 81787.6 0.968455 0.967941 0.968970 Scale: 78037.2 1.015015 1.014459 1.015572 Add: 76156.5 1.559800 1.559267 1.560332 Triad: 75113.8 1.582214 1.580914 1.583514 ------------------------------------------------------------- Solution Validates: avg error less than 1.000000e-13 on all three arrays -------------------------------------------------------------
Kunpeng 920 STREAM output
Array size = 20000000 (elements), Offset = 0 (elements) Memory per array = 152.6 MiB (= 0.1 GiB). Total memory required = 457.8 MiB (= 0.4 GiB). Each kernel will be executed 20 times. The *best* time for each kernel (excluding the first iteration) will be used to compute the reported bandwidth. ------------------------------------------------------------- Number of Threads requested = 96 Number of Threads counted = 96 ------------------------------------------------------------- Your clock granularity/precision appears to be 1 microseconds. Each test below will take on the order of 3901 microseconds. (= 3901 clock ticks) Increase the size of the arrays if this shows that you are not getting at least 20 clock ticks per test. ------------------------------------------------------------- WARNING -- The above is only a rough guideline. For best results, please be sure you know the precision of your system timer. ------------------------------------------------------------- Function Best Rate MB/s Avg time Min time Max time Copy: 93271.5 0.004566 0.003431 0.006441 Scale: 75846.4 0.005011 0.004219 0.006355 Add: 78009.4 0.007845 0.006153 0.010400 Triad: 76593.7 0.007606 0.006267 0.009237 ------------------------------------------------------------- Solution Validates: avg error less than 1.000000e-13 on all three arrays
Blender
Baikal-S Blender output
blender -b RyzenGraphic_27.blend -f 1 -- --cycles-device CPU Blender 2.83.5 Fra:1 Mem:310.57M (0.00M, Peak 831.94M) | Time:00:16.02 | Mem:260.67M, Peak:262.29M | Scene, RenderLayer | Rendered 625/625 Tiles Fra:1 Mem:310.56M (0.00M, Peak 831.94M) | Time:00:16.02 | Mem:260.67M, Peak:262.29M | Scene, RenderLayer | Finished Fra:1 Mem:48.06M (0.00M, Peak 831.94M) | Time:00:16.05 | Sce: Scene Ve:0 Fa:0 La:0 Saved: '/tmp/0001.png' Time: 00:16.96 (Saving: 00:00.91)
Kunpeng 920 Blender output
blender -b RyzenGraphic_27.blend -f 1 Fra:1 Mem:49.27M (0.00M, Peak 822.50M) | Time:00:07.54 | Sce: Scene Ve:0 Fa:0 La:0 Saved: '/tmp/0001.png' Time: 00:08.20 (Saving: 00:00.65)
8 seconds on 96 cores.
On 48 cores (2 processors): Time: 00:13.56 (Saving: 00:00.66)
On 24 threads: Saved: '/tmp/0001.png' Time: 00:24.92 (Saving: 00:00.65)
On 8 threads: Time: 01:10.24 (Saving: 00:00.65)
Links
Original article: https://habr.com/ru/post/695484/