Using tcmalloc, QEMU sees higher performance when doing I/O on a virtio-blk device. See steps and numbers below.

Host versions:
glibc-2.21-5.fc22.x86_64
kernel-4.2.3-200.fc22.x86_64
gperftools-libs-2.4-1.fc22.x86_64

QEMU is locally compiled from qemu-kvm-rhev-2.3.0-31.el7 src with the configure options below:

./configure --enable-trace-backend=nop --enable-debug --target-list=x86_64-softmmu \
    --extra-ldflags=-lrt --prefix=/home/fam/build/install --disable-gtk \
    --extra-cflags=-Wno-error=deprecated-declarations

Guest versions:
kernel-4.0.4-301.fc22.x86_64
fio-2.2.4-1.fc22.x86_64

How reproducible:
Can reproduce reliably.

Steps to Reproduce:

1. Start QEMU, boot a Fedora 22 guest with a ramdisk (/dev/ram0) attached to a virtio-blk-pci device:

LD_PRELOAD=/usr/lib64/libtcmalloc.so.4 \
qemu-system-x86_64 \
    -enable-kvm \
    -name EU4OKS45 \
    -pidfile /tmp/qsh/EU4OKS45/pid \
    -qmp unix:/tmp/qsh/EU4OKS45/qmp.sock,server,nowait \
    -m 1024 \
    -vnc :0 \
    -device virtio-scsi-pci,id=virtio-scsi-bus-0 \
    -drive file=/home/fam/work/qsh/guest.qcow2,id=system-disk-drive,if=none,cache=writeback \
    -device ide-drive,drive=system-disk-drive,id=system-disk,bootindex=1 \
    -sdl \
    -serial file:/tmp/qsh/EU4OKS45/serial.out \
    -netdev user,id=virtio-nat-0,hostfwd=:0.0.0.0:10022-:22 \
    -device virtio-net-pci,id=virtio-net-pci-virtio-nat-0,netdev=virtio-nat-0 \
    -drive file=/dev/ram0,id=virtio-blk-disk-0,if=none,cache=none,aio=native \
    -device virtio-blk-pci,drive=virtio-blk-disk-0,id=virtio-blk-0,serial=virtio-blk-device-0,ioeventfd=on

2. Run a fio benchmark (4k sequential read with iodepth=8 and 16 concurrent jobs) against /dev/vda in the guest:

fio --rw=read --bs=4k --iodepth=8 --runtime=30 --filename=/dev/vda --numjobs=16 \
    --direct=1 --group_reporting --thread --name=fio-test-job --ioengine=libaio \
    --time_based --size=1G

3. Shut down the VM, restart QEMU without the LD_PRELOAD= modification, and rerun the same benchmark in the guest.
Actual results:

Using tcmalloc yields ~15% higher performance than glibc:

case       bw (MB/s)   iops
---------------------------------
tcmalloc   414         106068
glibc      354          90676
Ramdisk is initialized as:

modprobe brd rd_nr=1 rd_size=1024000

The host machine is my working laptop (Lenovo T430s) with Fedora 22 on it.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 58
model name      : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
stepping        : 9
microcode       : 0x1b
cpu MHz         : 1292.652
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt
bugs            :
bogomips        : 5786.95
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

(Processors 1-3 report identical values apart from apicid, core id, and the instantaneous cpu MHz reading; the duplicate entries are omitted here.)

$ cat /proc/meminfo
MemTotal:        7869740 kB
MemFree:         5790784 kB
MemAvailable:    6754028 kB
Buffers:           74284 kB
Cached:           976952 kB
SwapCached:            0 kB
Active:          1283392 kB
Inactive:         566244 kB
Active(anon):     804632 kB
Inactive(anon):   112976 kB
Active(file):     478760 kB
Inactive(file):   453268 kB
Unevictable:          16 kB
Mlocked:              16 kB
SwapTotal:      17272828 kB
SwapFree:       17272828 kB
Dirty:               120 kB
Writeback:             0 kB
AnonPages:        798580 kB
Mapped:           434236 kB
Shmem:            119224 kB
Slab:             117352 kB
SReclaimable:      72964 kB
SUnreclaim:        44388 kB
KernelStack:        5904 kB
PageTables:        25308 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    21207696 kB
Committed_AS:    2958772 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      381660 kB
VmallocChunk:   34358947836 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      150720 kB
DirectMap2M:     7927808 kB
Thanks, this is very useful information. I tried to reproduce your findings with stock qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22. Is this a valid test?

Can you provide a qemu invocation which can run in single user mode (e.g., instructions to set up a serial console to the VM, or networking)? I currently have to run the reproducer under X, and this might contribute to the relatively high variance I see.

I extracted the performance numbers from the "read :" line in the fio output from 20 runs each (within the same VM, after one warm-up run), with glibc malloc and tcmalloc, using:

awk -F '[=, KB/s]+' '/ read : /{print $7}'  # bw
awk -F '[=, KB/s]+' '/ read : /{print $9}'  # iops

> tcmalloc = read.table("tcmalloc.bw")
> glibc = read.table("glibc.bw")
> t.test(tcmalloc, glibc)

        Welch Two Sample t-test

data:  tcmalloc and glibc
t = -2.5096, df = 31.914, p-value = 0.01736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -38050.417  -3953.683
sample estimates:
mean of x mean of y
 654117.1  675119.2

> tcmalloc = read.table("tcmalloc.iops")
> glibc = read.table("glibc.iops")
> t.test(tcmalloc, glibc)

        Welch Two Sample t-test

data:  tcmalloc and glibc
t = -2.5096, df = 31.914, p-value = 0.01736
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9512.5416  -988.3584
sample estimates:
mean of x mean of y
 163528.8  168779.2

I think this shows that glibc malloc is actually faster than tcmalloc.
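The awk extraction above can also be done in Python, which makes it easier to feed the numbers straight into a statistics library later. A minimal sketch; the "read : io=..., bw=...KB/s, iops=..." line format is assumed from the awk field positions used above and may differ across fio versions:

```python
import re

# Matches fio's classic summary line, e.g.:
#   read : io=19660MB, bw=654117KB/s, iops=163528, runt= 30001msec
# (format assumed from the awk invocations above)
READ_LINE = re.compile(r"read\s*:.*?bw=(\d+)KB/s.*?iops=(\d+)")

def extract_bw_iops(fio_output):
    """Return (bw_kb_s, iops) tuples for every 'read :' summary line."""
    results = []
    for line in fio_output.splitlines():
        m = READ_LINE.search(line)
        if m:
            results.append((int(m.group(1)), int(m.group(2))))
    return results

sample = "  read : io=19660MB, bw=654117KB/s, iops=163528, runt= 30001msec"
print(extract_bw_iops(sample))  # [(654117, 163528)]
```

One tuple per run can then be written to the .bw/.iops files consumed by the R session.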
Comparison between glibc malloc and jemalloc follows.

> glibc = read.table("glibc.bw")
> jemalloc = read.table("jemalloc.bw")
> t.test(glibc, jemalloc)

        Welch Two Sample t-test

data:  glibc and jemalloc
t = 37.816, df = 23.405, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 167845.0 187251.5
sample estimates:
mean of x mean of y
 675119.2  497570.9

> glibc = read.table("glibc.iops")
> jemalloc = read.table("jemalloc.iops")
> t.test(glibc, jemalloc)

        Welch Two Sample t-test

data:  glibc and jemalloc
t = 37.816, df = 23.406, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 41961.15 46812.75
sample estimates:
mean of x mean of y
 168779.2  124392.3
With 2.3.0-31.el7, I can reproduce.

> glibc = read.table("q2-glibc.bw")
> tcmalloc = read.table("q2-tcmalloc.bw")
> t.test(glibc, tcmalloc)

        Welch Two Sample t-test

data:  glibc and tcmalloc
t = -8.789, df = 29.725, p-value = 9.152e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -59104.08 -36808.52
sample estimates:
mean of x mean of y
 382007.7  429964.0

> glibc = read.table("q2-glibc.iops")
> tcmalloc = read.table("q2-tcmalloc.iops")
> t.test(glibc, tcmalloc)

        Welch Two Sample t-test

data:  glibc and tcmalloc
t = -8.789, df = 29.726, p-value = 9.15e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14776.096  -9202.204
sample estimates:
mean of x mean of y
  95501.4  107490.6

jemalloc is even slower on this test:

> glibc = read.table("q2-glibc.bw")
> jemalloc = read.table("q2-jemalloc.bw")
> t.test(glibc, jemalloc)

        Welch Two Sample t-test

data:  glibc and jemalloc
t = 8.0144, df = 34.077, p-value = 2.393e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 34861.18 58544.42
sample estimates:
mean of x mean of y
 382007.7  335304.8

> glibc = read.table("q2-glibc.iops")
> jemalloc = read.table("q2-jemalloc.iops")
> t.test(glibc, jemalloc)

        Welch Two Sample t-test

data:  glibc and jemalloc
t = 8.0144, df = 34.077, p-value = 2.393e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  8715.295 14636.105
sample estimates:
mean of x mean of y
  95501.4   83825.7

Next step is to measure without --enable-debug. Apparently, it disables optimization and source fortification.
Now without --enable-debug. The difference between glibc malloc and tcmalloc is no longer statistically significant.

> glibc = read.table("q2O-glibc.bw")
> tcmalloc = read.table("q2O-tcmalloc.bw")
> t.test(glibc, tcmalloc)

        Welch Two Sample t-test

data:  glibc and tcmalloc
t = 1.6035, df = 37.897, p-value = 0.1171
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4242.287 36550.887
sample estimates:
mean of x mean of y
 377884.9  361730.6

> glibc = read.table("q2O-glibc.iops")
> tcmalloc = read.table("q2O-tcmalloc.iops")
> t.test(glibc, tcmalloc)

        Welch Two Sample t-test

data:  glibc and tcmalloc
t = 1.6036, df = 37.898, p-value = 0.1171
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1060.345 9137.945
sample estimates:
mean of x mean of y
 94470.85  90432.05

It seems that the tcmalloc bandwidth distribution is just broader (results are less predictable):

> summary(glibc)
       V1
 Min.   :330813
 1st Qu.:352874
 Median :371779
 Mean   :377885
 3rd Qu.:392586
 Max.   :437157
> summary(tcmalloc)
       V1
 Min.   :325622
 1st Qu.:339592
 Median :349546
 Mean   :361731
 3rd Qu.:373303
 Max.   :443457

But the jemalloc results are now much better than both tcmalloc and glibc malloc.
> glibc = read.table("q2O-glibc.bw")
> jemalloc = read.table("q2O-jemalloc.bw")
> t.test(glibc, jemalloc)

        Welch Two Sample t-test

data:  glibc and jemalloc
t = -15.873, df = 37.386, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -174130.9 -134720.4
sample estimates:
mean of x mean of y
 377884.9  532310.6

> glibc = read.table("q2O-glibc.iops")
> jemalloc = read.table("q2O-jemalloc.iops")
> t.test(glibc, jemalloc)

        Welch Two Sample t-test

data:  glibc and jemalloc
t = -15.873, df = 37.386, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -43532.49 -33679.91
sample estimates:
mean of x mean of y
 94470.85 133077.05

I need to double-check this because this result is suspicious.
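As an aside, the five-number summaries produced by R's summary() calls in the previous comment can be reproduced without R using only the Python standard library. A small sketch (statistics.quantiles with method="inclusive" corresponds to R's default type-7 quartiles, which is an assumption worth verifying against your R version):

```python
from statistics import mean, quantiles

def five_number_summary(values):
    """Min, quartiles, mean, max -- analogous to R's summary() on a numeric vector."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "Min": min(values), "1st Qu.": q1, "Median": median,
        "Mean": mean(values), "3rd Qu.": q3, "Max": max(values),
    }

# Tiny illustrative sample, not real benchmark data:
print(five_number_summary([1, 2, 3, 4, 5]))
```

Comparing the inter-quartile ranges of the glibc and tcmalloc samples is a quick way to see the "broader distribution" effect noted above.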
I did a double check of the results, just for peer review. The only high-level comment I have is that we should collect more samples for testing.

(In reply to Florian Weimer from comment #2)
> I think this shows that glibc malloc is actually faster than tcmalloc.

Agreed. It does for that configuration.

(In reply to Florian Weimer from comment #3)
> Comparison between glibc malloc and jemalloc follows.

Agreed, it shows a statistically significant difference, namely that jemalloc is slower.

(In reply to Florian Weimer from comment #4)
> With 2.3.0-31.el7, I can reproduce.

Agreed. Looks like the performance ranking (best to worst) is: tcmalloc, glibc, jemalloc. Something is certainly odd there. Analysis required.

(In reply to Florian Weimer from comment #5)
> Now without --enable-debug. The difference between glibc malloc and
> tcmalloc is no longer statistically significant.

Agreed (p-value > 0.05).

> It seems that the tcmalloc bandwidth distribution is just broader (results
> are less predictable):

Agreed.

> But the jemalloc results are now much better than both tcmalloc and glibc.

Agreed. If we can find out why, we might be able to copy the technique.

> I need to double-check this because this result is suspicious.

Agreed.
(In reply to Florian Weimer from comment #2)
> Can you provide a qemu invocation which can run in single user mode (e.g.,
> instructions to set up a serial console to the VM, or networking)? I
> currently have to run the reproducer under X, and this might contribute to
> the relatively high variance I see.

My guest doesn't have X; it is a minimal F22 installation.

My command line has "-vnc :0" and "-sdl", so you can access the vm tty from either vncviewer or the SDL window that is created. There is also the "-netdev user,id=virtio-nat-0,hostfwd=:0.0.0.0:10022-:22" option, which forwards host port 10022 to guest port 22, so I can ssh to the vm from the host with "ssh -p 10022 root@localhost".

If you want a serial console, add "console=tty0 console=ttyS0" to the guest kernel boot options and add "-serial stdio" to the command line.

Fam
I did another round of testing, comparing bandwidth (MB/s).

tcmalloc vs. glibc

qemu-kvm-rhev-2.3.0-31.el7 without --enable-debug:
760 vs 638

qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22 repo:
725 vs 694

qemu.git without --enable-debug:
418 vs 376

The qemu.git absolute numbers are very suspicious and I haven't looked into that yet, but the performance advantage of tcmalloc is consistent across all pairs.
(In reply to Fam Zheng from comment #8)
> I did another round of testing, comparing bandwidth (MB/s).
>
> tcmalloc vs. glibc
>
> qemu-kvm-rhev-2.3.0-31.el7 without --enable-debug:
> 760 vs 638
>
> qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22 repo:
> 725 vs 694
>
> qemu.git without --enable-debug:
> 418 vs 376
>
> The qemu.git absolute numbers are very suspicious but I haven't looked into
> it yet, but the performance advantage of tcmalloc is consistent across all
> four pairs.

In order to make these differences statistically significant you need to run them multiple times, and then carry out something like Welch's t-test (non-paired) as Florian used. How many times did you run the test? Are your reported values the mean results?
(In reply to Carlos O'Donell from comment #9)
> In order to make these differences statistically significant you need to run
> them multiple times, and then carry out something like the Welch's t-test
> (non-paired) like Florian used. How many times did you run the test? Are
> your reported values the mean results?

I didn't do a t-test, but each value is the mean of 16 repetitions. I'm just trying to reproduce the formal benchmarking done in BZ1213882#c5; this configuration, the "[11] single disk + virtio_blk" case, is where glibc was seen to be slower, and there Student's t-test was actually carried out.
(In reply to Carlos O'Donell from comment #9) > (In reply to Fam Zheng from comment #8) > > I did another round of testing, comparing bandwidth (MB/s). > > > > tcmalloc vs. glibc > > > > qemu-kvm-rhev-2.3.0-31.el7 without --enable-debug: > > 760 vs 638 > > > > qemu-kvm-2.3.1-6.fc22.x86_64 from Fedora 22 repo: > > 725 vs 694 > > > > qemu.git vs without --enable-debug > > 418 vs 376 > > > > The qemu.git absolute numbers are very suspecious but I haven't looked into > > it yet, but the performance advantage of tcmalloc is consistent across all > > four pairs. > > In order to make these differences statistically significant you need to run > them multiple times, and then carry out something like the Welch's t-test > (non-paired) like Florian used. How many times did you run the test? Are > your reported values the mean results? As an exmaple: sudo yum install R # create your results in two text files one value per line # one file for glibc e.g. glibc.iops, glibc.bw # and one for tcmalloc e.g. tcmalloc.iops, glibc.bw # start R R glibc = read.table("glibc.iops") tcmalloc = read.table("tcmalloc.iops") t.test(glibc, tcmalloc) If the p-value is greater than 0.05 then there is no statistically significant difference between the means for those value. That is to say that the iops achieved under glibc and tcmalloc are the same within the noise (roughly). With small p-values like 2.2e-16, there is a significant difference between the means of populations (iops or bw) and that difference needs to be understood by the glibc team in order to implement a solution. In truth we should do a power calculation based on our estimate of the differences we're trying to detect and that will tell us roughly how many test runs we need to do to detect such a difference. Secondly, if the samples are not normal, then we will again likely need more samples for the effect size to determine if there is a real difference. 
Theory says they will be normal, but rule-of-thumb shows we likely need 20-30 runs minimum.
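For environments without R, the Welch statistic and degrees of freedom discussed above can be computed directly from the two samples. A minimal pure-Python sketch; computing the p-value would additionally need a t-distribution CDF (e.g. scipy.stats.t.sf), which is omitted here:

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's (unequal-variance) t statistic and degrees of freedom,
    matching what R's default t.test() computes."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n-1)
    se2 = va / na + vb / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Tiny illustrative samples, not real benchmark data:
t, df = welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(round(t, 4), round(df, 1))  # -1.2247 4.0
```

Feeding in the .bw or .iops sample files used above should reproduce the t and df values R reports.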
(In reply to Fam Zheng from comment #10)
> (In reply to Carlos O'Donell from comment #9)
> > In order to make these differences statistically significant you need to run
> > them multiple times, and then carry out something like the Welch's t-test
> > (non-paired) like Florian used. How many times did you run the test? Are
> > your reported values the mean results?
>
> I didn't do t-test, but each value is the mean of 16 repetitions. I'm just
> trying to reproduce the formal benchmarking done in BZ1213882#c5 and this
> configuration is where glibc was seen slower, in the case "[11] single disk
> + virtio_blk", where Student's t-test was actually carried out.

Please have a look at R's t.test, which is Welch's t-test and is basically always better than Student's t-test for this kind of data. If Welch's t-test shows a difference, then that's good, and we can look into it.

I assume your testing is on your own box, the i7-3520M/8GB RAM?
(In reply to Fam Zheng from comment #10)
> (In reply to Carlos O'Donell from comment #9)
> > In order to make these differences statistically significant you need to run
> > them multiple times, and then carry out something like the Welch's t-test
> > (non-paired) like Florian used. How many times did you run the test? Are
> > your reported values the mean results?
>
> I didn't do t-test, but each value is the mean of 16 repetitions. I'm just
> trying to reproduce the formal benchmarking done in BZ1213882#c5 and this
> configuration is where glibc was seen slower, in the case "[11] single disk
> + virtio_blk", where Student's t-test was actually carried out.

For reference:
http://kvm-perf.englab.nay.redhat.com/results/regression/2015-w32/ramdisk/fio_raw_virtio_blk.html

The tcmalloc gains were in fio read, at 4%, which is what we're looking to reproduce here.

It looks like tcmalloc also had a 1-11% regression in fio randrw tests?

Is the "read" test more important than "randrw" (random read/write)?
(In reply to Carlos O'Donell from comment #13)
> (In reply to Fam Zheng from comment #10)
> > I didn't do t-test, but each value is the mean of 16 repetitions. I'm just
> > trying to reproduce the formal benchmarking done in BZ1213882#c5 and this
> > configuration is where glibc was seen slower, in the case "[11] single disk
> > + virtio_blk", where Student's t-test was actually carried out.
>
> For reference:
> http://kvm-perf.englab.nay.redhat.com/results/regression/2015-w32/ramdisk/fio_raw_virtio_blk.html
>
> The tcmalloc gains were made in fio read for 4%, which is what we're looking
> to reproduce here.
>
> It looks like tcmalloc also had a 1-11% regression in fio randrw tests?
>
> Is the "read" test more important than "randrw" (random read/write)?

Is "[2] raw + virtio_blk" also another feasible test to run? It had a ~9% gain in random write testing, which is different from test "[11] single disk + virtio_blk".
Note that we might want to use a Wilcoxon rank-sum test under the assumption that the means are not normal. Also note that the sample size of means for the official virt tests is only 4: despite the test running for 60 seconds, it still yields only 4 mean values for comparison. While that doesn't mean anything per se, if normality is violated I would expect you to need far more than 4 samples for Student's or Welch's t-test to reject the null hypothesis (as power is reduced).
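For reference, the statistic behind the rank-sum test mentioned above (the Mann-Whitney U, equivalent to Wilcoxon's rank-sum) can be computed in a few lines of plain Python. A minimal sketch without a p-value, which would require the normal approximation or exact tables:

```python
def rank_sum_u(sample_a, sample_b):
    """Mann-Whitney U statistic for sample_a vs. sample_b.
    Ties receive averaged ranks; no continuity correction or p-value."""
    combined = sorted((v, src) for src, s in ((0, sample_a), (1, sample_b)) for v in s)
    rank_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1  # group equal values together
        avg_rank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            if combined[k][1] == 0:
                rank_a += avg_rank
        i = j
    na = len(sample_a)
    return rank_a - na * (na + 1) / 2  # U for sample_a

# Completely separated samples give the extreme values 0 and na*nb:
print(rank_sum_u([1, 2, 3], [4, 5, 6]))  # 0.0
print(rank_sum_u([4, 5, 6], [1, 2, 3]))  # 9.0
```

With only 4 samples per group, even the extreme U value may not reach significance, which is the power concern raised above.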
(In reply to Fam Zheng from comment #7)
> (In reply to Florian Weimer from comment #2)
> > Can you provide a qemu invocation which can run in single user mode (e.g.,
> > instructions to set up a serial console to the VM, or networking)? I
> > currently have to run the reproducer under X, and this might contribute to
> > the relatively high variance I see.
>
> My guest doesn't have X, it is a minimal F22 installation.

I meant the host. I want to run without the full desktop environment, in an attempt to bring down the variance between the test runs.

> My command line has "-vnc :0" and "-sdl" so you can access the vm tty from
> either vncviewer or the SDL window that is created. Also there is "-netdev
> user,id=virtio-nat-0,hostfwd=:0.0.0.0:10022-:22" option that forwards host
> port 10022 to guest port 22, so I can ssh to the vm from host with "ssh -p
> 10022 root@localhost".
>
> If you want serial console, add "console=tty0 console=ttyS0" to guest kernel
> boot options and add "-serial stdio" to the command line.

Thanks, I will try that.
On a virtlab server that has no X, I reran the tests 16 times and did R's t-test:

$ cat read.glibc.out
409 420 428 420 421 410 413 404 427 425 429 430 431 417 430 404
$ cat read.tcmalloc.out
405 433 429 435 435 431 433 430 424 424 427 423 427 428 432 444

> glibc = read.table("read.glibc.out")
> tcmalloc = read.table("read.tcmalloc.out")
> t.test(glibc, tcmalloc)

        Welch Two Sample t-test

data:  glibc and tcmalloc
t = -2.8394, df = 29.456, p-value = 0.008111
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -15.263413  -2.486587
sample estimates:
mean of x mean of y
  419.875   428.750
(In reply to Carlos O'Donell from comment #13)
> It looks like tcmalloc also had a 1-11% regression in fio randrw tests?

Couldn't reproduce this; I don't see a significant difference in this test.

> Is the "read" test more important than "randrw" (random read/write)?

I cannot say; they're just different workloads. The actual importance depends on what users do with the system, but at any rate pure sequential I/O is far less common than mixed or random workloads.
(In reply to Fam Zheng from comment #19)
> (In reply to Carlos O'Donell from comment #13)
> > It looks like tcmalloc also had a 1-11% regression in fio randrw tests?
>
> Couldn't reproduce this, as I don't see a significant difference in this
> test.

OK. The original test showed a significant difference. Can I get access to the code that generates those tables, please?

> > Is the "read" test more important than "randrw" (random read/write)?
>
> I cannot say that, they're just different workloads. The actual importance
> depends on what users do with the system. But at any rate pure sequential
> I/O is far less common than mixed or random workload.

So the "read" improvement of 4% would likely not be "worth" as much as the loss of 11% in "randrw" (random read/write)?
(In reply to Carlos O'Donell from comment #22)
> OK. The original test showed a significant difference.
>
> Can I get access to the code that generates those tables please?

The tests belong to QE. Yanhui?
Created attachment 1088482 [details] perf.conf.new
Created attachment 1088483 [details] regression.new.py
(In reply to Carlos O'Donell from comment #22)
> OK. The original test showed a significant difference.
>
> Can I get access to the code that generates those tables please?

Please see the attachments (perf.conf.new, regression.new.py).

> So the "read" improvement of 4% would likely not be "worth" as much as the
> loss of 11% in "randrw" (random read write)?
It's worth testing with G_SLICE=always-malloc in the environment. This will match the original experiments more closely, and it will also match QEMU 2.5 which removes the g_slice_* allocator in favor of regular malloc.
(In reply to Yanhui Ma from comment #26)
> > Can I get access to the code that generates those tables please?
>
> Please see the attachments (perf.conf.new, regression.new.py)

Thanks! I see you're using scipy.stats.ttest_*, which helps us make sure we are computing similar values when we look at the final performance numbers.
Running the latest QEMU, I see (with perf) a lot of L1-dcache-load-misses in malloc, which go away with tcmalloc.
(In reply to Paolo Bonzini from comment #29)
> Running latest QEMU I see (with perf) a lot of L1-dcache-load-misses in
> malloc, that go away with tcmalloc.

We've started to make progress on this issue. DJ Delorie from the tools team is working on glibc's malloc and has added an experimental hybrid cache to it. As in tcmalloc and jemalloc, DJ has added a per-thread cache (making it a hybrid of per-thread/per-cpu) which can fetch from a local pool without any locking, thereby reducing the number of cycles required to get a block of memory. The pool refill adds latency, since you have to go back to the per-cpu cache to get memory, and eventually all the way back to the OS (mmap) if the pressure is high enough.

To reiterate: we are making progress here, and the numbers so far are quite good (200% speedup in some <1024-byte allocations) in our testing of effectively the same approach as tcmalloc and jemalloc. Any win for glibc's malloc is a win for the entire system.
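The fast-path/slow-path split described above can be sketched abstractly. This is a toy Python illustration of the idea only, not glibc's actual implementation: frees of small blocks land on a thread-local free list, and later allocations of the same size class are served from it without touching the shared lock.

```python
import threading

class SharedPool:
    """Lock-protected fallback allocator, standing in for the arena/OS path."""
    def __init__(self):
        self.lock = threading.Lock()
        self.allocations = 0  # counts trips down the slow path

    def alloc(self, size):
        with self.lock:  # contended path: every call takes the lock
            self.allocations += 1
            return bytearray(size)

class ThreadCachingAllocator:
    """Toy per-thread cache in the spirit of tcmalloc-style caching."""
    SMALL = 1024  # threshold below which blocks are cached per-thread

    def __init__(self):
        self.pool = SharedPool()
        self.tls = threading.local()

    def _cache(self):
        if not hasattr(self.tls, "free_lists"):
            self.tls.free_lists = {}  # size -> list of reusable blocks
        return self.tls.free_lists

    def alloc(self, size):
        free_list = self._cache().get(size)
        if size < self.SMALL and free_list:
            return free_list.pop()    # fast path: no locking
        return self.pool.alloc(size)  # slow path: shared, locked pool

    def free(self, block):
        if len(block) < self.SMALL:
            self._cache().setdefault(len(block), []).append(block)
        # large blocks would go back to the shared pool (omitted)

a = ThreadCachingAllocator()
b1 = a.alloc(64)   # miss: hits the shared pool
a.free(b1)
b2 = a.alloc(64)   # hit: served from the per-thread cache, lock-free
print(a.pool.allocations, b1 is b2)  # 1 True
```

The refill latency mentioned above corresponds to the cache-miss branch: when the local list is empty, the allocator must fall back to the locked pool.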
Created attachment 1112295 [details] test malloc - per-thread cache Here is a version of glibc's malloc to test, which has a new per-thread cache. It should work on Fedora 20+ or RHEL 7+. Use LD_PRELOAD=djmalloc.so as usual to test it. Note that since this version is split out from glibc's so, there might be some features that require integration with glibc to work correctly (i.e. there might be memory leaks due to thread exits not being cross-registered with the pthreads library). I'm providing this solely for testing performance :-) The primary boost in this new version is when small (<1024 byte) allocations happen more than once, a shorter path can be taken which is significantly faster due to a small per-thread cache.
I am asking LLNL to do some testing to verify that the work you did adding a per-thread cache helps address their most pressing performance issue. Other issues they point out, which also affect them, are:

1) Problems with growth in the virtual memory address space allocated to a process. Because they run diskless, any dirty memory becomes resident in RAM and can't get paged out. Thus reclaiming arenas and mmapped regions, rather than abandoning them, becomes more important.

2) They also seem to have problems with glibc's malloc fragmenting memory. Since they have been taught to do mallocs in the context of the thread that they plan on using the memory from, DJ's work may already address this.
Created attachment 1121501 [details] test malloc - per-thread cache Fixes a bug in the previous version
This bug appears to have been reported against 'rawhide' during the Fedora 24 development cycle. Changing version to '24'. More information and reason for this action is here: https://fedoraproject.org/wiki/Fedora_Program_Management/HouseKeeping/Fedora24#Rawhide_Rebase
This bug appears to have been reported against 'rawhide' during the Fedora 27 development cycle. Changing version to '27'. More information and reason for this action is here: https://fedoraproject.org/wiki/Releases/27/HouseKeeping#Rawhide_Rebase
The per-thread cache was released in glibc 2.26 and is available in rawhide. Could you please repeat your original tests to see if the performance difference is still significant?
I don't see a significant difference on rawhide now. Numbers in kIOPS:

tcmalloc  204
jemalloc  217
glibc     211