libmad on the BeagleBone
(or really any Cortex-A)
[[notes/bonemad]]
I’m mainly interested in performance as power used ~= 1/performance.
Using libmad-0.15.1b-7ubuntu1 from Precise and Linaro GCC 4.6 2012.02 as a cross build setup.
libmad has its own complicated and out-of-date way of selecting the optimisations: it scans the CFLAGS to pull out the optimisation level. Ubuntu turns this into a plain -O2.
At 720 MHz over USB.
Default -O2 setup: 23.4 s, 21.4 s, 21.5 s, 21.5 s
Switch to userspace governor and lock at 720 MHz: 21.2 s, 21.2 s, 21.2 s
-O3 setup: 21.2 s, 21.2 s…
…as it’s picking up the system libmad!
-O3 setup: 21.5 s, 21.5 s
-O2 setup: 21.5 s, 21.5 s
There’s very little difference in size (~30 bytes), as the Ubuntu patch forces it to -O2 and ignores the earlier CFLAGS parsing.
-O3 setup: 21.7 s, 21.7 s - slower!
Disable the assembly routines and see how it changes.
-O3 noasm: 20.0 s, 20.0 s
-O2 noasm: 20.4 s, 20.4 s
-O3 noasm -mfpu=neon (turns on the vectoriser): 19.9 s, 19.9 s.
Not much change, which suggests a very hot function or some bad code. perf time!
perf report:
55.17% minimad libc-2.13.so [.] _IO_putc
15.50% minimad libmad.so.0.2.1 [.] synth_full
7.02% minimad libmad.so.0.2.1 [.] III_decode
6.51% minimad libmad.so.0.2.1 [.] loop
5.29% minimad libmad.so.0.2.1 [.] dct32
2.64% minimad minimad [.] output
1.81% minimad libmad.so.0.2.1 [.] III_imdct_l
1.29% minimad libmad.so.0.2.1 [.] mad_bit_read
0.97% minimad libmad.so.0.2.1 [.] III_aliasreduce
0.96% minimad libmad.so.0.2.1 [.] normal_block_x0_to_x17
0.89% minimad libmad.so.0.2.1 [.] normal_block_x18_to_x35
0.54% minimad minimad [.] mad_stream_errorstr@plt
0.41% minimad minimad [.] mad_decoder_finish@plt
Or, in other words, dominated by the sample writer in minimad. Probably due to this:
sample = scale(*left_ch++);
putchar((sample >> 0) & 0xff);
putchar((sample >> 8) & 0xff);
if (nchannels == 2) {
    sample = scale(*right_ch++);
    putchar((sample >> 0) & 0xff);
    putchar((sample >> 8) & 0xff);
}
The output callback writes 1152 samples per call, which means two putchar() calls per sample per channel. Turn this into a scale-to-buffer to keep some semblance of an output layer…
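A minimal sketch of what the scale-to-buffer step could look like, not the exact change that was timed. The names fixed_t, F_FRACBITS, F_ONE and scale_to_buffer are stand-ins invented here so the block builds on its own; in minimad they would be mad_fixed_t, the MAD_F_* constants and the body of the output callback. scale() follows the round, clip, quantise pattern of minimad's own helper.

```c
#include <stddef.h>

/* Stand-ins for libmad's mad_fixed_t and MAD_F_* (assumed Q28 format). */
typedef int fixed_t;
#define F_FRACBITS 28
#define F_ONE      (1 << F_FRACBITS)

/* Round, clip and quantise a fixed-point sample to 16 bits.
 * Relies on arithmetic right shift of negative values, as GCC provides. */
static int scale(fixed_t sample)
{
    sample += 1 << (F_FRACBITS - 16);       /* round */
    if (sample >= F_ONE)                    /* clip */
        sample = F_ONE - 1;
    else if (sample < -F_ONE)
        sample = -F_ONE;
    return sample >> (F_FRACBITS + 1 - 16); /* quantise */
}

/* Scale one block into buf as interleaved 16-bit little-endian PCM.
 * right may be NULL for mono. buf must hold nsamples * 2 bytes per channel.
 * Returns the number of bytes written to buf. */
static size_t scale_to_buffer(unsigned char *buf, const fixed_t *left,
                              const fixed_t *right, size_t nsamples)
{
    unsigned char *p = buf;

    for (size_t i = 0; i < nsamples; i++) {
        int s = scale(left[i]);
        *p++ = (unsigned char)(s & 0xff);
        *p++ = (unsigned char)((s >> 8) & 0xff);
        if (right != NULL) {
            s = scale(right[i]);
            *p++ = (unsigned char)(s & 0xff);
            *p++ = (unsigned char)((s >> 8) & 0xff);
        }
    }
    return (size_t)(p - buf);
}
```

The callback would then fill a 1152 * 4 byte buffer and hand it to a single fwrite(buf, 1, n, stdout), so stdio is entered once per callback instead of up to 4608 times.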
8.3 s, 8.3 s. Much better. perf shows:
36.80% minimad libmad.so.0.2.1 [.] synth_full
17.44% minimad libmad.so.0.2.1 [.] III_decode
15.46% minimad libmad.so.0.2.1 [.] loop
13.12% minimad libmad.so.0.2.1 [.] dct32
3.84% minimad libmad.so.0.2.1 [.] III_imdct_l
3.32% minimad libmad.so.0.2.1 [.] mad_bit_read
2.33% minimad libmad.so.0.2.1 [.] III_aliasreduce
2.31% minimad libmad.so.0.2.1 [.] normal_block_x18_to_x35
2.21% minimad libmad.so.0.2.1 [.] normal_block_x0_to_x17
0.59% minimad minimad [.] output
-O3 noasm novect: 8.6 s, 8.7 s
Note the noasm version is lower fidelity. I guess the assembly version keeps things in 64 bits for longer.
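To illustrate the guess about fidelity, here is a sketch of two ways to multiply Q28 fixed-point values. It is modelled on my reading of the variants in libmad's fixed.h, so treat the exact shift amounts as assumptions and check the header. The names mul_64bit and mul_approx are made up for this example.

```c
#include <stdint.h>

#define SCALEBITS 28   /* fractional bits (assumed Q28, as libmad uses) */

/* Widen to 64 bits, multiply, then shift: no precision lost in the inputs. */
static int32_t mul_64bit(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> SCALEBITS);
}

/* 32-bit only: round away the low bits of both operands first so the
 * product fits in 32 bits. 12 + 16 = 28 bits are dropped before the
 * multiply rather than after it. */
static int32_t mul_approx(int32_t x, int32_t y)
{
    return ((x + (1 << 11)) >> 12) * ((y + (1 << 15)) >> 16);
}
```

With x = 0x10000000 (1.0) and y = 0x1234, the 64-bit version returns 0x1234 while the 32-bit version returns 0, because y rounds to zero before the multiply. Small values like this are where the lower fidelity would come from.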
-O3 asm novect: 10.3 s, 10.3 s
-O3 noasm vect -marm: 8.2 s, 8.2 s. So Thumb-2 is similar to ARM mode.
-O3 noasm vect -mtune=cortex-a8: 8.8 s, 8.1 s, 8.1 s. Tuning for the A8 instead of the A9 is slightly better.
Hot functions
The best so far is -O3 noasm vect -mtune=cortex-a8 in Thumb-2. Hot functions are:
37.33% minimad libmad.so.0.2.1 [.] synth_full
17.46% minimad libmad.so.0.2.1 [.] loop
15.93% minimad libmad.so.0.2.1 [.] III_decode
12.10% minimad libmad.so.0.2.1 [.] dct32
4.23% minimad libmad.so.0.2.1 [.] III_imdct_l
2.75% minimad libmad.so.0.2.1 [.] mad_bit_read
synth_full is dominated by ML0 and MLAs. There’s a 1..16 loop in there which the vectoriser could hit. The compiler is spotting the mlas.
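For reference, the general shape of a multiply-accumulate run like this, as a simplified sketch. This is not libmad's code: mac_taps is a hypothetical name, the operands are assumed contiguous (the real loop may index its window differently), and the pre-shifts assume the 32-bit noasm style of multiply.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of products with a 32-bit accumulator. Each step is one multiply
 * plus one add, which is the pattern a compiler can turn into MLA and
 * which a vectoriser can pick up when the accesses are regular. */
static int32_t mac_taps(const int32_t *x, const int32_t *y, size_t ntaps)
{
    int32_t acc = 0;

    for (size_t i = 0; i < ntaps; i++)
        acc += (x[i] >> 12) * (y[i] >> 16);
    return acc;
}
```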