wiki/content/note/bonemad.md
2017-03-13 21:03:07 +01:00

+++
date = 2012-03-07T00:00:00+00:00
title = "libmad on the BeagleBone"
tags = ["embedded"]
+++
(or really any Cortex-A)

I'm mainly interested in performance, as power used ~= 1/performance.

Using libmad-0.15.1b-7ubuntu1 from Precise and Linaro GCC 4.6 2012.02 as
a cross build setup.

libmad has its own complicated and out-of-date way of selecting the
optimisations: it scans the `CFLAGS` to pull out the optimisation level.
Ubuntu turns this into -O2.

At 720 MHz over USB.

Default -O2 setup: 23.4 s, 21.4 s, 21.5 s, 21.5 s

Switch to the userspace governor and lock at 720 MHz: 21.2 s, 21.2 s, 21.2 s

\-O3 setup: 21.2 s, 21.2 s...

...as it's picking up the system libmad! After fixing that:

\-O3 setup: 21.5 s, 21.5 s

\-O2 setup: 21.5 s, 21.5 s

There's very little difference in size (~30 bytes), as the Ubuntu patch
forces it to -O2 and ignores the earlier `CFLAGS` parsing.

\-O3 setup: 21.7 s, 21.7 s - slower!

Disable the assembly routines and see how it changes.

\-O3 noasm: 20.0 s, 20.0 s

\-O2 noasm: 20.4 s, 20.4 s

\-O3 noasm -mfpu=neon (turns on the vectoriser): 19.9 s, 19.9 s.

Not much change, which suggests a very hot function or some bad code.

perf time!

perf report:

    55.17%  minimad  libc-2.13.so     [.] _IO_putc
    15.50%  minimad  libmad.so.0.2.1  [.] synth_full
     7.02%  minimad  libmad.so.0.2.1  [.] III_decode
     6.51%  minimad  libmad.so.0.2.1  [.] loop
     5.29%  minimad  libmad.so.0.2.1  [.] dct32
     2.64%  minimad  minimad          [.] output
     1.81%  minimad  libmad.so.0.2.1  [.] III_imdct_l
     1.29%  minimad  libmad.so.0.2.1  [.] mad_bit_read
     0.97%  minimad  libmad.so.0.2.1  [.] III_aliasreduce
     0.96%  minimad  libmad.so.0.2.1  [.] normal_block_x0_to_x17
     0.89%  minimad  libmad.so.0.2.1  [.] normal_block_x18_to_x35
     0.54%  minimad  minimad          [.] mad_stream_errorstr@plt
     0.41%  minimad  minimad          [.] mad_decoder_finish@plt

Or, in other words, dominated by the sample writer in minimad. Probably
due to this:

    sample = scale(*left_ch++);
    putchar((sample >> 0) & 0xff);
    putchar((sample >> 8) & 0xff);

    if (nchannels == 2) {
      sample = scale(*right_ch++);
      putchar((sample >> 0) & 0xff);
      putchar((sample >> 8) & 0xff);
    }

minimad writes 1152 samples per callback, which means up to four
`putchar` calls per stereo sample. Turn this into a scale-to-buffer to
keep some semblance of an output layer...
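
A minimal sketch of what such a scale-to-buffer step could look like, not the exact change that was benchmarked. `fixed_t`, `FRACBITS` and `FIXED_ONE` are stand-ins for libmad's `mad_fixed_t`, `MAD_F_FRACBITS` and `MAD_F_ONE` (28 fractional bits assumed) so that it builds without `mad.h`. `scale_to_buffer` and its NULL-for-mono convention are hypothetical names and choices made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for mad_fixed_t / MAD_F_FRACBITS / MAD_F_ONE (assumed values). */
typedef int32_t fixed_t;
enum { FRACBITS = 28 };
#define FIXED_ONE ((fixed_t)1 << FRACBITS)

/* Round, clip and quantise one fixed-point sample to signed 16 bits.
   Relies on arithmetic right shift of negative values, as GCC provides. */
static int scale(fixed_t sample)
{
    sample += (fixed_t)1 << (FRACBITS - 16);   /* round */
    if (sample >= FIXED_ONE)                   /* clip */
        sample = FIXED_ONE - 1;
    else if (sample < -FIXED_ONE)
        sample = -FIXED_ONE;
    return sample >> (FRACBITS + 1 - 16);      /* quantise */
}

/* Pack nsamples frames into buf as interleaved 16-bit little-endian PCM.
   Pass right == NULL for mono. buf must hold 2 bytes per sample per
   channel. Returns the number of bytes written. */
size_t scale_to_buffer(const fixed_t *left, const fixed_t *right,
                       size_t nsamples, unsigned char *buf)
{
    unsigned char *p = buf;

    for (size_t i = 0; i < nsamples; i++) {
        int s = scale(left[i]);
        *p++ = (unsigned char)((s >> 0) & 0xff);
        *p++ = (unsigned char)((s >> 8) & 0xff);

        if (right != NULL) {
            s = scale(right[i]);
            *p++ = (unsigned char)((s >> 0) & 0xff);
            *p++ = (unsigned char)((s >> 8) & 0xff);
        }
    }
    return (size_t)(p - buf);
}
```

The output callback would then call this once and hand the result to a single `fwrite(buf, 1, n, stdout)`, with `buf` sized for 1152 frames x 2 channels x 2 bytes = 4608 bytes.
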

8.3 s, 8.3 s. Much better. perf shows:

    36.80%  minimad  libmad.so.0.2.1  [.] synth_full
    17.44%  minimad  libmad.so.0.2.1  [.] III_decode
    15.46%  minimad  libmad.so.0.2.1  [.] loop
    13.12%  minimad  libmad.so.0.2.1  [.] dct32
     3.84%  minimad  libmad.so.0.2.1  [.] III_imdct_l
     3.32%  minimad  libmad.so.0.2.1  [.] mad_bit_read
     2.33%  minimad  libmad.so.0.2.1  [.] III_aliasreduce
     2.31%  minimad  libmad.so.0.2.1  [.] normal_block_x18_to_x35
     2.21%  minimad  libmad.so.0.2.1  [.] normal_block_x0_to_x17
     0.59%  minimad  minimad          [.] output

\-O3 noasm novect: 8.6 s, 8.7 s

Note the noasm version is lower fidelity. I guess the assembly version
keeps things in 64 bits for longer.
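
A sketch of how that fidelity gap could arise. It assumes the noasm build falls back to a portable 32-bit multiply that shifts both operands down before multiplying (which is how I recall libmad's default fixed-point multiply working), while the assembly path forms a full 64-bit product and shifts afterwards. `mul_preshift` and `mul_wide` are illustrative names, not libmad's, and `FRACBITS` is assumed to match `MAD_F_FRACBITS`.

```c
#include <stdint.h>

enum { FRACBITS = 28 };   /* assumed to match libmad's MAD_F_FRACBITS */

/* Portable 32-bit multiply: round and drop 12 bits of x and 16 bits of y
   first, so the product needs no further scaling. The low bits of both
   operands are gone before the multiply happens. */
int32_t mul_preshift(int32_t x, int32_t y)
{
    return ((x + (1 << 11)) >> 12) * ((y + (1 << 15)) >> 16);
}

/* Wide multiply: form the full 64-bit product, then scale back by
   FRACBITS. Nothing is discarded until after the multiply. */
int32_t mul_wide(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> FRACBITS);
}
```

For large operands the two agree, but a small `x` such as `0x400` multiplied by 1.0 (`0x10000000`) comes out as `0x400` from `mul_wide` and as 0 from `mul_preshift`, since the pre-shift has already thrown the value away.
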

\-O3 asm novect: 10.3 s, 10.3 s

\-O3 noasm vect -marm: 8.2 s, 8.2 s. So Thumb-2 is similar to ARM mode.

\-O3 noasm vect -mtune=cortex-a8: 8.8 s, 8.1 s, 8.1 s. Tuning for A8
instead of A9 is slightly better.

## Hot functions

Best so far is -O3 noasm vect -mtune=cortex-a8 in Thumb-2. Hot
functions are:

    37.33%  minimad  libmad.so.0.2.1  [.] synth_full
    17.46%  minimad  libmad.so.0.2.1  [.] loop
    15.93%  minimad  libmad.so.0.2.1  [.] III_decode
    12.10%  minimad  libmad.so.0.2.1  [.] dct32
     4.23%  minimad  libmad.so.0.2.1  [.] III_imdct_l
     2.75%  minimad  libmad.so.0.2.1  [.] mad_bit_read

`synth_full` is dominated by ML0 and MLAs. There's a 1..16 loop in there
which the vectoriser could hit. The compiler is spotting the mlas.
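
For reference, a rough sketch of the multiply-accumulate pattern being described: one initial product (the ML0 step) followed by a run of accumulating products (the MLA steps), with the result scaled back to fixed point. This is not libmad's code. `mac_taps`, the tap count of 8 and the 64-bit accumulator are all assumptions made for illustration.

```c
#include <stdint.h>

enum { FRACBITS = 28, TAPS = 8 };   /* assumed values, for illustration */

/* Fixed-point dot product of TAPS samples against TAPS coefficients. */
int32_t mac_taps(const int32_t *x, const int32_t *d)
{
    int64_t acc = (int64_t)x[0] * d[0];      /* first product (ML0-like) */

    for (int i = 1; i < TAPS; i++)
        acc += (int64_t)x[i] * d[i];         /* accumulate (MLA-like) */

    return (int32_t)(acc >> FRACBITS);       /* scale back to fixed point */
}
```

Each iteration is a widening multiply plus an add, which is the shape a compiler can map onto a multiply-accumulate instruction. Whether the vectoriser does anything useful with the enclosing loop is a separate question.
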