+++
date = 2012-03-07T00:00:00+00:00
title = "libmad on the BeagleBone"
tags = ["embedded"]
+++

(or really any Cortex-A)

\[\[notes/bonemad\]\]

I'm mainly interested in performance, as the energy used is roughly
proportional to 1/performance.

The setup is a cross build using libmad-0.15.1b-7ubuntu1 from Precise and
Linaro GCC 4.6 2012.02.

libmad has its own complicated and out-of-date way of selecting the
optimisations: it scans the `CFLAGS` to pull out the optimisation level.
Ubuntu turns this into a `-O2`.

At 720 MHz over USB.

Default `-O2` setup: 23.4 s, 21.4 s, 21.5 s, 21.5 s

Switch to the userspace governor and lock at 720 MHz: 21.2 s, 21.2 s, 21.2 s

`-O3` setup: 21.2 s, 21.2 s...

...as it's picking up the system libmad!

`-O3` setup: 21.5 s, 21.5 s

`-O2` setup: 21.5 s, 21.5 s

There's very little difference in size (~30 bytes), as the Ubuntu patch
forces it to `-O2` and ignores the earlier `CFLAGS` parsing.

`-O3` setup: 21.7 s, 21.7 s - slower!

Disable the assembly routines and see how it changes.

`-O3` noasm: 20.0 s, 20.0 s

`-O2` noasm: 20.4 s, 20.4 s

`-O3` noasm `-mfpu=neon` (turns on the vectoriser): 19.9 s, 19.9 s.

Not much change, which suggests a very hot function or some bad code.
Time for perf!

perf report:

    55.17%  minimad  libc-2.13.so     [.] _IO_putc
    15.50%  minimad  libmad.so.0.2.1  [.] synth_full
     7.02%  minimad  libmad.so.0.2.1  [.] III_decode
     6.51%  minimad  libmad.so.0.2.1  [.] loop
     5.29%  minimad  libmad.so.0.2.1  [.] dct32
     2.64%  minimad  minimad          [.] output
     1.81%  minimad  libmad.so.0.2.1  [.] III_imdct_l
     1.29%  minimad  libmad.so.0.2.1  [.] mad_bit_read
     0.97%  minimad  libmad.so.0.2.1  [.] III_aliasreduce
     0.96%  minimad  libmad.so.0.2.1  [.] normal_block_x0_to_x17
     0.89%  minimad  libmad.so.0.2.1  [.] normal_block_x18_to_x35
     0.54%  minimad  minimad          [.] mad_stream_errorstr@plt
     0.41%  minimad  minimad          [.] mad_decoder_finish@plt

In other words, the run is dominated by the sample writer in minimad,
probably due to this:

    sample = scale(*left_ch++);
    putchar((sample >> 0) & 0xff);
    putchar((sample >> 8) & 0xff);

    if (nchannels == 2) {
      sample = scale(*right_ch++);
      putchar((sample >> 0) & 0xff);
      putchar((sample >> 8) & 0xff);
    }

The callback writes 1152 samples per call, with two `putchar()` calls per
sample per channel. Turn this into a scale-to-buffer to keep some semblance
of an output layer...

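As a rough sketch of the scale-to-buffer idea: scale each sample into a local byte buffer and hand the whole frame to stdio in a single `fwrite()`. The names `fixed_t`, `FRACBITS`, `ONE`, `put16`, `pack_pcm` and `write_frame` are stand-ins made up for this example so that it builds without `mad.h`; a real callback would use `mad_fixed_t` and the PCM fields from libmad directly. `scale()` here is modelled on the one in minimad, not copied from the patched source.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for libmad's mad_fixed_t and MAD_F_* so the sketch builds alone. */
typedef int32_t fixed_t;
#define FRACBITS 28
#define ONE ((fixed_t)1 << FRACBITS)

/* Round, clip and quantise one fixed-point sample to signed 16 bits. */
static int scale(fixed_t sample)
{
    sample += (fixed_t)1 << (FRACBITS - 16);
    if (sample >= ONE)
        sample = ONE - 1;
    else if (sample < -ONE)
        sample = -ONE;
    return sample >> (FRACBITS + 1 - 16);
}

/* Append one sample to the buffer as little-endian 16-bit PCM. */
static unsigned char *put16(unsigned char *p, int sample)
{
    *p++ = (unsigned char)(sample & 0xff);
    *p++ = (unsigned char)((sample >> 8) & 0xff);
    return p;
}

/*
 * Scale nsamples samples per channel into buf, interleaving left/right.
 * buf needs room for nsamples * nchannels * 2 bytes.
 * Returns the number of bytes written to buf.
 */
size_t pack_pcm(const fixed_t *left, const fixed_t *right,
                unsigned nchannels, unsigned nsamples,
                unsigned char *buf)
{
    unsigned char *p = buf;

    while (nsamples--) {
        p = put16(p, scale(*left++));
        if (nchannels == 2)
            p = put16(p, scale(*right++));
    }
    return (size_t)(p - buf);
}

/* One stdio call per frame instead of several per sample. */
int write_frame(const fixed_t *left, const fixed_t *right,
                unsigned nchannels, unsigned nsamples, FILE *out)
{
    unsigned char buf[1152 * 2 * 2];   /* 1152 samples, 2 channels, 16 bit */
    size_t n;

    if (nsamples > 1152)
        return -1;
    n = pack_pcm(left, right, nchannels, nsamples, buf);
    return fwrite(buf, 1, n, out) == n ? 0 : -1;
}
```

Keeping the packing in `pack_pcm()` and the I/O in `write_frame()` leaves a thin output layer in place while cutting the libc calls to one per frame.
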
With the output buffered: 8.3 s, 8.3 s. Much better. perf now shows:

    36.80%  minimad  libmad.so.0.2.1  [.] synth_full
    17.44%  minimad  libmad.so.0.2.1  [.] III_decode
    15.46%  minimad  libmad.so.0.2.1  [.] loop
    13.12%  minimad  libmad.so.0.2.1  [.] dct32
     3.84%  minimad  libmad.so.0.2.1  [.] III_imdct_l
     3.32%  minimad  libmad.so.0.2.1  [.] mad_bit_read
     2.33%  minimad  libmad.so.0.2.1  [.] III_aliasreduce
     2.31%  minimad  libmad.so.0.2.1  [.] normal_block_x18_to_x35
     2.21%  minimad  libmad.so.0.2.1  [.] normal_block_x0_to_x17
     0.59%  minimad  minimad          [.] output

`-O3` noasm novect: 8.6 s, 8.7 s

Note that the noasm version is lower fidelity. I guess the assembly version
keeps intermediate values in 64 bits for longer.

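A minimal sketch of why that could be, assuming a 4.28 fixed-point format like libmad's `mad_fixed_t`: a full multiply needs a 64-bit intermediate, whereas a 32-bit-only multiply has to throw away low bits of both operands before multiplying. The names `mul_wide` and `mul_narrow` are hypothetical, and the routines are only modelled on that idea rather than copied from libmad.

```c
#include <stdint.h>

#define FRACBITS 28   /* assumed: 4.28 fixed point, as in mad_fixed_t */

/* Full-precision multiply: 64-bit product, then drop the extra fraction bits. */
int32_t mul_wide(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * y) >> FRACBITS);
}

/*
 * 32-bit-only multiply: pre-shift x by 12 and y by 16 (28 bits in total),
 * with rounding, so the product fits in 32 bits. The low bits of both
 * operands are lost before the multiply.
 */
int32_t mul_narrow(int32_t x, int32_t y)
{
    return ((x + (1 << 11)) >> 12) * ((y + (1 << 15)) >> 16);
}
```

Both agree on 1.0 × 1.0, but a small input such as `0x7ff` multiplied by 1.0 survives `mul_wide()` and rounds to zero in `mul_narrow()`, which is the kind of error that would show up as lower fidelity.
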
`-O3` asm novect: 10.3 s, 10.3 s

`-O3` noasm vect `-marm`: 8.2 s, 8.2 s. So Thumb-2 is similar to ARM mode.

`-O3` noasm vect `-mtune=cortex-a8`: 8.8 s, 8.1 s, 8.1 s. Tuning for the A8
instead of the A9 is slightly better.

## Hot functions

The best setup so far is `-O3` noasm vect `-mtune=cortex-a8` in Thumb-2. The
hot functions are:

    37.33%  minimad  libmad.so.0.2.1  [.] synth_full
    17.46%  minimad  libmad.so.0.2.1  [.] loop
    15.93%  minimad  libmad.so.0.2.1  [.] III_decode
    12.10%  minimad  libmad.so.0.2.1  [.] dct32
     4.23%  minimad  libmad.so.0.2.1  [.] III_imdct_l
     2.75%  minimad  libmad.so.0.2.1  [.] mad_bit_read

`synth_full` is dominated by `ML0` and `MLA` operations. There's a 1..16 loop
in there which the vectoriser could hit. The compiler is spotting the `mla`s.

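For illustration, the pattern in question is a chain of multiply-accumulates: one initial multiply followed by accumulating multiplies. The sketch below uses a hypothetical function name and assumes a plain 32-bit accumulator, which is an assumption about the non-assembly build rather than something taken from the libmad source. A loop of this shape is the sort of integer reduction the vectoriser can turn into vector multiply-accumulates.

```c
#include <stdint.h>

/*
 * Multiply-accumulate over n taps with a 32-bit accumulator.
 * The first product initialises the sum (the ML0 step) and the rest
 * are added to it (the MLA steps).
 */
int32_t mac(const int32_t *restrict x, const int32_t *restrict y, int n)
{
    int32_t sum;
    int i;

    if (n <= 0)
        return 0;
    sum = x[0] * y[0];          /* ML0 */
    for (i = 1; i < n; i++)
        sum += x[i] * y[i];     /* MLA */
    return sum;
}
```
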