+++
date = 2012-03-07T00:00:00+00:00
title = "libmad on the BeagleBone"
tags = ["embedded"]
+++

(or really any Cortex-A)

I'm mainly interested in performance, as the energy used is roughly
proportional to 1/performance.

Setup: libmad-0.15.1b-7ubuntu1 from Precise, cross built with Linaro GCC
4.6 2012.02.

libmad's build has its own complicated and out-of-date way of selecting
the optimisations: it scans the `CFLAGS` to pull out the optimisation
level. Ubuntu turns this into a -O2.

At 720 MHz over USB.

Default -O2 setup: 23.4 s, 21.4 s, 21.5 s, 21.5 s

Switch to userspace governor and lock at 720 MHz: 21.2 s, 21.2 s, 21.2 s

\-O3 setup: 21.2 s, 21.2 s...

...as it's picking up the system libmad rather than the new build\!

\-O3 setup: 21.5 s, 21.5 s

\-O2 setup: 21.5 s, 21.5 s

There's very little difference in size, ~30 bytes, as the Ubuntu patch
forces it to -O2 and ignores the earlier `CFLAGS` parsing.

\-O3 setup: 21.7 s, 21.7 s - slower\!

Disable the assembly routines and see how it changes.

\-O3 noasm: 20.0 s, 20.0 s

\-O2 noasm: 20.4 s, 20.4 s

\-O3 noasm -mfpu=neon (turns on the vectoriser): 19.9 s, 19.9 s.

Not much change, which suggests a very hot function or some bad code.
perf time\!

perf report:

    55.17%  minimad  libc-2.13.so       [.] _IO_putc
    15.50%  minimad  libmad.so.0.2.1    [.] synth_full
     7.02%  minimad  libmad.so.0.2.1    [.] III_decode
     6.51%  minimad  libmad.so.0.2.1    [.] loop
     5.29%  minimad  libmad.so.0.2.1    [.] dct32
     2.64%  minimad  minimad            [.] output
     1.81%  minimad  libmad.so.0.2.1    [.] III_imdct_l
     1.29%  minimad  libmad.so.0.2.1    [.] mad_bit_read
     0.97%  minimad  libmad.so.0.2.1    [.] III_aliasreduce
     0.96%  minimad  libmad.so.0.2.1    [.] normal_block_x0_to_x17
     0.89%  minimad  libmad.so.0.2.1    [.] normal_block_x18_to_x35
     0.54%  minimad  minimad            [.] mad_stream_errorstr@plt
     0.41%  minimad  minimad            [.] mad_decoder_finish@plt

Or, in other words, dominated by the sample writer in minimad. Probably
due to this:

    sample = scale(*left_ch++);
    putchar((sample >> 0) & 0xff);
    putchar((sample >> 8) & 0xff);

    if (nchannels == 2) {
      sample = scale(*right_ch++);
      putchar((sample >> 0) & 0xff);
      putchar((sample >> 8) & 0xff);
    }

The callback writes 1152 samples per channel, so up to 4608 `putchar`
calls per frame. Turn this into a scale-to-buffer step to keep some
semblance of an output layer...
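A minimal sketch of what scale-to-buffer could look like, not the exact change behind the timings here. `fixed_t`, `FRACBITS` and `FIXED_ONE` are stand-ins for libmad's `mad_fixed_t` and friends so the block builds on its own, `scale()` follows the round/clip/quantise shape of the minimad helper, and `scale_to_buffer` and `write_frame` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for libmad's mad_fixed_t and MAD_F_FRACBITS (assumed 28). */
typedef int32_t fixed_t;
#define FRACBITS 28
#define FIXED_ONE ((fixed_t)1 << FRACBITS)
#define MAX_SAMPLES 1152 /* samples per channel per callback */

/* Round, clip and quantise one fixed-point sample to 16 bits. */
static int scale(fixed_t sample)
{
    sample += (fixed_t)1 << (FRACBITS - 16); /* round */
    if (sample >= FIXED_ONE)
        sample = FIXED_ONE - 1;              /* clip high */
    else if (sample < -FIXED_ONE)
        sample = -FIXED_ONE;                 /* clip low */
    return sample >> (FRACBITS + 1 - 16);    /* quantise */
}

/* Hypothetical helper: scale nsamples per channel into out as interleaved
   little-endian 16-bit PCM. Returns the number of bytes filled in. */
size_t scale_to_buffer(const fixed_t *left, const fixed_t *right,
                       unsigned nchannels, unsigned nsamples,
                       unsigned char *out)
{
    unsigned char *p = out;

    assert(nchannels == 1 || nchannels == 2);
    assert(nsamples <= MAX_SAMPLES);

    while (nsamples--) {
        int sample = scale(*left++);
        *p++ = (sample >> 0) & 0xff;
        *p++ = (sample >> 8) & 0xff;

        if (nchannels == 2) {
            sample = scale(*right++);
            *p++ = (sample >> 0) & 0xff;
            *p++ = (sample >> 8) & 0xff;
        }
    }
    return (size_t)(p - out);
}

/* Hypothetical output layer: one fwrite per callback instead of one
   putchar per byte. Returns 0 on success, -1 on a short write. */
int write_frame(FILE *fp, const fixed_t *left, const fixed_t *right,
                unsigned nchannels, unsigned nsamples)
{
    unsigned char buf[MAX_SAMPLES * 2 * 2]; /* 2 channels x 2 bytes */
    size_t n = scale_to_buffer(left, right, nchannels, nsamples, buf);

    return fwrite(buf, 1, n, fp) == n ? 0 : -1;
}
```

In the output callback this would replace the `putchar` loop with a single `write_frame(stdout, left_ch, right_ch, nchannels, nsamples)`.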

8.3 s, 8.3 s. Much better. perf shows:

    36.80%  minimad  libmad.so.0.2.1    [.] synth_full
    17.44%  minimad  libmad.so.0.2.1    [.] III_decode
    15.46%  minimad  libmad.so.0.2.1    [.] loop
    13.12%  minimad  libmad.so.0.2.1    [.] dct32
     3.84%  minimad  libmad.so.0.2.1    [.] III_imdct_l
     3.32%  minimad  libmad.so.0.2.1    [.] mad_bit_read
     2.33%  minimad  libmad.so.0.2.1    [.] III_aliasreduce
     2.31%  minimad  libmad.so.0.2.1    [.] normal_block_x18_to_x35
     2.21%  minimad  libmad.so.0.2.1    [.] normal_block_x0_to_x17
     0.59%  minimad  minimad            [.] output

\-O3 noasm novect: 8.6 s, 8.7 s

Note the noasm version is lower fidelity. I guess the assembly version
keeps things in 64 bits for longer.
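To illustrate the guess, here is a sketch of two ways to multiply fixed-point numbers with 28 fraction bits. It is an assumption, not a copy of libmad: `mul_64bit` models a multiply that keeps the full 64-bit product, and `mul_32bit` models a 32-bit-only fallback that shifts both operands down first. I believe libmad's portable fallback has roughly this shape, but I haven't checked which variant the noasm build actually selects.

```c
#include <stdint.h>

/* Stand-ins for libmad's fixed-point type and format (assumed Q28). */
typedef int32_t fixed_t;
#define FRACBITS 28
#define FIXED_ONE ((fixed_t)1 << FRACBITS)

/* Full precision: widen to 64 bits, multiply, then drop the fraction bits. */
fixed_t mul_64bit(fixed_t x, fixed_t y)
{
    return (fixed_t)(((int64_t)x * y) >> FRACBITS);
}

/* 32-bit only: round away the low 12 bits of x and the low 16 bits of y
   before multiplying, so the product fits in 32 bits. */
fixed_t mul_32bit(fixed_t x, fixed_t y)
{
    return ((x + (1 << 11)) >> 12) * ((y + (1 << 15)) >> 16);
}
```

Multiplying by 1.0 shows the loss: `mul_64bit(x, FIXED_ONE)` returns `x` unchanged, while `mul_32bit` drops the low bits of `x`.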

\-O3 asm novect: 10.3 s, 10.3 s

\-O3 noasm vect -marm: 8.2 s, 8.2 s. So Thumb-2 is similar to ARM mode.

\-O3 noasm vect -mtune=cortex-a8: 8.8 s, 8.1 s, 8.1 s. Tuning for the A8
instead of the A9 is slightly better.

## Hot functions

Best so far is -O3 noasm vect -mtune=cortex-a8 in Thumb-2. Hot
functions are:

    37.33%  minimad  libmad.so.0.2.1    [.] synth_full
    17.46%  minimad  libmad.so.0.2.1    [.] loop
    15.93%  minimad  libmad.so.0.2.1    [.] III_decode
    12.10%  minimad  libmad.so.0.2.1    [.] dct32
     4.23%  minimad  libmad.so.0.2.1    [.] III_imdct_l
     2.75%  minimad  libmad.so.0.2.1    [.] mad_bit_read

`synth_full` is dominated by the `ML0` and `MLA` macros. There's a 1..16
loop in there which the vectoriser could hit. The compiler is already
spotting the `mla`s.
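For reference, a sketch of the loop shape in question. This is not libmad's `synth_full`: `mac_rows`, `NSB` and `TAPS` are hypothetical, and the 64-bit accumulator is an assumption. It only shows an `ML0` style start of a sum followed by `MLA` style accumulates, with an outer loop whose iterations are independent of each other.

```c
#include <stdint.h>

#define FRACBITS 28
#define NSB 16  /* outer loop count, standing in for the 1..16 loop */
#define TAPS 8  /* multiply-accumulate steps per output (assumed) */

/* For each output: start the sum with one product (like ML0), add the
   rest (like MLA), then shift back down to the fixed-point format. */
void mac_rows(const int32_t *coef, const int32_t *samp, int32_t *out)
{
    for (int sb = 0; sb < NSB; sb++) {
        const int32_t *c = coef + sb * TAPS;
        const int32_t *s = samp + sb * TAPS;
        int64_t acc = (int64_t)c[0] * s[0];   /* ML0: first product */

        for (int i = 1; i < TAPS; i++)
            acc += (int64_t)c[i] * s[i];      /* MLA: accumulate */

        out[sb] = (int32_t)(acc >> FRACBITS);
    }
}
```

Each `out[sb]` depends only on its own row of inputs, which is the property a vectoriser needs to run several rows at once.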