When you set the blocksize above 32 (on CTAG BEAST) or 64 (on CTAG FACE), an extra thread is spawned in which the hardware blocks of 32/64 samples are buffered up to the user-specified blocksize. I think that only works in multiples of the hardware block size, which would explain the actual blocksize you are observing.
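To make the "multiples of the hardware block size" point concrete, here is a small sketch of the arithmetic. `nearestMultiples` is a hypothetical helper, not part of the Bela API, and I have not checked which of the two candidates (if either) the driver actually picks, so treat it only as a way to see which sizes are reachable.

```cpp
#include <cassert>

// Hypothetical helper, NOT part of the Bela API: given a requested blocksize
// and the hardware block size, report the nearest sizes that are whole
// multiples of the hardware block.
struct BlockSizeCandidates {
    unsigned int below; // largest multiple of hwBlock that is <= requested
    unsigned int above; // smallest multiple of hwBlock that is >= requested
};

BlockSizeCandidates nearestMultiples(unsigned int requested, unsigned int hwBlock)
{
    assert(hwBlock > 0);
    unsigned int below = (requested / hwBlock) * hwBlock;
    unsigned int above = (below == requested) ? requested : below + hwBlock;
    return {below, above};
}
```

With a hardware block of 64, a request for 480 falls between 448 and 512. With a hardware block of 32, 480 is already a whole multiple (15 x 32).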
If your algorithm has to run once every 480 samples and there's no way to tweak that, then running it in the audio thread will indeed cause dropouts even while the CPU usage is still far from 100%. In that case I suggest rolling the blocksize back to the maximum hardware block size for your board and implementing the buffering in your user code, running your processing in a dedicated real-time thread. See the FFT examples in Audio/ for an example of how that can work.
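As a rough idea of what the buffering in user code could look like, here is a minimal sketch. It is not taken from the FFT examples: `BlockAccumulator` and its methods are hypothetical names, and the code deliberately leaves out the thread itself. It only shows the part that runs in the audio callback, which copies each hardware block into one of two alternating buffers and reports when a full 480-sample block is ready.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, NOT part of the Bela API: collects small hardware
// blocks into fixed-size processing blocks using two alternating buffers,
// so a completed block stays untouched while the next one fills up.
class BlockAccumulator {
public:
    explicit BlockAccumulator(std::size_t processingSize) : size_(processingSize)
    {
        assert(processingSize > 0);
        buffers_[0].assign(size_, 0.0f);
        buffers_[1].assign(size_, 0.0f);
    }

    // Call from the audio callback with one hardware block of n samples.
    // Returns how many processing blocks were completed during this call.
    // Works sample by sample, so n does not have to divide the block size.
    unsigned int push(const float* in, std::size_t n)
    {
        unsigned int completed = 0;
        for (std::size_t i = 0; i < n; ++i) {
            buffers_[writeIndex_][fill_++] = in[i];
            if (fill_ == size_) {
                readyIndex_ = writeIndex_;     // this buffer is now complete
                writeIndex_ = 1 - writeIndex_; // keep writing into the other one
                fill_ = 0;
                ++completed;
            }
        }
        return completed;
    }

    // The most recently completed block, for the processing thread to read.
    const std::vector<float>& ready() const { return buffers_[readyIndex_]; }

private:
    std::size_t size_;
    std::vector<float> buffers_[2];
    std::size_t writeIndex_ = 0;
    std::size_t readyIndex_ = 0;
    std::size_t fill_ = 0;
};
```

Whenever `push()` returns non-zero, that is the point where you would wake your dedicated thread and let it work on `ready()`. The sketch assumes the thread finishes before the other buffer fills up (i.e. within 480 samples' worth of time), otherwise the block it is reading gets overwritten.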