Yet another voice in the mix... and unfortunately I don't have anything helpful to say about Pd implementations.
I know for sure the Sirlabs vocoder uses the analog-style approach using band-pass filter banks:
https://www.sirlab.de/linux/download_vocoder.html
I also made one for Rakarrack years ago:
https://sourceforge.net/p/rakarrack/git/ci/master/tree/src/Vocoder.C
https://sourceforge.net/p/rakarrack/git/ci/master/tree/src/Vocoder.h
[audio starts at about 2:35 -- sorry bad video]
The unique feature was the implementation of dynamic range compression in the (voice) side-chain. This helped reduce how much you need to swallow the mic for it to work. The vocoder also allows for an arbitrary number of channels.
Unfortunately this is the same situation as with what @matt has offered: it is not straightforward to implement any of these on Bela for somebody not familiar with C programming. Even though Rakarrack and the Sirlabs vocoder are complete, fully functioning vocoders, the new challenge is pulling the right files from the sources and getting the setup, inputs and outputs correctly configured in Bela.
As for making a patch in Pure Data, maybe the best answer is "try, try again". This is fun and rewarding when you do it yourself and succeed. If you understand the basic concept, then the problems in your failed attempts are likely simple. Each channel has two identically tuned band-pass filters (though you might try playing with different Q between the detector and carrier filters):
Voice->Filter1 -> envelope detector
Carrier->Filter2 -> variable gain cell
Envelope detector -> sets gain in variable gain cell.
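To pin down the signal flow above, here is a rough single-channel sketch in Python (not Pd, and not the Rakarrack or Sirlabs code). The class names, the default Q values and the attack/release times are made-up starting points, and the band-pass uses the common "cookbook" biquad form as an assumption about what a reasonable filter looks like.

```python
import math

class BandPass:
    """Second-order band-pass biquad (cookbook form, 0 dB gain at the centre)."""
    def __init__(self, freq_hz, q, sample_rate):
        w0 = 2.0 * math.pi * freq_hz / sample_rate
        alpha = math.sin(w0) / (2.0 * q)
        a0 = 1.0 + alpha
        self.b0 = alpha / a0
        self.b2 = -alpha / a0          # b1 is zero for this filter type
        self.a1 = -2.0 * math.cos(w0) / a0
        self.a2 = (1.0 - alpha) / a0
        self.x1 = self.x2 = self.y1 = self.y2 = 0.0

    def process(self, x):
        y = self.b0 * x + self.b2 * self.x2 - self.a1 * self.y1 - self.a2 * self.y2
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y

class EnvelopeFollower:
    """Rectifier followed by a one-pole smoother with separate attack/release."""
    def __init__(self, attack_s, release_s, sample_rate):
        self.attack = math.exp(-1.0 / (attack_s * sample_rate))
        self.release = math.exp(-1.0 / (release_s * sample_rate))
        self.level = 0.0

    def process(self, x):
        rect = abs(x)
        coeff = self.attack if rect > self.level else self.release
        self.level = coeff * self.level + (1.0 - coeff) * rect
        return self.level

class VocoderChannel:
    """Voice -> filter 1 -> envelope; carrier -> filter 2 -> gain set by envelope."""
    def __init__(self, freq_hz, sample_rate, detector_q=4.0, carrier_q=4.0,
                 attack_s=0.005, release_s=0.03):   # hypothetical defaults
        self.voice_filter = BandPass(freq_hz, detector_q, sample_rate)
        self.carrier_filter = BandPass(freq_hz, carrier_q, sample_rate)
        self.detector = EnvelopeFollower(attack_s, release_s, sample_rate)

    def process(self, voice, carrier):
        gain = self.detector.process(self.voice_filter.process(voice))
        return self.carrier_filter.process(carrier) * gain
```

A full vocoder would run one such channel per band on the same voice and carrier samples and sum the outputs. In Pd the same three pieces map onto a band-pass object per path, an envelope follower, and a signal multiply.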
Putting a dynamic range compressor in the voice channel helps make it a little more sensitive since you can jack up the gain to capture softer speech without overloading the circuit too badly when you scream.
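As an illustration of that idea, a very simple feed-forward compressor for the voice input might look like the sketch below. This is not how Rakarrack does it; the threshold, ratio, make-up gain and time constants are hypothetical values chosen only to show the behaviour (soft speech gets the full make-up gain, loud input gets pulled down).

```python
import math

class SideChainCompressor:
    """Feed-forward compressor sketch for the voice path (assumed parameters)."""
    def __init__(self, sample_rate, threshold_db=-30.0, ratio=4.0,
                 makeup_db=12.0, attack_s=0.005, release_s=0.1):
        self.threshold_db = threshold_db
        self.slope = 1.0 - 1.0 / ratio      # dB of reduction per dB over threshold
        self.makeup_db = makeup_db
        self.attack = math.exp(-1.0 / (attack_s * sample_rate))
        self.release = math.exp(-1.0 / (release_s * sample_rate))
        self.level = 0.0

    def process(self, x):
        # Track the input level with a rectifier and attack/release smoothing.
        rect = abs(x)
        coeff = self.attack if rect > self.level else self.release
        self.level = coeff * self.level + (1.0 - coeff) * rect
        # Compute the gain in dB: make-up gain minus reduction above threshold.
        level_db = 20.0 * math.log10(max(self.level, 1e-9))
        over_db = max(0.0, level_db - self.threshold_db)
        gain_db = self.makeup_db - self.slope * over_db
        return x * 10.0 ** (gain_db / 20.0)
```

The compressed voice would then feed the detector filters of every channel; the carrier path is left alone.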
The "gotchas":
1) Attack/release times on the envelope detector. If too short then the output will be too distorted. If too long then the formants will be smeared and you won't get a very prominent sound.
2) Filter Q: higher resonance generally helps make for more intelligible speech, but if you have insufficient filter bands then this simply sounds bad...well it sounds bad when extreme no matter how many bands you have.
3) Filter bands: At least 7
4) Filter tuning: Space the bands evenly on a logarithmic scale to cover 200 Hz to 4 kHz. Add a low-pass filter below the lowest band and a high-pass filter above the highest band. If you spread your filters out like a graphical EQ (20 Hz to 20 kHz) then you're wasting the vocoder's resolution on bands that do not contain the formants.
I found in practice that tuning to the typical Crybaby wah range (450 Hz to 2.5 kHz) covers almost all of the interesting formants. This is a hint that most of your bands should be focused in this range. It is all a trade-off between the number of bands and the frequency resolution on the bands where formants are dominant.
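For the logarithmic spacing in gotcha 4, the band centres are just a geometric series. A small Python sketch, with hypothetical function names; the Q helper is only a rule-of-thumb starting point (neighbouring bands meeting roughly at their -3 dB edges), not a value taken from any of the vocoders mentioned here:

```python
def band_centers(num_bands, low_hz=200.0, high_hz=4000.0):
    """Centre frequencies spaced by a constant ratio from low_hz to high_hz."""
    if num_bands < 2:
        raise ValueError("need at least two bands")
    ratio = (high_hz / low_hz) ** (1.0 / (num_bands - 1))
    return [low_hz * ratio ** i for i in range(num_bands)]

def touching_q(ratio):
    """Q at which neighbouring bands spaced by `ratio` meet near their -3 dB edges."""
    return ratio ** 0.5 / (ratio - 1.0)

centers = band_centers(7)
spacing = centers[1] / centers[0]
for f in centers:
    print(round(f, 1))
print("starting Q:", round(touching_q(spacing), 2))
```

With 7 bands over 200 Hz to 4 kHz the neighbours end up a factor of about 1.65 apart, which puts the starting Q near 2. Narrowing the range toward 450 Hz to 2.5 kHz, or adding bands, shrinks the spacing and pushes that Q up.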
Each of those parameters can be run-time adjustable. If you design your patch to allow you to dial in these parameters then you can find out what works best for your voice and carrier source.