What bit of the project should I focus on to start optimizing it? If you give me a specific task and context, I can probably contribute; I just don't have any general understanding of the project as a whole.
You won't get much out of a chip with such small L1 (16 KB + 16 KB) and L2 (256 KB) caches when running MAME.
Your best bet is probably to optimize the CPU cores, where "optimize" means swapping in an existing ARM assembly core (but beware of those; they probably aren't very accurate).
The next step is probably to write a new DRC (dynamic recompiler) back-end.
(All of that should only be done once you have profiled SDLMAME on the real target, of course.)
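If you want a rough feel for the cache limits on the target before profiling SDLMAME itself, a pointer-chasing microbenchmark is one way to do it. This is only a sketch and has nothing to do with MAME's code: `ns_per_access` and `xorshift64` are names made up here. It walks a randomly permuted array so that every load depends on the previous one, and reports the average time per access for a given working-set size.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Small xorshift PRNG (hypothetical helper) so the result does not
 * depend on the platform's RAND_MAX. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Average nanoseconds per dependent load over a working set of `bytes`
 * bytes. Returns a negative value if the set is too small or the
 * allocation fails. */
static double ns_per_access(size_t bytes, size_t steps)
{
    size_t n = bytes / sizeof(size_t);
    if (n < 2 || steps == 0)
        return -1.0;

    size_t *next = malloc(n * sizeof *next);
    if (next == NULL)
        return -1.0;

    for (size_t i = 0; i < n; i++)
        next[i] = i;

    /* Sattolo's algorithm: shuffle into a single cycle, so the walk
     * visits every element before repeating. */
    uint64_t seed = 0x9E3779B97F4A7C15ULL;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i);   /* 0 <= j < i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }

    /* Each load's address comes from the previous load, so the loop
     * is bound by memory latency rather than throughput. */
    size_t p = 0;
    clock_t t0 = clock();
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    clock_t t1 = clock();

    volatile size_t sink = p;   /* keep the compiler from dropping the loop */
    (void)sink;
    free(next);

    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
}
```

Calling it with working sets of about 16 KB, 256 KB, and a few MB (and a few million steps each) should show the time per access stepping up as the set stops fitting in L1 and then in L2. The absolute numbers depend on the chip, so treat them as a sanity check and not as a substitute for a real profile.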