Those are some good questions. They've come up over the years, but I haven't seen anyone ask them in a long while, so I'll take them one by one:
I haven't really verified this (sorry - I can't from where I am), but it seems that MAME is still a single-core application. I think there is a multithread switch somewhere in the .ini, but it used to be "not recommended". I might be totally wrong, because I don't remember.
MAME is multithreaded to the extent that it currently can be, within the two most important constraints of A) what the developers are interested in implementing, and B) what the developers have time to implement.
MAME will happily use whatever cores you throw at it, up to about 4 or 5 (there are diminishing returns here), in order to accelerate 3D drawing in drivers that support it. The actual rasterization of the 3D for the games on these drivers is done in software, rather than utilizing your GPU, for reasons I'll go over later. The drivers and games that benefit from this are largely those that use the poly.h system for threading off work units, which include Atari/Midway Seattle and Vegas games (CarnEvil, San Francisco Rush, NFL Blitz, Gauntlet Legends), the Nintendo 64 driver, Sega Model 2, Sega Model 3, the Gaelco 3D games (Radikal Bikers, Surf Planet, and Speed Up), and a number of others.
This can make a real difference in the performance of these games, although you wind up with diminishing returns: if two triangles cover the same scanlines of the screen, they need to be drawn in order, and so they can't be handed to different CPU cores, because you don't know which one would finish first. The more triangles you have, the more you can bucket them off onto different CPU cores, but also the more contention you have for those buckets.
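To make the bucketing idea concrete, here's a minimal standalone sketch - this is not MAME's actual poly.h interface, just the underlying technique - in which the screen is split into horizontal bands, each band is owned by exactly one worker thread, and a triangle is queued to every band it touches, so draw order is preserved within each band without any locking:

```cpp
// Minimal sketch of scanline-bucket rasterization (NOT MAME's real
// poly.h API). Each horizontal band of the screen is owned by one
// worker thread; a triangle goes into the bucket of every band it
// touches, in submission order.
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int SCREEN_W = 320, SCREEN_H = 240;
constexpr int NUM_BANDS = 4;                  // one band per worker
constexpr int BAND_H = SCREEN_H / NUM_BANDS;

struct Tri { int y_min, y_max; uint32_t color; };  // flat-shaded stand-in

uint32_t framebuffer[SCREEN_H][SCREEN_W];

// Rasterize the part of 'tri' that falls inside rows [y0, y1).
static void raster_span(const Tri &tri, int y0, int y1)
{
    for (int y = std::max(y0, tri.y_min); y < std::min(y1, tri.y_max); y++)
        for (int x = 0; x < SCREEN_W; x++)    // placeholder: full-width fill
            framebuffer[y][x] = tri.color;
}

static void render(const std::vector<Tri> &tris)
{
    // Each band collects, in submission order, the triangles touching it.
    std::vector<Tri> buckets[NUM_BANDS];
    for (const Tri &t : tris)
        for (int b = 0; b < NUM_BANDS; b++)
            if (t.y_min < (b + 1) * BAND_H && t.y_max > b * BAND_H)
                buckets[b].push_back(t);

    // One worker per band; bands don't overlap, so there's no
    // contention on the framebuffer itself.
    std::vector<std::thread> workers;
    for (int b = 0; b < NUM_BANDS; b++)
        workers.emplace_back([&buckets, b] {
            for (const Tri &t : buckets[b])
                raster_span(t, b * BAND_H, (b + 1) * BAND_H);
        });
    for (auto &w : workers)
        w.join();
}

int main()
{
    render({{0, 240, 0xff0000u}, {100, 150, 0x00ff00u}});
}
```

Note how a tall triangle gets queued into every band it crosses, so each additional core buys you less than the one before it - the diminishing returns mentioned above.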
So could I ask, what is the current status and future prospect (where "future" means the next few months or years) of MAME doing things in parallel?
By that I don't mean only utilizing multiple cores (and threads), but also utilizing GPU cores and other such new technologies where possible.
The only thing modern general-purpose GPU computing technology would be good for, in the context of MAME, is providing additional acceleration to the rendering on drivers that have 3D hardware.
Based on your questions, it appears you think it should be possible to run different emulated devices on different host CPU cores, and that's just not something that is either worth doing or even doable at non-glacial speeds. You also appear to misunderstand how MAME works internally.
First, the different chips in an arcade machine - and indeed, any computer or console - are in fact very tightly coupled. When a signal goes high or low, it happens within a single clock tick or less. This means that these chips can potentially exchange data, relying on each other's results, with remarkably tight synchronization. For an example of how costly that kind of synchronization is at the component level, just check out the driver for Pong, which is emulated at the discrete-component level via netlist: it doesn't even have a CPU, yet it's one of the slower drivers in MAME. By contrast, the different cores in your PC operate at their best when they have little to no communication between them.
By way of an analogy, the components on a PCB are more akin to a football team interacting on the pitch toward a common goal, whereas the cores in your CPU are more like a synchronized swimming team: all doing the same thing, in unison, with barely any interaction.
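To put a rough number on that coupling cost, here's a toy benchmark of my own devising (not MAME code): two "chips" that must exchange a value every clock tick, first interleaved on one host core, then handshaking across two cores through an atomic flag:

```cpp
// Toy benchmark (my own, not MAME code) of the coupling cost: two
// "chips" that must exchange a value on every emulated clock tick,
// first interleaved on one core, then split across two threads.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr long TICKS = 1'000'000;

int main()
{
    using clk = std::chrono::steady_clock;
    using us = std::chrono::microseconds;

    // Single-threaded: chip A then chip B, once per emulated tick.
    volatile long bus = 0;
    auto t0 = clk::now();
    for (long i = 0; i < TICKS; i++) {
        bus = bus + 1;   // chip A drives the bus
        bus = bus ^ 1;   // chip B reads the bus and reacts
    }
    auto single = std::chrono::duration_cast<us>(clk::now() - t0).count();

    // Two threads: the same data dependency now forces a cross-core
    // handshake on every single tick.
    std::atomic<long> abus{0};
    std::atomic<bool> turn_a{true};
    auto t1 = clk::now();
    std::thread chip_a([&] {
        for (long i = 0; i < TICKS; i++) {
            while (!turn_a.load(std::memory_order_acquire)) { /* spin */ }
            abus.fetch_add(1, std::memory_order_relaxed);
            turn_a.store(false, std::memory_order_release);
        }
    });
    std::thread chip_b([&] {
        for (long i = 0; i < TICKS; i++) {
            while (turn_a.load(std::memory_order_acquire)) { /* spin */ }
            abus.fetch_xor(1, std::memory_order_relaxed);
            turn_a.store(true, std::memory_order_release);
        }
    });
    chip_a.join();
    chip_b.join();
    auto dual = std::chrono::duration_cast<us>(clk::now() - t1).count();

    std::printf("single-threaded: %lld us, two threads: %lld us\n",
                (long long)single, (long long)dual);
}
```

You can expect the two-thread version to lose by an order of magnitude or more, despite using twice the cores, because every handshake bounces a cache line between them - and a real board has far more than two chips exchanging signals far more often than this.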
I would never say to sacrifice the core mission of MAME, emulating hardware at the lowest possible level and accuracy (I am not talking about taking "shortcuts"*), but the real hardware that MAME tries to emulate DOES run things in parallel anyway, right? (I mean each component does its own job; it's not one component after the other.)
Could you shed some light?
---
(* this just came to me: maybe "shortcuts" shouldn't be out of the question either... I mean, sometimes, just to move things forward, SIMULATING a part may be a stepping stone towards EMULATING it... maybe a component's status in the future could be "not working", "incomplete", "simulated", or "working". I believe this would kick many mechanical machines forward...)
Thoughts?
And this is why I think you fundamentally misunderstand how MAME works internally. Measured against the lowest possible level of accuracy - emulating all of the ICs on a board individually, along with how they're interconnected - MAME is nowhere near that level in most cases. The developers deviate from that particular level of accuracy all the time. The best examples I can think of are the implementations of the video hardware in various '80s (and later) arcade games: in most cases it's not handled by a single IC, it's implemented either with multiple ASICs or controllers, or with a bunch of TTL ICs. It's rare indeed for MAME to emulate the video hardware at that level.
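For a concrete sense of what that higher abstraction level looks like, here's a standalone sketch (not a real MAME driver; the names are my own): rather than modelling the TTL counters and shift registers that generate the video signal, a driver typically just computes the frame those chips would have produced, reading tile indices out of emulated video RAM once per frame:

```cpp
// Standalone sketch (not a real MAME driver) of higher-level video
// emulation: compute the frame the video chain would have produced,
// instead of simulating the chain itself.
#include <cstdint>
#include <cstdio>

constexpr int TILES_X = 40, TILES_Y = 30;  // 320x240 screen of 8x8 tiles
constexpr int TILE_W = 8, TILE_H = 8;

uint8_t  vram[TILES_Y][TILES_X];                      // tile indices
uint8_t  charrom[256][TILE_H][TILE_W];                // tile pixel data
uint32_t palette[256];                                // color lookup
uint32_t bitmap[TILES_Y * TILE_H][TILES_X * TILE_W];  // host framebuffer

// In a real driver, logic like this lives in a per-frame screen
// update callback rather than in a per-IC simulation.
void draw_frame()
{
    for (int ty = 0; ty < TILES_Y; ty++)
        for (int tx = 0; tx < TILES_X; tx++) {
            const uint8_t tile = vram[ty][tx];
            for (int py = 0; py < TILE_H; py++)
                for (int px = 0; px < TILE_W; px++)
                    bitmap[ty * TILE_H + py][tx * TILE_W + px] =
                        palette[charrom[tile][py][px]];
        }
}

int main()
{
    draw_frame();
    std::printf("pixel(0,0) = %08x\n", (unsigned)bitmap[0][0]);
}
```

The output is pixel-identical to what the board produces in the common case, at a tiny fraction of the cost of stepping every counter and shift register.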
I'm looking at making that sort of trade-off myself right now: I'm looking into emulating the Fairlight CMI series of synthesizers, and I have schematics for two of the types of boards that would be in the rack unit. I could, in theory, emulate the components on those schematics using MAME's netlist library. In practice, I'm very unlikely to do that, because at the end of the day the schematics already exist, and I feel that MAME loses its documentary value for the games themselves if it needlessly sacrifices speed for accuracy when, evaluated in context, it can be both sufficiently speedy and sufficiently accurate.
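To give a back-of-the-envelope feel for why component-level emulation is so costly, here's a standalone sketch (my own, not MAME's netlist code, which describes a schematic component-by-component and solves the coupled network): even one RC low-pass node stepped at a 48 kHz solver rate means 48,000 integration steps per emulated second, and a real board is hundreds of such nodes, coupled through matrix solves rather than the independent steps shown here:

```cpp
// Back-of-the-envelope sketch (my own, not MAME's netlist code) of
// component-level simulation cost: one RC low-pass node integrated
// with explicit Euler at a 48 kHz solver rate.
#include <cstdio>

int main()
{
    const double R   = 10'000.0;       // 10 kOhm series resistor
    const double C   = 100e-9;         // 100 nF capacitor to ground
    const double dt  = 1.0 / 48'000;   // solver timestep
    const double vin = 5.0;            // step input voltage

    double vout = 0.0;
    for (int step = 0; step < 48'000; step++)  // one emulated second
        vout += dt * (vin - vout) / (R * C);

    std::printf("vout after 1s: %f V\n", vout);  // converges toward 5 V
}
```

Multiply that by every resistor, capacitor, and gate on a pair of large boards and you can see where the host cycles go, and why a higher-level model of the same boards is the pragmatic choice when the schematics already serve as the documentation.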
Accuracy is not an all-or-nothing proposition. It ultimately comes down to what level of accuracy a given driver author is comfortable with, plus a tacit agreement among the developers as to a minimum level of accuracy that we strive to maintain. And quite simply, beyond accelerating the 3D drivers, MAME wouldn't be able to benefit from any additional parallelism.