For Apple this is simple. They control the whole widget. They give developers, for example, the Core ML framework for writing machine learning code. Whether Core ML runs on Apple’s CPU or the Neural Engine is an implementation detail developers don’t have to care about.
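
To make that concrete, here’s a minimal sketch of what this looks like from the developer’s side (the model file name is a hypothetical placeholder): you only declare which compute units Core ML is *allowed* to use, and the system decides where the model actually executes.

```swift
import CoreML
import Foundation

// A compiled Core ML model bundle; name and path are hypothetical.
let modelURL = URL(fileURLWithPath: "MyClassifier.mlmodelc")

let config = MLModelConfiguration()

// .all tells Core ML it may use the CPU, GPU, or Neural Engine;
// which one actually runs the model is Apple's decision, not ours.
config.computeUnits = .all

do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    print(model.modelDescription)
} catch {
    print("Failed to load model: \(error)")
}
```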

Origin: Erik Engheim – Why is Apple’s M1 chip so fast?

What Apple has done is simply to take a more radical step in this direction. Rather than just having general-purpose cores and memory, the M1 contains a wide variety of specialized chips:

  • Central Processing Unit (CPU) — The “brains” of the SoC. Runs most of the code of the operating system and your apps.
  • Graphics Processing Unit (GPU) — Handles graphics-related tasks, such as visualizing an app’s user interface and 2D/3D gaming.
  • Image Signal Processor (ISP) — Can be used to speed up common tasks done by image processing applications.
  • Digital Signal Processor (DSP) — Handles more mathematically intensive work than a CPU, such as decompressing music files (see the sketch after this list).
  • Neural Processing Unit (NPU) — Used in high-end smartphones to accelerate machine learning (AI) tasks. These include voice recognition and camera processing.
  • Video encoder/decoder — Handles the power-efficient conversion of video files and formats.
  • Secure Enclave — Encryption, authentication and security.
  • Unified memory — Allows the CPU, GPU and other cores to quickly exchange information.
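
As an illustration of that division of labor, here is a small sketch using Apple’s Accelerate framework. Strictly speaking, vDSP runs on the CPU’s vector hardware rather than on a discrete DSP block, but the principle is the same one Core ML applies to the Neural Engine: you call a high-level framework, and Apple routes the work to the most suitable silicon.

```swift
import Accelerate

// Two audio-sized signal buffers; the sizes and values are illustrative.
let a = [Float](repeating: 1.5, count: 4096)
let b = [Float](repeating: 2.0, count: 4096)

// Element-wise multiply, vectorized for you by the framework.
let product = vDSP.multiply(a, b)

// Sum of all elements, also vectorized.
let total = vDSP.sum(product)
print(total)  // 12288.0
```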

This is part of the reason why so many people doing image and video editing on the M1 Macs are seeing such speed improvements. Many of the tasks they perform can run directly on specialized hardware.
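
The media engine, for example, is exposed through frameworks such as VideoToolbox. A minimal sketch: an app can ask whether a codec is decoded in hardware, and on the M1 the HEVC work lands on the dedicated video decoder rather than the CPU cores.

```swift
import VideoToolbox

// Ask whether this machine decodes HEVC in hardware. On the M1 the
// work goes to the dedicated media engine instead of the CPU cores.
if VTIsHardwareDecodeSupported(kCMVideoCodecType_HEVC) {
    print("HEVC decode runs on dedicated hardware here")
} else {
    print("HEVC decode would fall back to software")
}
```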

In blue you see multiple CPU cores accessing memory, and in green you see large numbers of GPU cores accessing memory.

Unified memory may confuse you. How is it different from shared memory? And wasn’t sharing video memory with main memory a terrible idea in the past, one that gave poor performance? Yes, shared memory was indeed bad. The reason was that the CPU and GPU had to take turns accessing the memory. Sharing it meant contention for the data bus. Basically, the GPU and CPU had to take turns using a narrow pipe to push or pull data through.

That is not the case with unified memory. With unified memory the GPU cores and CPU cores can access memory at the same time, so there is no overhead in sharing it. In addition, the CPU and GPU can tell each other where data is located. Previously the CPU would have to copy data from its area of main memory to the area used by the GPU. With unified memory, it is more like saying “Hey Mr. GPU, I got 30 MB of polygon data starting at memory location 2430.” The GPU can then start using that memory without doing any copying.
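
Here is a minimal Metal sketch of that idea, assuming an Apple silicon Mac (the buffer size is illustrative): the CPU writes into a buffer allocated with shared storage, and the GPU can be handed that very same memory without any copy to separate video memory.

```swift
import Metal

// Grab the default GPU -- on an M1 Mac this is the on-chip GPU
// sharing the unified memory pool with the CPU.
guard let device = MTLCreateSystemDefaultDevice() else {
    fatalError("No Metal device available")
}

// .storageModeShared allocates memory that both the CPU and the GPU
// access directly; nothing is ever copied across a bus.
let floatCount = 1024
let buffer = device.makeBuffer(length: floatCount * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// The CPU writes straight into the buffer...
let ptr = buffer.contents().bindMemory(to: Float.self, capacity: floatCount)
for i in 0..<floatCount { ptr[i] = Float(i) }

// ...and a compute or render pass can then be handed `buffer` as-is,
// e.g. encoder.setVertexBuffer(buffer, offset: 0, index: 0) --
// the "here's where the polygon data lives" handoff described above,
// with no copying.
```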

This means you can achieve significant performance gains, because all the various specialized co-processors on the M1 can rapidly exchange information through the same memory pool.

How Macs used GPUs before unified memory. There was even the option of having a graphics card outside the computer, connected with a Thunderbolt 3 cable. There is some speculation that this may still be possible in the future.