You don't design CPUs for a living unless you're talking about the manufacturing process, or maybe you're just bad at it and work for Intel. Your understanding of how FPGA works is super flawed, and your boner for GPUs is awkward. Let me explain some things as someone who actually works in this industry.
Matrix math is just stupid for whatever you pipe through it. It does the input, and gives an output.
That is exactly what all these "NPU" co-processing cores are about from AMD, Intel, and to a further subset Amazon and Google on whatever they're calling their chips now. They are all about an input and output for math operations as fast as possible.
In my own work, these little AMD XDNA chips pop out multiple segmented channels way better than GPUs when gated for single purpose. Image inference, audio, logic, you name it. And then, SHOCKER!, if I try and move this to a cloud instance, I can reprogram the chip on the fly to swap from one workload to another in 5ms. It's not just a single purpose math shoveling instance anymore, it's doing articulations on audio clips, or if the worker wants, doing ML transactions for data correlation. This costs almost 75% less than provisioning stock sets of any instances to do the same workload.
They are a great way to prototype ASICs, or to perform relatively simple low-latency/high-throughput tasks below the economies of scale where actually taping out an ASIC would make sense. But there is pretty much no case where an FPGA with a bunch of the same logic path is going to outperform a dedicated ASIC of the same logic.
NPUs are already the de facto ASIC accelerator for ML. Trying to replicate that functionality on an FPGA fabric of an older process node, with longer path lengths constraining timing, is going to be worse than a physically smaller dedicated ASIC.
It was the same deal with crypto-mining: the path for optimizing parallel compute is often doing it badly on a GPU first, moving to an FPGA if memory isn't a major constraint, then taping out ASICs once the bugs in the gateware are ironed out (and economies of scale allow).
And that doesn't even begin to cover the pain of FPGA tooling in general and particularly vendor HLS stacks.
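To put rough numbers on the economies-of-scale point, here is a minimal sketch of the break-even volume between shipping FPGAs and taping out an ASIC. The function name and every dollar figure are hypothetical, picked only for illustration; real NRE and unit costs vary enormously with process node and volume.

```python
import math

def break_even_units(asic_nre, asic_unit_cost, fpga_unit_cost):
    """Smallest volume at which the ASIC's total cost is no higher than the FPGA's.

    asic_nre:        one-time cost of taping out the ASIC (masks, design, verification)
    asic_unit_cost:  per-chip cost of the ASIC once in production
    fpga_unit_cost:  per-chip cost of an FPGA running the same logic
    """
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        # The ASIC is not cheaper per unit, so it never pays back its NRE.
        return None
    return math.ceil(asic_nre / saving_per_unit)

# Hypothetical numbers: $2M NRE, $20 per ASIC, $220 per FPGA.
print(break_even_units(asic_nre=2_000_000, asic_unit_cost=20, fpga_unit_cost=220))  # 10000
```

Below that volume the FPGA is the cheaper option even though it is slower per unit of logic; above it, the ASIC wins on both cost and performance.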
Matrix math is just stupid for whatever you pipe through it. It does the input, and gives an output.
Indeed.
That is exactly what all these "NPU" co-processing cores are about from AMD, Intel, and to a further subset Amazon and Google on whatever they're calling their chips now. They are all about an input and output for math operations as fast as possible.
Yes, they are all matrix math accelerators, and none of them have any FPGA aspects.
You don't design CPUs for a living unless you're talking about the manufacturing process, or maybe you're just bad at it and work for Intel. Your understanding of how FPGA works is super flawed, and your boner for GPUs is awkward. Let me explain some things as someone who actually works in this industry.
Matrix math is just stupid for whatever you pipe through it. It does the input, and gives an output.
That is exactly what all these "NPU" co-processing cores are about from AMD, Intel, and to a further subset Amazon and Google on whatever they're calling their chips now. They are all about an input and output for math operations as fast as possible.
In my own work, these little AMD XDNA chips pop out multiple segmented channels way better than GPUs when gated for single purpose. Image inference, audio, logic, you name it. And then, SHOCKER!, if I try and move this to a cloud instance, I can reprogram the chip on the fly to swap from one workload to another in 5ms. It's not just a single purpose math shoveling instance anymore, it's doing articulations on audio clips, or if the worker wants, doing ML transactions for data correlation. This costs almost 75% less than provisioning stock sets of any instances to do the same workload.
You have no idea what you're talking about.
Why are you being so condescending about this?
FPGAs are a great tool, but they're not magic.
They are a great way to prototype ASICs, or to perform relatively simple low-latency/high-throughput tasks below the economies of scale where actually taping out an ASIC would make sense. But there is pretty much no case where an FPGA with a bunch of the same logic path is going to outperform a dedicated ASIC of the same logic.
NPUs are already the de facto ASIC accelerator for ML. Trying to replicate that functionality on an FPGA fabric of an older process node, with longer path lengths constraining timing, is going to be worse than a physically smaller dedicated ASIC.
It was the same deal with crypto-mining: the path for optimizing parallel compute is often doing it badly on a GPU first, moving to an FPGA if memory isn't a major constraint, then taping out ASICs once the bugs in the gateware are ironed out (and economies of scale allow).
And that doesn't even begin to cover the pain of FPGA tooling in general and particularly vendor HLS stacks.
Indeed.
Yes, they are all matrix math accelerators, and none of them have any FPGA aspects.
Except AMD XDNA is a straight-up FPGA, and Intel XEco is as well.
For someone who claims to work in this industry, you sure have no idea what's going on.