Servethehome[1] does a somewhat better job of describing what Maverick-2 is and why it makes sense.
[1]https://www.servethehome.com/nextsilicon-maverick-2-brings-d...
That's a fairly specialized chip and requires a bunch of custom software. The only way it can run apps unmodified is if the math libraries have been customized for this chip. If the performance is there, people will buy it.
For a minute I thought maybe it was RISC-V with a big vector unit, but it's way different from that.
The article says they are also developing a RISC-V CPU.
The quote at the end of the posted Reuters article (not the one you’re responding to) says that it doesn’t require extensive code modifications. So is the “custom software” standard for the target customers of NextSilicon?
Companies often downplay the amount of software modification needed to benefit from their hardware platform's strengths, because quite often platforms that cannot run software out of the box lose out to those that can.
In the past, by the time special chips were completed and mature, the developers of "mainstream" CPUs had typically caught up in speed, which is why we do not see any "transputers" (e.g. the Inmos T800), LISP machines (Symbolics XL1200, TI Explorer II), or other odd architectures like the Connection Machine CM-2 around anymore.
For example, when Richard Feynman was hired to work on the Connection Machine, he had to write a parallel version of BASIC before he could write any programs for the computer they were selling: https://longnow.org/ideas/richard-feynman-and-the-connection...
This may also explain failures like that of the Bristol-based CPU startup Graphcore, which was acquired by SoftBank, but for less money than the investors had put in: https://sifted.eu/articles/graphcore-cofounder-exits-company...
XMOS (the spiritual successor to Inmos) is still kicking around, though it’s not without its challenges, for the reasons you mention.
>> says that it doesn’t require extensive code modifications
If they provide a compiler port and update things like BLAS to support their hardware, then higher-level applications should not require much, if any, code modification.
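As a minimal sketch of that point (generic CBLAS; nothing NextSilicon-specific is assumed here): an application written purely against the standard BLAS interface only needs to be recompiled and relinked against a vendor-tuned library to pick up new hardware, with no source changes.

```cpp
// Generic CBLAS usage: the application never names the hardware.
// Link against whichever BLAS implementation the vendor provides.
#include <cblas.h>
#include <vector>

int main() {
    const int n = 512;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, dispatched to whatever BLAS is linked in.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                B.data(), n,
                0.0, C.data(), n);
    return 0;
}
```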
It's a bit more complicated: you need to use their compiler (an LLVM fork with clang and Fortran front ends). That in itself is not that special, as most accelerator toolchains (icc, nvcc, aoc) already require this.
Modifications are likely on the level of: does this clang support my required C++ version? Actual work is only required when you want to bring something else, like Rust (AFAIK not supported).
However, to analyze the efficiency of the code and how it is interpreted by the card, you need their special toolchain. Debugging also becomes less convenient.
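For the "does this clang support my required C++ version" question, a quick probe like the one below answers it early. This is generic C++ and not tied to NextSilicon's toolchain; the compiler name and the specific feature-test macros are just placeholders for whatever your code actually relies on.

```cpp
// Hypothetical portability probe. Build with the vendor's clang, e.g.
//   vendor-clang++ -std=c++20 probe.cpp
// A clean compile means the standard version and the listed library
// features are available; the #error lines fire otherwise.
#if __has_include(<version>)
#  include <version>
#endif

#if __cplusplus < 202002L
#error "toolchain does not provide C++20"
#endif
#if !defined(__cpp_lib_span)
#error "standard library lacks <span> (C++20)"
#endif

int main() { return 0; }
```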
I've also found their "Technology Launch" video[1], which goes somewhat deeper into the details (they also show code examples).
[1] https://www.youtube.com/watch?v=krpunC3itSM
They've got a "Mill Core" in there - is the design related to the Mill Computing design?
Yeah, it's an unfortunate overlap. The Mill-Core, in NextSilicon terminology, is the software-defined "configuration" of the chip, so to speak: it represents the swaths of the application deemed worthy of acceleration, as expressed on the custom HW.
So really the Mill-Core is, in a way, the expression of the customer's code.
They are completely different designs, but the name is inspired by the same source: the Mill component in Charles Babbage's Analytical Engine.
https://archive.is/6j2p4
I can't access the page directly, because my browser doesn't leak enough identifying information to convince Reuters I'm not a bot, but an actual bot is perfectly capable of accessing the page.
Same, but I can’t access archive.is either because of the VPN.
Odd, that doesn't load for me, but https://archive.ph/6j2p4 does.
Archive.is is broken if you use Cloudflare DNS.
The other company I can think of focusing on FP64 is Fujitsu with its A64FX processor. This is an ARM64 chip with really meaty SIMD that gets about 3 TFLOPS of FP64.
I guess it is hard to compare chip for chip, but the question is: if you are building a supercomputer (and we ignore pressure to buy sovereign), which is the better bang for the buck on representative workloads?
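As a rough sanity check on that figure (my back-of-the-envelope arithmetic, assuming the 48-core, dual 512-bit SVE configuration, not Fujitsu's spec sheet): 48 compute cores × two 512-bit SVE pipes × 8 FP64 lanes × 2 FLOPs per FMA is 32 FLOPs per core per cycle, so at roughly 2.0 to 2.2 GHz that comes out around 3.0 to 3.4 TFLOPS of FP64.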
If Fujitsu only releases one processor every 8 years, they're going to be behind most of the time.
All processors are inherently behind. First the research comes out or standards are ratified; much later, silicon is fabbed. For example, PCIe Gen 6 was ratified years ago, but I haven't seen anything that uses it yet. Maybe you could argue that their silicon is behind others', but it's all about what their market is and what their customers are demanding.
Curious if the architecture is similar to what is called “systolic” as in the Anton series of supercomputers: https://en.wikipedia.org/wiki/Anton_(computer)
I was an architect on the Anton 2 and 3 machines - the systolic arrays that computed pairwise interactions were a significant component of the chips, but there were also an enormous number of fairly normal looking general-purpose (32-bit / 4-way SIMD) processor cores that we just programmed with C++.
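For readers unfamiliar with the workload: the pairwise-interaction part that the systolic arrays handled is essentially an all-pairs force kernel. Here is a generic scalar sketch in C++, purely illustrative and with a placeholder potential; this is not Anton's code.

```cpp
// Generic all-pairs interaction kernel (illustrative only). Each pair
// contributes a force that depends only on the pair's separation, which is
// what makes this map well onto an array streaming particles past each other.
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { double x, y, z, fx, fy, fz; };

void all_pairs_forces(std::vector<Particle>& p) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz;
            // Placeholder inverse-square potential; real MD kernels use
            // Lennard-Jones plus electrostatics, cutoffs, exclusions, etc.
            const double s = 1.0 / (r2 * std::sqrt(r2));
            p[i].fx += s * dx; p[i].fy += s * dy; p[i].fz += s * dz;
            p[j].fx -= s * dx; p[j].fy -= s * dy; p[j].fz -= s * dz;
        }
    }
}
```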
I spent a lot of time on systolic arrays to compute cryptocurrency PoW (BLAKE2 specifically). It’s an interesting problem and I learned a lot, but made no progress. I’ve often wondered if anyone else has done the same.
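For context on why that mapping is hard: BLAKE2b's core mixing step is a short chain of adds, XORs, and fixed rotations in which every operation depends on the previous one, so a single hash exposes almost no spatial parallelism to lay out across an array; throughput has to come from hashing many independent candidates side by side. Below is my sketch of the standard G function from RFC 7693 to show the dependency chain.

```cpp
#include <cstdint>

// Standard BLAKE2b G mixing function (RFC 7693). Every line depends on the
// result of the line before it, so within one hash there is little to spread
// spatially across a systolic array.
static inline uint64_t rotr64(uint64_t v, unsigned n) {
    return (v >> n) | (v << (64 - n));
}

static inline void G(uint64_t& a, uint64_t& b, uint64_t& c, uint64_t& d,
                     uint64_t x, uint64_t y) {
    a = a + b + x;  d = rotr64(d ^ a, 32);
    c = c + d;      b = rotr64(b ^ c, 24);
    a = a + b + y;  d = rotr64(d ^ a, 16);
    c = c + d;      b = rotr64(b ^ c, 63);
}
```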
You should check out AMD's NPU architecture.
Not really. I work for NextSilicon. It's a data-flow oriented design. We will eventually have more details available that gradually explain this.
Even if the hardware is really good, the software should be even better if they want to succeed.
Support for operating systems, compilers, programming languages, etc.
This is why a Raspberry Pi is still so popular even though there are a lot of cheaper alternatives with theoretically better performance. The software support is often just not as good.
Their customers are building supercomputers?
If you want your customers to spend supercomputing money, you need to have a way for those customers to explore and learn to leverage your systems without committing a massive spend.
ARM, x86, and CUDA-capable stuff is available off the shelf at Best Buy. This means researchers don't need massive grants or tremendous corporate investment to build proofs of concepts, and it means they can develop in their offices software that can run on bigger iron.
IBM's POWER series is an example of what happens when you don't have this. Minimum spend for the entry-level hardware is orders of magnitude higher than the competition, which means, practically speaking, you're all-in or not at all.
CUDA is also a good example of bringing your product to the users. AMD spent years locking ROCm behind weird market-segmentation games, and even today the 'supported' list in the ROCm documentation only shows a handful of ultra-recent cards. CUDA, meanwhile, happily ran on your ten-year-old laptop, even if it didn't run great.
People need to be able to discover what makes your hardware worth buying.
The implication wasn't to use the Raspberry Pi toolchain, just that toolchains are required and are a critical part of developing for new hardware. The Intel/AMD toolchains they will be competing with are even more mature than the rpi's. And toolchain availability and ease of use make a huge difference whether you are developing for supercomputers or embedded systems. From the article:
"It uses technology called RISC-V, an open computing standard that competes with Arm Ltd and is increasingly being used by chip giants such as Nvidia and Broadcom."
So the fact that rpi tooling is better than the imitators', and that it has maintained a significant market-share lead, is relevant. Market share isn't just about performance and price. It's also about ease of use and the network effects that come with popularity.
I'm personally boycotting Israeli companies for obvious reasons.
I find it helpful to read a saxpy and GEMM kernel for a new accelerator like this - do they have an example?
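For reference, here is the plain scalar baseline in C++, i.e. the kernel a vendor example would presumably show being offloaded; I haven't seen a NextSilicon version of it.

```cpp
#include <cstddef>

// Plain scalar saxpy: y = a*x + y. The accelerator port of this kernel
// (launch/offload syntax, memory placement, pragmas) usually tells you the
// most about how a new architecture wants to be programmed.
void saxpy(std::size_t n, float a, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```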
If there really is enough market demand for this kind of processor, it seems like someone like NEC, which still makes vector processors, would be better poised than a startup rolling its own RISC-V.
I work at NS. The RISC-V CPU was the "one more thing" aspect of the "reveal".
The main product/architecture discussed has nothing to do with vector processors or RISC-V.
It's a new, fundamentally different data-flow processor.
Hopefully we will get better at explaining what we do and why people may want to care.
So, a systolic array[1] spiced up with a pinch of control flow and a side of compiler cleverness? At least that's the impression I get from the servethehome article linked upthread. I wasn't able to find technical details beyond better-than-sliced-bread marketing in 3 minutes of poking at your website.
[1]: https://en.wikipedia.org/wiki/Systolic_array
I can see why systolic arrays come to mind, but this is different. While there are indeed many ALUs connected to each other in both a systolic array and a data-flow chip, data-flow is usually more flexible (at the cost of complexity), and the ALUs can be thought of as residing on some shared fabric.
Systolic arrays often (always?) have a predefined communication pattern and are often used in problems where the data that passes through them is also retained in some shape or form (a toy sketch of that fixed-pattern style follows below).
For NextSilicon, the ALUs are reconfigured and rewired to express the application (or parts of it) on the parallel data-flow accelerator.
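To make the "predefined communication pattern" point concrete, here is a toy software model of a 1-D systolic-style FIR filter, in which each processing element holds one fixed coefficient and only exchanges data with its immediate neighbour. This is my illustration of the classical systolic idea, not a description of NextSilicon's hardware.

```cpp
#include <cstddef>
#include <vector>

// Toy 1-D systolic-style FIR: each PE holds one fixed coefficient and one
// delay register, and only talks to its immediate neighbour (samples shift
// right, partial sums ripple right). The wiring is decided at design time;
// a data-flow fabric instead rewires its ALUs to match the loaded application.
// (A real systolic design would also register the partial sums to pipeline
// the adder chain; this is just an illustration.)
std::vector<double> fir_systolic_model(const std::vector<double>& x,
                                       const std::vector<double>& coeff) {
    if (coeff.empty()) return {};
    std::vector<double> delay(coeff.size(), 0.0);
    std::vector<double> y;
    y.reserve(x.size());

    for (double in : x) {
        // Each PE takes the sample held by its left neighbour.
        for (std::size_t i = delay.size(); i-- > 1;)
            delay[i] = delay[i - 1];
        delay[0] = in;

        // Partial sum rippling PE-to-PE from left to right.
        double acc = 0.0;
        for (std::size_t i = 0; i < delay.size(); ++i)
            acc += coeff[i] * delay[i];
        y.push_back(acc);
    }
    return y;
}
```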
Are the GreenArray chips also systolic arrays?
My understanding is no, if I understand what people mean by systolic arrays.
GreenArray processors[1] are complete computers with their own memory and running their own software. The GA144 chip has 144 independently programmable computers with 64 words of memory each. You program each of them, including external I/O and routing between them, and then you run the chip as a cluster of computers.
[1] https://greenarraychips.com
Reminds me a bit of the Parallax Propeller chip.
Text on the front page of the NS website* leads me to think you have a fancy compiler: "Intelligent software-defined hardware acceleration". Sounds like Cerebras to my non-expert ears.
* https://www.nextsilicon.com
No real overlap with Cerebras. Have tons of respect for what they do and achieve, but unrelated arch / approach / target-customers.
NEC doesn't really make vector processors anymore. My company installed a new supercomputer built by NEC, and the hardware itself is actually Gigabyte servers running AMD Instinct MI300A, with NEC providing the installation, support, and other services.
https://www.nec.com/en/press/202411/global_20241113_02.html
I have designed software for a lot of exotic compute silicon, including systems that could be described in similar terms to this one. My useless superpower is that I am good at designing excellent data structures and algorithms for almost any plausible computing architecture.
From a cursory read-through, it isn’t clear where the high-leverage point is in this silicon. What is the thing, at a fundamental level, that it does better than any other silicon? It seems pretty vague. I’m not saying it doesn’t have one, just that it isn’t obvious from the media slop.
What’s the specific workload where I can beat any other silicon at the same task if I write the software to fit the silicon?
Sounds like an idea that would really benefit from a JIT-like approach to basically all software.
You can indeed, and should, assume there is a heavy JIT component to it. At the same time, it is important to note that this is geared toward already highly parallel code.
In other words, while the JIT can be applied to all code in principle, the nature of accelerator HW is that it pays off where embarrassingly parallel workloads are present.
Having said that, NextSilicon != GPU, so it's a different approach to accelerating said parallel code.
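To make "already highly parallel code" concrete, the kind of region such a runtime would presumably single out looks like the loop below. This is a generic shared-memory example of my own, not how NextSilicon's toolchain is actually programmed.

```cpp
#include <cstddef>
#include <vector>

// An embarrassingly parallel loop: every iteration is independent, so a
// runtime that profiles the binary can legitimately pick this region out and
// spread it across many ALUs. Serial, branchy control code gains little.
void scale_and_offset(std::vector<double>& v, double a, double b) {
    #pragma omp parallel for  // plain shared-memory parallelism; no vendor API assumed
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = a * v[i] + b;
}
```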
I definitely expect this to be a big hit.
In a way, this is not new; it's pretty much what Annapurna did: they took Arm and got serious with it, creating the first high-performance Arm CPUs. Then they got acqui-hired by Amazon and the rest is history ;)
I don’t want my electronics to contribute to genocide and apartheid and possibly the next pager exploding terror attack. No thanks.
It's not yours; you don't have to buy it.
I'd be fascinated to know who your "good guys" list is.
Stop using Apple, or Google, or Amazon, or Intel, or Broadcom, or Nvidia then. All have vast hardware development activities in that one country you don't like.
How dare you have a moral objection to buying from a state accused of genocide. Please stick to completely organic complaints about comedy festivals and soccer tournaments.