So, clearly FMA gives more accurate results than separate multiply and add instructions. Well, let’s not go that far, but let’s agree that if you need to multiply two numbers and then add a third then FMA is going to be more accurate than the alternatives. And, FMA instructions often have lower latency than a multiply followed by an add instruction. On the Xbox 360 CPU the latency and throughput of FMA were the same as for fmul or fadd, so using an FMA instead of an fmul followed by a dependent fadd would halve the latency.

The Xbox 360 compiler generated FMA instructions all the time, both vector and scalar. We weren’t sure if the x64 CPUs we chose would support these instructions, so emulating them quickly and accurately was going to be crucial. Our emulation of these would have to be perfect because I knew from my previous experience with emulating floating-point math that “pretty close” tended to mean that characters would fall through the floor, cars would bounce out of the world, etc.

So, what does it take to emulate FMA instructions perfectly on an x64 CPU that doesn’t support them? Luckily the vast majority of floating-point math in games is done to float (32-bit) precision, and I was quite happy to use double (64-bit precision) instructions in the emulation of FMA.

Emulating float-precision FMA with double-precision floating-point math seems like it should be easy (narrator’s voice: it isn’t). A float has 24 bits of precision and a double has 53 bits of precision.
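The 24-bit and 53-bit figures suggest why double precision looks attractive for emulating a float FMA: the product of two 24-bit significands needs at most 48 bits, so it fits in a double without rounding. The Python sketch below is only an illustration under my own assumptions (the helper `to_f32` and the operand values are hypothetical, not taken from any real emulator). It checks that a double-precision multiply of two float values is exact, and that the following add is the first place where rounding can appear.

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical operands, all exactly representable as 32-bit floats.
a = to_f32(1.0 + 2.0**-23)
b = to_f32(1.0 + 2.0**-23)
c = to_f32(2.0**-60)

# Multiply in double precision: 24 bits x 24 bits needs at most 48 bits.
product = a * b
product_is_exact = Fraction(product) == Fraction(a) * Fraction(b)

# Add in double precision: the exact sum here needs more than 53 bits.
total = product + c
sum_is_exact = Fraction(total) == Fraction(product) + Fraction(c)

print(product_is_exact, sum_is_exact)
```

With these inputs `total == product`: the addend is rounded away by the double-precision add, before any final rounding to float takes place.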
We were thinking about releasing a new console, and we thought it would be nice if that console could run the games of the previous console.

Emulation is always hard, but it is made more challenging when your corporate masters keep changing CPU types. The Xbox one – sorry, the original Xbox – used an x86 CPU. The Xbox two – sorry, the Xbox 360 – used a PowerPC CPU. The Xbox three – sorry, the Xbox One – used an x86/x64 CPU. These ISA flip-flops did not make life easy.

I made some contributions to the team that taught the Xbox 360 how to emulate a lot of the original Xbox games – emulating x86 on PowerPC – and was given the job title Emulation Ninja for that work*. Then I was asked to help investigate what it would take to emulate the Xbox 360’s PowerPC CPU with an x64 CPU. To set expectations, I’ll mention up front that I didn’t find a satisfactory solution.

One of the things that worried me was fused multiply add, or FMA instructions. These instructions take three input parameters and they multiply the first two, and then add the third. The ‘fused’ part of fused multiply add means that no rounding is done until the end of the operation. That is, the multiply is done to full precision, and then the add is done, and only then is the result rounded to the final answer.

To demonstrate this in a concrete fashion let’s imagine that we are using decimal floating-point numbers with two digits of precision. Take a multiply whose exact product is 2349, followed by an add of 41. With FMA the exact sum is 2390. Rounded to two digits we get 2400 or 2.4e3.

But if we don’t have FMA then we have to do the multiply, get 2349, which would be rounded to two digits of precision and get 2300 (2.3e3). Then we would add 41 to get 2341, which would be rounded again to get a final answer of 2300 (2.3e3), which is less accurate than the FMA answer of 2400.

Aside 1: FMA(a, b, -a*b) calculates the error in a*b, which is kind of cool.

Aside 2: One side effect of aside 1 is that x = a * b - a * b may not return zero if your compiler automatically generates FMA instructions.
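The two-digit decimal example can be replayed with Python's `decimal` module, which lets us set the precision to two significant digits. This is a minimal sketch, not the author's code. The factors 81 and 29 are my assumption: the text gives only the product (2349) and the addend (41), and any pair of factors with that product would behave the same way.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Decimal floating point with two significant digits, as in the text.
two_digits = Context(prec=2, rounding=ROUND_HALF_EVEN)

# Assumed operands: the text states only the product 2349 and the addend 41.
a, b, c = Decimal(81), Decimal(29), Decimal(41)

# Fused: multiply and add exactly, then round once at the end.
fused = two_digits.fma(a, b, c)          # exact 2390, rounded to 2400

# Unfused: round after the multiply, then round again after the add.
product = two_digits.multiply(a, b)      # exact 2349, rounded to 2300
unfused = two_digits.add(product, c)     # exact 2341, rounded to 2300

# Aside 1 in this toy format: FMA(a, b, -product) recovers the
# rounding error of the multiply (2349 - 2300).
rounding_error = two_digits.fma(a, b, -product)

print(fused, unfused, rounding_error)
```

The last line of the computation also illustrates Aside 2: the fused expression a*b - round(a*b) comes out as 49 rather than zero, because the product inside the FMA is never rounded.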
In recent years, the limits of the multicore approach emerged in the so-called “dark silicon” issue and diminishing returns of an ever-increasing core count. Hardware manufacturers, out of necessity, switched their focus to accelerators, a new paradigm that pursues specialization and heterogeneity over generality and homogeneity. They are special-purpose hardware structures separated from the CPU with aspects that exhibit a high degree of variability. We define a taxonomy based on fourteen of these aspects, grouped in four macro-categories: general aspects, host coupling, architecture, and software aspects. According to it, we categorize around 100 accelerators of the last decade from both industry and academia, and critically analyze emerging trends. We complete our discussion with throughput and efficiency figures. Then, we discuss some prominent open challenges that accelerators are facing, analyzing state-of-the-art solutions, and suggesting prospective research directions for the future.

Years ago I worked in the Xbox 360 group at Microsoft.