Specific question about CPU's and multiplication

Luna_blade · 17, May, 2016, 04:26:09 AM

Here I am with a crazy hardware question, but let's get to the point.

Does anybody know a bit about multiplication in a CPU?
I know it is usually adding in essence, but there is a thing.
When you add two binary numbers, the new number always takes up one more space (11 + 11 = 110).
But when multiplying, it's the sum of spaces (11 * 11 = 1001).
So take an 64 bit processor. The two binary numbers it accepts (input) are 64 bits long. But the output is also 64 bits long.
So what happens with the other 64 bits? Or are a 64 bit processors just not even capable of multiplying 2 x 64bits?

Daddy Poi's Oily Gorillas · 17, May, 2016, 06:08:35 AM

I don't really know much about the workings of the mul instruction... But it seems like an expensive instruction... Which is obvious when/if you see it replaced with bit shifts for some numbers in source code, though... Like a *2 or *4 could be converted to <<1 or <<2 respectively...
Division is even more so expensive...

I wonder if a table like the sin/cos table could even be a thing.... Can't say. (Can't see how that would be necessary, though.)

QuoteSo what happens with the other 64 bits? Or are a 64 bit processors just not even capable of multiplying 2 x 64bits?

The Overflow error? I am going to take a guess and say those most significant bits are cut off. (I know this is true with bit shifts.)

Slightly unrelated: And from what I know on the GBA... you can pretty much keep adding 1 to the number and it will increase by 1.. (So like, if you have 0xFFFFFFFF (-1 as signed/4294967295 as unsigned), you add one, it becomes 0x00000000.... And in ARM mode, you can still do some 64-bit operations... even though all registers are 32-bit, the thing is that the result goes in two registers.

Those are on integer datatypes.... I don't really use Decimal (32-bit fixed-point) / Single (32-bit floating-point) / Double (64-bit floating-point) , so can't really help there.

immortaleeb · 17, May, 2016, 10:29:32 AM

If I remember correctly, on a regular old 32-bit x86 processor the result is simply stored in two registers

e.g: mul src would result in edx:eax = src * eax
where edx stores the most significant bits of the 64-bit result and eax stores the least significant bits.
I'm gonna guess that x64 processors do the same thing.

Apparently RISC architectures like ARM do the same thing, except you need to use a special instruction to keep the most significant bits (i.e. the default mul command assumes the result will fit in a 32-bit register).

Luna_blade · 17, May, 2016, 12:45:13 PM

Quote from: Fox on 17, May, 2016, 06:08:35 AM
I wonder if a table like the sin/cos table could even be a thing.... Can't say. (Can't see how that would be necessary, though.)

If the multiplier in a CPU doesn't use adders, it's like a big table.

Quote from: Fox on 17, May, 2016, 06:08:35 AM
Those are on integer datatypes.... I don't really use Decimal (32-bit fixed-point) / Single (32-bit floating-point) / Double (64-bit floating-point) , so can't really help there.

I wasn't talking about specific data types. Just what happens when to 64bit numbers get mulitplied in a CPU.

Quote from: immortaleeb
If I remember correctly, on a regular old 32-bit x86 processor the result is simply stored in two registers

e.g: mul src would result in edx:eax = src * eax
where edx stores the most significant bits of the 64-bit result and eax stores the least significant bits.
I'm gonna guess that x64 processors do the same thing.

Apparently RISC architectures like ARM do the same thing, except you need to use a special instruction to keep the most significant bits (i.e. the default mul command assumes the result will fit in a 32-bit register).

That makes sense.
But how can the cpu store that second part? Is the result stored in the cache and then transported to the ram?

immortaleeb · 17, May, 2016, 04:23:54 PM

Quote from: Luna_blade on 17, May, 2016, 12:45:13 PM
That makes sense.
But how can the cpu store that second part? Is the result stored in the cache and then transported to the ram?

Well, this brings us really deep into how a processor actually works, but I'll try to explain at a higher level what they taught me at school.
Small disclaimer: it's been 5 years since I had the course that taught this stuff and it was given in different language than English, so I might be wrong on some things or in some terminology - please someone correct me when I'm wrong.

Basically the part of a processor that is responsible for a multiplication (or pretty much any other arithmetic operation) has some sort of small memory cell that's part of the actual circuit for that operation.
This memory cell is used to store intermediate results of an operation and most of the time has a size equal to the word-size of the processor (eg 32-bit or 64-bit). In the case of a multiplication or division this could be 2 different cells, or just a single 2-word sized memory cell.
You could look at this as a sort of private cache/register that only that arithmetic operation can use before it passes its result to the rest of the CPU's pipeline.

As an example imagine you would implement a bit-shift circuit that performs bit-shifting one bit at a time, one bit every cpu-cycle.
If you want to bit-shift 3 bits that means you'd need 3 cycles to execute that operation, so you can't finish it in a single operation. That means you need to somewhere store your intermediate results (bit-shift by 1 and bit-shift by 2), hence the memory cell.

Also note that you can't just use the regular old registers to store these temporary results because most processors these day are superscalar out-of-order processors which basically means they execute multiple different instructions at the same time, so you might have multiple operations processing during a same cycle that each have their own intermediate results they need to store. (If you want to know more about this, please use google or take a course, it would be too much for me to explain :P).

Lastly I'd like to point out that final results of simple operations are never stored directly in RAM but in registers (unless you specified a memory indirection, but again this would be given via a register) because registers are memory INSIDE the processor itself and thus the fastest to do operations on.
When talking about processors we're basically talking about operations that take MICROseconds, whilst asking something to RAM would take several MILLIseconds, so lots and lots slower.
That's why we have caches in CPUs (and several layers of them, with L1 the smallest but fastest layer to access, and e.g. L3 the biggest, slowest memory layer). They aren't used to store results from operations, but rather to store copies of data in RAM so we can access that data faster (preferably in the MICROsecond range) than when we would go to RAM.

That last part brings us a bit of topic, but I would like to point it out cause I noticed you mixed up some things in your comment.

Daddy Poi's Oily Gorillas · 17, May, 2016, 06:47:08 PM

This may be trivial, but just in-case (since can't tell if you knew or not.)... The word is "taught", and "E" on English is capitalized. (Languages get capitalized in English unlike some languages (i.e. Spanish does inglés/español/etc.)/subjects do not get capitalized... so "math"/"science"/"history" should be fine, I think.)

QuoteIf the multiplier in a CPU doesn't use adders, it's like a big table.

Well, it could depend on the implementations/tricks... but I haven't really thought of any. (And you would need an address adder to get to the entry in the table, anyway... so...)

QuoteI wasn't talking about specific data types. Just what happens when to 64bit numbers get mulitplied in a CPU.

Didn't think so, but just in case you forgot about those datatypes, I brought them up anyway... or something.

Luna_blade · 18, May, 2016, 05:26:00 AM

Quote from: immortaleeb on 17, May, 2016, 04:23:54 PM
Well, this brings us really deep into how a processor actually works, but I'll try to explain at a higher level what they taught me at school.
Small disclaimer: it's been 5 years since I had the course that taught this stuff and it was given in different language than English, so I might be wrong on some things or in some terminology - please someone correct me when I'm wrong.
... (quote shortened)

Wow that really explains alot. Thanks.
EDIT: Okay, little extra question.
I suppose every arithmetic operation unit has usually only a few of these memory cells? Not >20?

immortaleeb · 19, May, 2016, 04:57:37 AM

Quote from: Luna_blade on 18, May, 2016, 05:26:00 AM
I suppose every arithmetic operation unit has usually only a few of these memory cells? Not >20?

From a purely functional standpoint you'd only need enough memory cells to save your intermediate results.
Since most instructions only use 1, 2 or 3 word-sized inputs and produce 1 or 2 word-sized outputs, 1 to 3 word-sized memory cells should be enough for most instructions.
It wouldn't make any sense to allocate more physical space on the chip to private memory cells you won't ever need during a calculation, you'd better use that for actual registers or to increase cache sizes.

News:

Specific question about CPU's and multiplication