Efficient C++ Programming for Modern 64-bit CPUs: Chapter 4/part 2

104 points by birdculture 5 days ago|38 comments

•

egl2020 3 days ago

Article title should be "Efficient C++ Programming for Modern 64-bit CPUs...".

•

adrian_b 2 days ago

Many cost relationships from TFA have already been more or less true for the 32-bit CPUs launched after 1990 and they all became true for the 32-bit high-end CPUs launched after 2000 (like Intel Pentium 4 and AMD Athlon XP), when the difference between the CPU clock frequency and the DRAM latency became almost as high as today.

Only for the 32-bit CPUs used in microcontrollers, which may have clock frequencies under 100 MHz and which may lack a cache hierarchy, the cost differences between many kinds of operations may collapse.

For instance, even for not-too-old 32-bit CPUs it is reasonable to classify instructions into the following groups, from cheapest to most expensive in clock cycles:

1. Simple integer operations with operands in registers

2. Loads from the L1 cache memory and simple floating-point operations, like addition and multiplication

3. Loads from the L2 cache memory, division (integer or floating-point), square root and mispredicted branches

4. Loads from the L3 cache memory and atomic read-modify-write operations (like atomic exchange, atomic fetch-and-add, atomic compare-and-swap)

5. Loads from the main memory

This classification matches the chart from TFA.
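A rough way to see the load-cost tiers for yourself is a pointer chase: every load depends on the previous one, so the time per step tends to track the latency of whichever cache level the working set fits in. This is only a minimal sketch and not something from TFA; `make_single_cycle` and `chase` are names made up for illustration, and the timing harness is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random permutation that forms one single cycle (Sattolo's algorithm),
// so following next[i] visits every slot before returning to the start.
std::vector<std::size_t> make_single_cycle(std::size_t n, unsigned seed) {
    assert(n >= 1);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        // Pick j strictly below i; this is what guarantees a single cycle.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Each load needs the result of the previous load, so the loop is bound by
// load latency rather than by load throughput.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t i = 0;
    for (std::size_t s = 0; s < steps; ++s) i = next[i];
    return i;
}
```

Timing `chase` (for example with `std::chrono::steady_clock`) for an `n` that fits in L1, then L2, then L3, then well beyond the last-level cache should show the time per step rising in steps. The exact numbers depend on the machine, and the result of `chase` has to be used somewhere so the compiler does not drop the loop.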

•

spwa4 2 days ago

That's what people don't really understand about CPUs these days. DRAM is stuck on 10nm (and even that was a big effort to move there). The capacitor circuit DRAM uses doesn't work if you reduce the size much more, so it can't be scaled down, and this is not changing. We're pretty much stuck on memory speed almost regardless of chip advances, at least for the individual chips, though we're already using 8, 16 or more chips at the same time. Something like, for your byte: bit 1 -> chip 1, bit 2 -> chip 2, ... So an instantaneous read is not actually reading 8 adjacent memory cells but 1 parallelized read.

•

nwallin 2 days ago

I wonder if/when we'll reach the point that it's cheaper to manufacture SRAM (with 6 transistors per bit, if I recall correctly) than it is to manufacture DRAM (with 1 transistor and 1 capacitor per bit).

The transistors get smaller every year. The capacitors, like you say, don't anymore. At some point those 5 extra transistors will be cheaper than the capacitor, unless Moore's Law well and truly bites it.

•

Nevermark 2 days ago

A CPU implementing C++ as a microarchitecture…? Finally, incontrovertible proof of the prophecy. We really are living in a Cthulhu nightmare.

Simulation theory is dead.

•

sukuva 2 days ago

do we have "modern" 32-bit CPUs?

•

reactordev 2 days ago

Yes. Yes we do. A lot of them.

•

avadodin 2 days ago

That title got me:

Modern C++ CPUs as in LISP CPUs or as in Verilog CPUs?

•

zombot 3 days ago

Came here to say exactly that.

•

froh 2 days ago

what if a language would allow to elegantly pack Optional values?

so the physical layout has a bit vector with one bit for each optional. and a popcnt over that bitvector (masked up to the value we're interested in) will give the actual slot to look into?

would also make sense to reorder / bucket fields by (byte) size

if you want to do that in any low level language (rust, c++) you have to deviate from their standard syntax for optionals, and you have to manually keep track of slot order. but for domains with many optional/default values, this may really reduce cache pressure, no?

In higher level languages you can fake the effect (with flyweight facades), so from python such a packed "dataclass"-like class can look neat and clean. however at the low level there is no abstraction that allows to create your own data layout.

at least I didn't find anything yet.
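To make the layout concrete, here is a minimal hand-rolled sketch in C++ of the bit vector plus popcount idea. Everything in it is hypothetical: `PackedOptionals` is an invented name, it assumes at most 64 fields that all share one type (`std::int32_t`), and it uses `std::bitset<64>::count()` as a portable popcount (C++20 also offers `std::popcount` in `<bit>`).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical record with up to 64 optional int32 fields.
// Bit i of present_ says whether field i has a value; values are stored
// densely, in field order, with no slot reserved for absent fields.
class PackedOptionals {
public:
    std::optional<std::int32_t> get(unsigned field) const {
        assert(field < 64);
        const std::uint64_t bit = std::uint64_t{1} << field;
        if ((present_ & bit) == 0) return std::nullopt;
        return values_[slot(field)];
    }

    void set(unsigned field, std::int32_t v) {
        assert(field < 64);
        const std::uint64_t bit = std::uint64_t{1} << field;
        const std::size_t s = slot(field);
        if (present_ & bit) {
            values_[s] = v;  // overwrite in place
        } else {
            // Insert so that the dense array stays sorted by field index.
            values_.insert(values_.begin() + static_cast<std::ptrdiff_t>(s), v);
            present_ |= bit;
        }
    }

    std::size_t stored() const { return values_.size(); }

private:
    // Number of present fields below `field` == index into the dense array.
    std::size_t slot(unsigned field) const {
        const std::uint64_t below = (std::uint64_t{1} << field) - 1;
        return std::bitset<64>(present_ & below).count();
    }

    std::uint64_t present_ = 0;
    std::vector<std::int32_t> values_;
};
```

With fields 2, 5 and 40 set, the object stores three values plus one 64-bit mask, and `get(5)` finds its value at dense index 1. A real version would want a fixed-capacity inline array rather than a `std::vector` to actually win on cache pressure, and mixed field sizes would need the bucketing by byte size mentioned above.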

•

gpderetta 2 days ago

That's basically an AoS/SoA transformation (then packing the boolean-valued array). You can have extremely cheap proxies in C++ so it is not much of an issue. The problem is that this proxy-optional wouldn't immediately interoperate with std::optional, but could interoperate with any generic code taking an optional-like value.

•

gpderetta 12 hours ago

froh: your reply was flagged for some reason! I tried to vouch it but it wasn't enough.

•

froh 9 hours ago

thank you

•

owlbite 2 days ago

I'm somewhat dubious about anything talking about low level performance programming at the instruction level that doesn't distinguish between latency and throughput, never mind mention the incredibly out-of-order nature of modern desktop/server class CPU cores.

•

wmu 2 hours ago

That's a very important point. For instance, on Intel's CPUs multiplication is pipelined: its latency is 3 cycles, but its throughput is 1 per cycle. Thus completing N independent multiplications takes 2 + N cycles (in the best case), not 3 * N.
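The arithmetic can be written down as a tiny idealized model, next to the kind of code where it shows up. This is a sketch under stated assumptions: one fully pipelined unit that accepts a new operation every cycle, and function names (`pipelined_cycles`, `dependent_cycles`, `product_chain`, `product_split`) that are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Idealized model: n independent operations on one fully pipelined unit
// finish after latency + (n - 1) cycles.
constexpr std::uint64_t pipelined_cycles(std::uint64_t n, std::uint64_t latency) {
    return n == 0 ? 0 : latency + (n - 1);
}

// If each operation needs the previous result, nothing overlaps.
constexpr std::uint64_t dependent_cycles(std::uint64_t n, std::uint64_t latency) {
    return n * latency;
}

// One accumulator: every multiply waits for the previous one (a dependency chain).
std::uint64_t product_chain(const std::vector<std::uint64_t>& v) {
    std::uint64_t acc = 1;
    for (std::uint64_t x : v) acc *= x;
    return acc;
}

// Four accumulators: four independent chains that the core is free to overlap.
// Unsigned multiplication wraps modulo 2^64, so the result is identical.
std::uint64_t product_split(const std::vector<std::uint64_t>& v) {
    std::uint64_t a = 1, b = 1, c = 1, d = 1;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        a *= v[i];
        b *= v[i + 1];
        c *= v[i + 2];
        d *= v[i + 3];
    }
    for (; i < v.size(); ++i) a *= v[i];  // leftover elements
    return a * b * c * d;
}
```

With a latency of 3, the model gives 12 cycles for 10 independent multiplications and 30 for 10 dependent ones. Whether `product_split` is actually faster than `product_chain` on a given machine depends on the compiler and the core, so it needs measuring.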

•

Blackthorn 2 days ago

Virtual functions cost a lot less here than I expected.

•

menaerus 2 days ago

I don't think you're wrong. A virtual function call is a two-pointer dereference (load the vptr from the object, then load the function pointer from the vtable), and there we can have a d-cache miss, but the main cost of using virtual functions is the increased likelihood of an i-cache miss. The cost of 30-60 cycles given in the article assumes an i-cache hit, and since a virtual call is an indirect call (jump), it also depends heavily on the branch-target predictor and its buffer. I can easily imagine that iterating over a heterogeneous collection of objects would incur a much larger cost than ~50 cycles/iteration. A branch-target misprediction flushes the whole pipeline (15-20 cycles), and an i-cache miss can easily end up being a fetch from main memory (200-300 cycles).
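For the heterogeneous-collection case, one commonly used mitigation (not something the article proposes) is to group the objects by dynamic type before iterating, so that consecutive indirect calls hit the same target. A minimal sketch, with a hypothetical `Shape` hierarchy and helper names invented for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <vector>

struct Shape {
    virtual ~Shape() = default;
    virtual long area() const = 0;
};
struct Square : Shape {
    long side;
    explicit Square(long s) : side(s) {}
    long area() const override { return side * side; }
};
struct Rect : Shape {
    long w, h;
    Rect(long w_, long h_) : w(w_), h(h_) {}
    long area() const override { return w * h; }
};

// One indirect call per element; with mixed types the call target keeps changing.
long total_area(const std::vector<std::unique_ptr<Shape>>& shapes) {
    long sum = 0;
    for (const auto& s : shapes) sum += s->area();
    return sum;
}

// Reorder so that objects of the same dynamic type sit next to each other.
// The order between types is implementation-defined; only the grouping matters.
void group_by_type(std::vector<std::unique_ptr<Shape>>& shapes) {
    std::stable_sort(shapes.begin(), shapes.end(),
        [](const std::unique_ptr<Shape>& a, const std::unique_ptr<Shape>& b) {
            const Shape& ra = *a;
            const Shape& rb = *b;
            return std::type_index(typeid(ra)) < std::type_index(typeid(rb));
        });
}
```

The sum is the same before and after grouping; only the sequence of call targets changes. How much this helps depends on the predictor, the number of distinct types and whether the iteration order matters to the program, so treat it as something to benchmark rather than a guaranteed win.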

The article in general is interesting since it gives a rough idea of the cost of operations relative to one another, but since CPUs are much more complex beasts it also gives us an incomplete picture. If you're unaware of that, chances are you will derive incomplete conclusions from it. Understanding the performance implications of software running on actual hardware is much more involved than what one article can fit.

•

notorandit 2 days ago

C++ CPUs?

•

brcmthrowaway 2 days ago

If you read this, you should be able to get a job at Jane Street

•

zombot 3 days ago

This looks like something that every serious C++ programmer should be reading.

•

rramadass 2 days ago

See also a 3-part article, "Advanced C++ Optimization Techniques for High-Performance Applications", here: https://news.ycombinator.com/item?id=48265690