Linux eliminates the strncpy API after six years of work, 360 patches

298 points by simonpure 4 days ago|322 comments

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

•

sirwhinesalot 4 days ago

I have in the past made fun of the Linux kernel devs, supposedly some of the best C developers in the world, for not knowing how to make stringbuffer and stringview types, but to be fair to them we didn't have the consensus we have today on the topic.

You know who did have the right idea though? Dennis Ritchie, who proposed a fat pointer type for C all the way back in 1990. Would have made for a perfect addition to C99. Imagine how different the world might have been had the committee added that in.

We had a second chance with the release of the "C's greatest mistake" blog article from Walter Bright in 2007, essentially pushing for the same idea as Ritchie (slices/stringviews) but explained with much clearer language.

Alas, didn't make it to C11.

We're now in C23, still nothing. But we did get _Generic and VLAs! Party hard.

•

ethbr1 3 days ago

> "C's greatest mistake" blog article from Walter Bright in 2007

https://digitalmars.com/articles/C-biggest-mistake.html

And because it came up in my search and the bikeshedding discussion made me chuckle, reddit on same: https://www.reddit.com/r/C_Programming/comments/90uq7c/cs_bi...

Am curious about this esoterica, if anyone can confirm/deny:

>> Speaking of [C] arrays decaying into pointers, does anyone know why this behaviour was designed in the first place?

>> It was so that B code could be compiled as C with minimal changes. The designer felt that this would encourage people to switch from B to C. In B an array declaration actually defined a pointer and an array, with the pointer initialized to point to the array's first element.

•

anal_reactor 4 days ago

> but to be fair to them we didn't have the consensus we have today on the topic.

This is my pet peeve of teamwork. We can choose solutions A, B or C. Each has upsides and downsides. We debate for two weeks, then we choose nothing.

•

david-gpu 3 days ago

Isn't that due to a lack of leadership, rather than a problem with teamwork itself? Somebody has to be ultimately in charge and willing to put a stop to endless debate.

•

gbin 3 days ago

It is more like a "design by committee" issue to me. The decisional structure needs to be built for some opinionated decision. When you have a committee there is not one clear thing they are solving for as everyone has an agenda to tug the language towards their own interests. The result of that can certainly be inaction because it is easier to say no than yes.

•

flohofwoe 4 days ago

> But we did get _Generic and VLAs! Party hard.

VLA has been demoted to an optional feature in C11 (good).

IMHO the current main problem is that the C stdlib is stuck in the K&R era and the stdlib APIs haven't even been updated to the language features added in C99 (e.g. make use of struct args and return values). A range struct (ptr/size pair) in the stdlib and new or updated string functions to use such ranges would already go a long way.

•

VorpalWay 4 days ago

> IMHO the current main problem is that the C stdlib is stuck in the K&R era and the stdlib APIs haven't even been updated to the language features added in C99 (e.g. make use of struct args and return values).

C++ has the same issue (only with more chaos and bloat). They add some new good idea (like optional) but don't update the rest of the standard library to make use of it. And they can't really, without breaking backwards compatibility.

I think looking at the edition system in Rust could be useful for C and C++ to start to solve this. Something like "If this source file has this pragma in it, compile that code with a new edition". It would have to be granular, per expression really, to handle macros (which is how it works in Rust too). What would it change in C/C++? Name resolution, you could get a different set of resolvable overloads, depending on which edition is active in the caller context. Not unlike an enable_if.

The same would work in C: depending on the caller edition, expose function signatures.

•

steveklabnik 3 days ago

Someone tried to propose this in C++ a few years ago, but it never made it in. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p18...

•

fulafel 3 days ago

Link to Ritchie's proposal: https://web.archive.org/web/20150611114358/https://www.bell-...

•

pjmlp 4 days ago

That only goes to show where WG14 priorities are.

•

HackerThemAll 2 days ago

I have used the DJB's stralloc library a lot. Not a single vulnerability, binary string safe. Good enough for 99% of applications.

https://cr.yp.to/lib/stralloc.html

•

WalterBright 4 days ago

"The strncpy function within the Linux kernel has been a "persistent source of bugs" for years due to counter-intuitive semantics and behavior around NUL termination along with performance issues due to redundant zero-filling of the destination."

Huh. Whenever I've been asked to review C code, I always looked for strncpy and always found a bug with it.
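For anyone who hasn't been bitten yet, here is a minimal sketch of the two behaviours the quote describes, using only standard C. The helper names (`is_terminated`, `copy_long_source_is_terminated`, `copy_short_source_zero_bytes`) and the buffer sizes are made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative helper: true if buf[0..size) contains a NUL byte. */
static bool is_terminated(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Pitfall 1: the source is at least as long as n, so strncpy
 * fills the buffer and writes no terminator at all. */
static bool copy_long_source_is_terminated(void)
{
    char dst[4];
    memset(dst, 'x', sizeof dst);
    strncpy(dst, "abcdef", sizeof dst);   /* dst = {'a','b','c','d'} */
    return is_terminated(dst, sizeof dst);
}

/* Pitfall 2: the source is short, so strncpy zero-fills the whole
 * remainder of the destination, however large it is. */
static size_t copy_short_source_zero_bytes(void)
{
    char dst[64];
    size_t zeros = 0;

    memset(dst, 'x', sizeof dst);
    strncpy(dst, "hi", sizeof dst);
    for (size_t i = 0; i < sizeof dst; i++)
        zeros += (dst[i] == '\0');
    return zeros;                          /* 2 text bytes, 62 zero bytes */
}
```

The first case is the classic bug: any later `strlen` or `printf("%s")` on `dst` reads past the end of the buffer. The second is the performance complaint.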

•

rswail 4 days ago

Things that have bugged me for 40 years...

* NUL terminated strings (and now, non UTF-8 encoded strings on input/output)

* Using LF or CR or CRLF as line terminators, and pipe/comma-delimited fields when there were other unambiguous ASCII characters that could have been used (eg, GS, FS, RS) that would have made the encoding/decoding of line termination an I/O thing keeping HT/VT/CR/LF/FF as literally print related codes.

•

EvanAnderson 4 days ago

I did a project to translate data framed in the ASCII field/record separator characters and it was gloriously easy. All the ugly escaping considerations with comma-delimited data went away and it became much easier.

•

ptx 3 days ago

What happens when the data contains the record or field separator characters?

I suppose you could document that it's unsupported, and just drop or reject such values, but then the system couldn't be used to handle test data for such systems, for example.

•

EvanAnderson 3 days ago

In the case of this system (a quasi-EDI interface used to move records from a fleet fueling point-of-sale system to the ERP software) those characters were forbidden by the source application. My code would have exploded in a fireball if they had been present, but the specification said they couldn't be.

•

rswail 3 days ago

If it's purely binary data, then you can't.

Otherwise you need to have some sort of escape mechanism, exactly like quoting strings in CSV. In fact, there's an ASCII code "ESC" for entirely that purpose. :)

The problem is that those characters are non-printable, which means if you're just dumping the file out somewhere, you can't see them.

•

a96 3 days ago

Same as any separator. Either it's not in the acceptable set of non-separator input or there's an escape (that can also escape itself for the literal).
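A rough sketch of what such an escape scheme could look like in C, assuming ESC (0x1B) is the escape byte and RS/US are the separators. The function names and the choice of which bytes get escaped are assumptions for illustration, not any standard framing.

```c
#include <stddef.h>

/* ASCII control codes used by this hypothetical framing. */
enum { ASCII_ESC = 0x1B, ASCII_RS = 0x1E, ASCII_US = 0x1F };

static int needs_escape(unsigned char c)
{
    return c == ASCII_ESC || c == ASCII_RS || c == ASCII_US;
}

/* Copies src[0..n) into dst, prefixing each separator or ESC with ESC.
 * Returns the number of bytes written, or (size_t)-1 if dst is too small. */
static size_t escape_field(const unsigned char *src, size_t n,
                           unsigned char *dst, size_t cap)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        size_t need = needs_escape(src[i]) ? 2 : 1;
        if (cap - out < need)
            return (size_t)-1;
        if (need == 2)
            dst[out++] = ASCII_ESC;
        dst[out++] = src[i];
    }
    return out;
}

/* Reverses escape_field. Returns the decoded length, or (size_t)-1 on a
 * dangling ESC at the end of input or if dst is too small. */
static size_t unescape_field(const unsigned char *src, size_t n,
                             unsigned char *dst, size_t cap)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c == ASCII_ESC) {
            if (++i == n)
                return (size_t)-1;
            c = src[i];               /* take the next byte literally */
        }
        if (out == cap)
            return (size_t)-1;
        dst[out++] = c;
    }
    return out;
}
```

A reader splitting records then treats only an unescaped RS or US as a boundary.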

•

bobmcnamara 3 days ago

Easy - don't

•

brewmarche 3 days ago

Now with Unicode we actually have even more:

NL Next line (from EBCDIC?)

LS Line separator (invented by Unicode)

PS Paragraph separator (same)

The Unicode standard says that in addition to CR, LF, CRLF and the above, vertical tabs and form feeds should also be treated as line separators.

•

flohofwoe 4 days ago

> non UTF-8 encoded strings on input/output

UTF-8 on stdin/stdout works perfectly fine (unless you are on Windows of course, which is stuck in the early 90s when it comes to international text encoding).

> Using LF or CR or CRLF as line terminators

This is also an operating system convention, and it would be better if programming languages wouldn't try to "guess" the correct line endings, since this causes more problems than it solves - but again, this is mostly a Windows specific problem, and it's Microsoft's job to finally bring Windows into the current century.

•

rswail 4 days ago

No, it was an Apple, Unix, and Microsoft problem.

Unix used LF, Apple used CR, Microsoft used CRLF.

They are all ASCII carriage movement codes, which is about driving the paper feed and print head of an ASR-33 or equivalent.

So they all made the "wrong" decision about what to store in a file.

They just chose different wrong characters.

•

kps 3 days ago

> They just chose different wrong characters.

Unix followed Multics. Multics chose right. ASCII/ECMA-6/ISO646 drafts discussed this at least as early as 1963¹: “For equipment which uses a single combination (called New Line) [...] NL will be coded at FE₂ [Field Effector 2 = 0x0A].”

¹ doi/10.1093/comjnl/7.3.197

•

rswail 3 days ago

For an OS that was being created specifically to process text, having the equivalent of CR being separate to LF to allow for overprinting would/should have been a requirement.

I'd say Multics/Unix was technically correct, except this was still the wrong decision for I/O ever since.

The Record Separator is the logical character code to use to indicate the end of a line of text and print position characters, assuming that a line of text is a "record".

•

flohofwoe 4 days ago

> Apple used CR

Apple hasn't been using CR since the release of OSX (26 years ago). Microsoft could have made the switch at any time too (just as they could have switched to UTF-8 as universal text encoding on Windows), they just choose not to.

In the end it's not the job of programming languages to clean up Microsoft's mess ;)

•

rswail 3 days ago

We're literally talking about two decades before that.

•

bobmcnamara 3 days ago

The switch sure sucked though. I doubt Microsoft would risk their reputation for backwards compatibility.

•

parineum 3 days ago

> In the end it's not the job of programming languages to clean up Microsoft's mess ;)

Why is it Microsoft's fault? They just stayed on their legacy implementation, Linux and Apple chose to move from the legacy implementation to another legacy implementation. That seems dumb.

•

canucker2016 3 days ago

I think PCDOS/MSDOS copied CP/M's use of CRLF for line separator.

Some believe Gary Kildall picked CRLF for CP/M since he used DEC TOPS-10 to develop CP/M. see https://www.quora.com/Why-did-CP-M-stick-with-the-CR-LF-stan...

•

Parodper 3 days ago

UNIX's LF precedes them by at least half a decade, probably more.

•

bobmcnamara 3 days ago

CRLF is Baudot, predating UNIX by what, a century?

•

Parodper 3 days ago

Apparently it's from 1901 (Murray code) or 1932 (ITA2).

The fact that both Apple's and CP/M codes came out roughly at the same time, both on microcomputers, shows that it was probably just a design decision.

•

JdeBP 3 days ago

rswail said ASCII, which definitely pre-dated Unix, not the other way around. And there was some to and fro about the equivalence of LineFeed and NewLine in the 1960s.

•

skywhopper 3 days ago

What programming languages try to guess line endings? Or are even aware of them?

•

flohofwoe 3 days ago

Ok, technically not the programming languages, but their stdlibs. On MSVC at least, opening a file in text mode via fopen will translate CRLF into LF on read, and LF into CRLF on write, which has been a neverending source of confusion since at least the 1990s.

•

Parodper 3 days ago

LF makes the most sense, but they're all fine for text files. The issue is that CSV isn't text.

Last time I had to handle CSV files in bash, I converted them internally to RS and FS.

•

Ekaros 3 days ago

Line feed resetting position really makes no sense. It should just continue text from where the cursor was but on next line. Like staircase. You need CR to go back to start.

•

a96 3 days ago

It makes perfect sense when you consider text files. When line ends, next line obviously starts from column zero.

CR is the only wrong choice. There's never a reason to go to start of line without erasing the line or moving to next line in a file. And even user interfaces will have smarter ways to do that. It's a completely useless concept outside of typewriters.

Well, CRLF (or worse, LFCR) is also obviously a wrong choice because it's pointless to demand two characters and create problems when one of them is missing when one totally unambiguous character will do.

•

Parodper 3 days ago

Yes, if you're talking to a terminal. But an on-disk file doesn't have a carriage to return.

•

ryandrake 3 days ago

Modern computer text output devices don’t have a “carriage” or a “feed” mechanism. I’d argue both CR and LF are legacy, anachronistic characters whose purpose was too device specific to make sense as a text encoding.

•

kps 3 days ago

Sure, there's a text encoding part and an equipment control part that puts the CII in ASCII.

•

codedokode 3 days ago

> non UTF-8 encoded strings on input/output

I would just use UTF-8 everywhere.

•

rswail 3 days ago

Storing them as 32 bits wide in memory means you can at least index by a codepoint (if not a glyph).

•

codedokode 3 days ago

I think you rarely need it. May I know what is your usecase that you need this often?

•

lambdaone 4 days ago

This sort of boring grind is where the real work of systems engineering is done. Big infrastructure projects like this one, which make the Linux kernel more reliable while keeping it workable throughout the process, move on the scale of decades, not months.

•

appplication 4 days ago

On one hand I understand why it’s decade scale (the long tail of users/dependencies is really, really long) but on the other hand it doesn’t feel like a tenable pace at which we can make meaningful long term progress. Less of a gripe and I guess more a paradox of critical infrastructure.

•

thiht 3 days ago

> In place of strncpy, Linux kernel code should use strscpy() for NUL-terminated destinations, strscpy_pad() for NUL-terminated destinations with zero-padding, strtomem_pad() for non-NUL-terminated fixed-width fields, memcpy_and_pad() for bounded copies with explicit padding, or memcpy() for known-length memory copies

What a nightmare, does it have to be so convoluted?
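To make the first two entries of that list concrete, here is a userspace sketch of what a strscpy()-style and strscpy_pad()-style copy do. This is an approximation written for illustration, not the kernel's implementation: the names `sketch_strscpy` and `sketch_strscpy_pad` are made up, and it returns -1 on truncation where the kernel functions return -E2BIG.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for strscpy(): copies at most size-1 bytes and
 * always terminates dst when size > 0. Returns the number of bytes
 * copied (excluding the NUL), or -1 if src did not fit or size == 0. */
static long sketch_strscpy(char *dst, const char *src, size_t size)
{
    size_t i = 0;

    if (size == 0)
        return -1;
    while (i < size - 1 && src[i] != '\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = '\0';
    /* src[i] is still inside the source string here: no NUL seen yet. */
    return src[i] == '\0' ? (long)i : -1;
}

/* Hypothetical stand-in for strscpy_pad(): same contract, but the unused
 * tail of dst is zeroed so no stale bytes are left behind. */
static long sketch_strscpy_pad(char *dst, const char *src, size_t size)
{
    long n = sketch_strscpy(dst, src, size);

    if (n >= 0)
        memset(dst + n, 0, size - (size_t)n);
    return n;
}
```

Unlike strncpy, the caller can tell from the return value whether truncation happened, and the zero-fill only occurs when it is explicitly asked for.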

•

MarkMarine 3 days ago

Performance. A safe Swiss Army knife function that did most of this would be slow because of the internal branching you’d need to be safe, and because there is developer intent in the selection of these functions. I’d rather have the choice and clear dev intent when I see the function used when reading code.

•

bobmcnamara 3 days ago

> does it have to be so convoluted?

Getting strncpy right always has been

•

hahn-kev 3 days ago

Couldn't they at least give them better names?

•

twothreeone 4 days ago

wow, very humbling. I'm actually amazed how many people contributed to this. It's easy to get attribution for "cool new features", but arguably removing bad features is even more important for something as fundamental as the kernel. Kudos!

I'm sure these are the sorts of things that will go down as folklore from the "founding ages", when everyone will have forgotten how to understand source code in 50 years and the Claude/Codex cruft just silently keeps piling on and burning the majority of our planet's energy.

•

skywal_l 4 days ago

Reminds me of A Deepness in the Sky (Vernor Vinge) where a guy maintains a ship by doing software archeology. He is the only guy who knows what the Unix epoch is.

•

zaik 3 days ago

> everyone will have forgotten how to understand source code in 50 years

I don't think this will happen. Human desire to understand how things work will still be around in 50 years.

•

nuc1e0n 4 days ago

I'm of the opinion AI slop code will become untenable way before that.

•

bigstrat2003 3 days ago

I'm of the opinion it already is. But it seems that many people have yet to reach the point of being fed up with it.

•

mrlonglong 4 days ago

The zero-terminated string is, I think, computing's biggest mistake. Pascal-style strings were much safer.

•

bsder 4 days ago

Zero terminated string is a special case of sentinel value termination.

And sentinel value terminations make a lot of sense when you have punch cards and fixed length records that you need to carve into pieces.

Nobody expected any decisions they were making in the 1960s and 1970s to have any bearing on computing a half-century later. They all expected to have their mistakes long papered over by smarter people at some point.

But we ALL make the mistake of underestimating inertia.

•

layer8 4 days ago

There is a middle ground that Visual Basic (and then COM) took, with the BSTR type: It’s still a pointer to a zero-terminated char array, but there is a length field immediately preceding the first pointed-to byte. This is still compatible with a C string (assuming no embedded null characters), but BSTR-typed functions can take advantage of the length value.
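A minimal C sketch of that layout, under some loud assumptions: this is not the real BSTR API (which uses 16-bit characters and its own allocator), and `struct prefixed`, `prefixed_new`, `prefixed_len` and `prefixed_free` are names invented for this example. It only shows the idea of a length word sitting immediately before the pointed-to bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header: the length lives just before the character data. */
struct prefixed {
    uint32_t len;     /* byte count, excluding the terminator */
    char     data[];  /* len bytes followed by '\0' */
};

/* Allocates header + text + NUL and returns a pointer to the text,
 * so the result can be passed to anything expecting a C string. */
static char *prefixed_new(const char *src, size_t len)
{
    struct prefixed *p;

    if (len > UINT32_MAX - sizeof(struct prefixed) - 1)
        return NULL;
    p = malloc(sizeof *p + len + 1);
    if (p == NULL)
        return NULL;
    p->len = (uint32_t)len;
    memcpy(p->data, src, len);
    p->data[len] = '\0';
    return p->data;
}

/* O(1) length: step back from the text pointer to the header. */
static size_t prefixed_len(const char *s)
{
    const struct prefixed *p =
        (const struct prefixed *)(const void *)(s - offsetof(struct prefixed, data));
    return p->len;
}

static void prefixed_free(char *s)
{
    if (s != NULL)
        free(s - offsetof(struct prefixed, data));
}
```

Length-aware code calls `prefixed_len(s)`; legacy code just sees a `char *`. The two only agree when the text has no embedded NUL bytes.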

•

tremon 3 days ago

> This is still compatible with a C string

Strictly speaking, it's not alignment compatible from CString to BSTR unless you declare all strings to be at most 255 characters or the cpu architecture doesn't require aligned access for multi-byte words (like x86). The BSTR alignment must match the alignment of the length word, meaning you can't convert a randomly-aligned C string to BSTR by simply attaching a prefix in-place.

Also, having the length embedded in the value rather than in the pointer makes it impossible to create BSTR (sub)slices without performing a memcpy. Fat pointers do not have this restriction.

•

layer8 3 days ago

A BSTR object is compatible with functions expecting a C string. The other direction obviously never holds, unless the C string is a BSTR to start with.

Yes, there is a trade-off between slices using the same format and having compatibility with C strings. Hence “middle ground”.

You can still use a string-slice type on top of BSTR, it just would be a separate additional type. Note that languages like Java also don’t have a singular type for strings and string slices.

•

jiggawatts 3 days ago

These are great for "data smuggling" attacks where one layer of code assumes the length is 'x' and another layer assumes it is 'y'.

It makes hybrids like this very dangerous for anything even remotely security-adjacent, such as roles, tokens, etc.

This kind of thing caused the CVE-2009-2408 and CVE-2009-2510 "Null Truncation in X.509 Common Name Vulnerability."
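A toy C illustration of that mismatch (not code from either affected library; `struct counted` and both function names are invented for the example). One layer trusts the stored length, the other stops at the first NUL:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A counted name, as a length-aware parser might hand it over.
 * Assumption for this sketch: bytes is also NUL-terminated somewhere. */
struct counted {
    const char *bytes;
    size_t      len;
};

/* Flawed check: strcmp stops at the first NUL and ignores name.len. */
static bool matches_cstring(struct counted name, const char *expected)
{
    return strcmp(name.bytes, expected) == 0;
}

/* Length-aware check: every byte counts, including embedded NULs. */
static bool matches_counted(struct counted name, const char *expected)
{
    size_t n = strlen(expected);

    return name.len == n && memcmp(name.bytes, expected, n) == 0;
}
```

With a name whose bytes are `"bank.example\0.attacker.example"`, the first function reports a match for `"bank.example"` and the second does not.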

•

badsectoracula 3 days ago

FWIW pretty much all Pascals since the 90s use a similar approach.

In Free Pascal for instance, strings are pointers to the first character with a header in a "negative address" containing information about the length, reference count (strings are reference counted and use copy-on-write to avoid passing around copies all the time) and codepage (FP can convert strings between different encodings "transparently").

•

pjmlp 3 days ago

Nowadays replaced with HSTRING.

•

BobbyTables2 4 days ago

Partly agree but there would have been squabbling on the data type of the size, unless it was variable length. The latter would have had other issues too.

For a while, 16bit would probably have seemed too extravagant. Now 32bit would probably seem too small.

For a “strongly typed” language, C is pretty damn loose where it would have mattered.

•

DarkUranium 4 days ago

I like the D approach where arrays are just `struct { size_t length; T* ptr; }` internally --- and strings are just arrays of `immutable(char)`.

It has a big advantage over the Pascal approach in that you can do zero-copy slicing, since the length is separate from the actual data.

And `size_t` makes perfect sense for the length here. If your strings are longer than the address space (which `size_t` technically isn't, but is practically very strongly correlated to it), then you're going to have a problem regardless of the number of bits for the length anyway.
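The same idea can be sketched in plain C. This mirrors the `{ length, ptr }` layout described above but is not D's actual ABI, and `struct str_slice` plus the three helpers are hypothetical names:

```c
#include <stddef.h>
#include <string.h>

/* Length and pointer travel together; the bytes are not owned. */
struct str_slice {
    size_t      length;
    const char *ptr;
};

static struct str_slice slice_from_cstr(const char *s)
{
    return (struct str_slice){ strlen(s), s };
}

/* Returns s[lo..hi) without copying; out-of-range bounds are clamped. */
static struct str_slice slice_sub(struct str_slice s, size_t lo, size_t hi)
{
    if (hi > s.length)
        hi = s.length;
    if (lo > hi)
        lo = hi;
    return (struct str_slice){ hi - lo, s.ptr + lo };
}

static int slice_equals(struct str_slice a, struct str_slice b)
{
    return a.length == b.length && memcmp(a.ptr, b.ptr, a.length) == 0;
}
```

Taking a substring is just pointer arithmetic on the same buffer, which a length prefix stored in front of the data cannot offer. The cost is that a slice is generally not NUL-terminated, so it can't be handed to `printf("%s")` and friends without a copy or a `%.*s`.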

•

astrobe_ 4 days ago

This only makes a difference in terms of memory size, not in terms of speed, because for decades processors and compilers have been optimized for moving bytes around.

But one would note that in order to save memory in this particular case of slicing, one introduces 2 extra words (size and pointer) in every other case. Like perhaps the second most common string operation, concatenation. In those other cases, the benefit is slightly negative.

I've had extensive experience with "counted strings" because I implemented a bunch of Forth interpreters, which also use this scheme. Including the common trick of using counted and zero-terminated strings, which is the worst of both worlds in the end. Forth is the kind of language that quickly shows you how bad your choices are.

I eventually dropped all that and adopted ASCIIZ strings because they are generally more efficient (if you pay attention to the strlen() performance pitfalls) and having a dead simple interface with the rest of the world (OS, libraries) is more valuable.

•

darkr 3 days ago

C is a weakly typed language. It’s statically typed, which is a different thing

•

poly2it 4 days ago

No, there would not have been and this is most likely not the reason. size_t exists for precisely this use case. It has existed since C89.

•

zeroonetwothree 4 days ago

C is not really strongly typed.

•

tialaramex 4 days ago

"Strongly typed but weakly checked"

It turns out that the machine is much better at the sort of boring mechanical tasks where thoroughness counts and imagination doesn't and so languages which do more, and more, and more checking pay off very well. Rust's borrowck is the obvious first thought today but say WUFFS will check that you've proved certain key properties, WUFFS doesn't need to insert runtime bounds checks for example because you've proved, before the code would compile, that you don't have any bounds misses. You might have proved it by writing bounds checks yourself of course, or likely you have an inherent mathematical rationale for why your algorithm has no misses, but either way the compiler checked your work.

•

imtringued 4 days ago

This is something that has irritated me for a long time.

Bounds checks and sized arrays and strings are mechanically very easy to perform by a machine. These are highly automated tasks.

There are some extreme cases where they ruin performance, but in the vast majority of cases they don't matter.

If you look at the type of tasks that cannot be automated, if going from no to full automation required an efficiency loss of 5%, most people would see taking the hit as an obvious choice.

And this is where the problem becomes recursive. You can build a language where the runtime check becomes a compile time check.

We ought to abandon the C paradigm of shifting all the work to the developer and shift more work to the machine.

•

pitched 3 days ago

Anyone willing to accept an efficiency loss is already using an interpreted language and doesn’t have this class of problems.

•

tialaramex 3 days ago

> Anyone willing to accept an efficiency loss is already using an interpreted language and doesn’t have this class of problems.

Linux already takes several of these "efficiency loss" choices in C. The insistence that "actually I never make this mistake" has to be the surest sign that you're not talking to a real engineer across our whole industry. I associate it most with Bjarne Stroustrup, a man who has written a lot of books and papers but no notable software since his "cfront" C++ transpiler decades ago.

And besides all that, WUFFS isn't even taking an "efficiency loss" - remember it isn't emitting bounds checks it just checks that you proved you don't have bounds misses.

•

bigstrat2003 3 days ago

I don't think 32 bits for the size would be too small. That would max out at a 4GB string, which is large enough that even today it's a big red flag saying "what are you doing bro, reconsider your approach". I can conceive of a string larger than 4GB, but I can't conceive of a situation where it's reasonable to use one.

•

smackeyacky 4 days ago

Zero terminated strings were the basis for an awful lot of useful software. Calling them the biggest mistake in computing is a bit OTT.

I haven’t programmed anything Pascal related for 30+ years but I dimly remember thinking at the time that I wished the string system wasn’t so hard to use.

•

asdfasgasdgasdg 4 days ago

That useful software would not have been less useful if the strings in it were represented as size + buf.

•

crackez 3 days ago

Oh really? Have you tried to rewrite anything to put your theory to the test? I don't think it's as straight forward as you think it is...

•

smackeyacky 3 days ago

Exactly. The Pascal I used had no way to dynamically allocate a string; they were all fixed at compile time. That really sucked.

•

ComputerGuru 4 days ago

That argument isn’t valid. The argument would be “this string design enabled a whole lot of useful software” but that’s a different matter. (And it could very well be the case.)

•

zzrrt 4 days ago

Lead was the basis for an awful lot of useful gasoline. Doesn't mean it was the only solution or the best one.

•

JdeBP 4 days ago

A more accurate re-phrased version of the original is that they are the biggest mistake in the C language.

* https://news.ycombinator.com/item?id=48614913

* https://news.ycombinator.com/item?id=24454369

* https://news.ycombinator.com/item?id=1014533

•

saagarjha 3 days ago

zero terminated strings may cause a lot of bugs, but it also ends up in a lot of useful software, so, it;s impossible to say if its bad or not,

•

dmazzoni 4 days ago

255 characters ought to be enough for everybody, right?

•

RetroTechie 3 days ago

You mean bytes.. we have multi-byte characters now.

•

Conscat 4 days ago

Clang and GCC both let you use Pascal strings in C if you would like (with `\p`). But Pascal strings aren't that useful today because the maximum length is too short.

•

jxbdbd 4 days ago

Why would a pascal string be any shorter than a C string?

A C string is one pointer reaching all of memory, a Pascal string is two pointers reaching all of memory

•

bc_programming 4 days ago

A pascal string is a single byte with the length, followed by the data.

Some implementations use more bytes for the length data, such as Delphi which changed over to a 4 byte prefix length, though those aren't technically Pascal strings anymore. I can't find anything about a Pascal string being two pointers?
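For illustration, the classic single-length-byte layout can be sketched in C like this. The `pstring` typedef and helper names are made up, and this models only the 255-byte "short string" variant, not Delphi's or Free Pascal's long strings:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Byte 0 holds the length, bytes 1..255 hold the text. No terminator. */
typedef unsigned char pstring[256];

/* Stores src in dst; refuses anything a single length byte can't describe. */
static bool pstring_set(unsigned char *dst, const char *src)
{
    size_t n = strlen(src);

    if (n > 255)
        return false;
    dst[0] = (unsigned char)n;
    memcpy(dst + 1, src, n);
    return true;
}

/* O(1): the length is simply the first byte. */
static size_t pstring_len(const unsigned char *s)
{
    return s[0];
}
```

The hard 255-byte ceiling falls straight out of the one-byte length field, which is why later dialects widened the prefix.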

•

jll29 4 days ago

It is conceivable, for both Pascal and C, to have more than one string implementation side by side, so the developer can choose to use the best-fitting one.

Since C++17, std::variant<> lets you do what Rust's typed enums introduced (e.g. a Result sum type that is either a "real" result, with a result type, or an error, with an error type, each strongly typed).

If you do that, a definition like

  #include <variant>

  class IString { /* basic string functions */ };

  class MiniString : public IString {};
  class CZeroTerminatedString : public IString {};
  class PascalString : public IString {};
  class CppString : public IString {};

  // Define one sum type covering all implementations.
  using String = std::variant<MiniString, CZeroTerminatedString, PascalString, CppString>;

permits to define string functions that operate over the sum type String, and which use the methods defined in the interface IString, and which then work for all string implementations.

The developer can then pick the most suitable implementation, i.e. MiniString for very, very short strings (that fit into 64 bits, so approx. <= 8 UTF-8 code units), CZeroTerminatedString (for char *co = "test\n"; zero-terminated old style C strings), PascalString for strings that carry a length in s[0] or as a struct member or class field, and CppString as a wrapper for the C++ std::string that implements IString.

Sum types are a type-safe and memory-preserving way to do what in the older days was sometimes implemented using a "union {}" (which was not type-safe).

•

badsectoracula 3 days ago

Free Pascal strings (and i assume Delphi as they are sometimes compatible) are pointers to the first character of a null terminated string with a header in a negative offset before the first character indicating the string's length, reference count and codepage.

AFAIK a "Pascal string" is basically another way to say "length-prefixed string" (as opposed to null terminated string) and Free Pascal (and Delphi) are like that (and they're Pascal dialects too, so their strings are literally Pascal strings :-P).

•

Conscat 4 days ago

A Pascal string has a leading length byte. Because that is one byte, the text can't exceed 255 characters.

•

t-3 3 days ago

For modern hardware, a 64-bit length is more practical though - no alignment issues. It seems to me that Pascal's specifying a single byte prefix was a language design "mistake" of the same type as NULL termination, putting hardware considerations into the language definition. Very practical for machines of the time, but not necessarily the best choice in hindsight.

•

badsectoracula 3 days ago

The original Pascal didn't have a string type, that was introduced by various dialects.

FWIW all Pascal dialects since the 90s have a string type that allows more than 255 bytes. In Free Pascal strings are pointers to the first character with a header in a negative offset indicating the length, reference count and codepage (these fields are aligned depending on the CPU). For C compatibility the string is also null terminated so you can pass such a string to a C function and it'll work as expected. AFAIK Delphi also does the same.

•

robocat 3 days ago

> single byte prefix was a language design "mistake"

Easy to say in hindsight.

It was an optimisation made back when every single byte mattered because you might have some kilobytes of memory and a 6502 CPU (where you strongly avoided using 16 bit pointers or arithmetic - because your program would be too bloated otherwise).

At the time Pascal was used, a whole byte for each string was seen as a waste - so fixed length strings were often used instead.

•

a96 3 days ago

C strings (sentinel terminated strings) are infinite. Anything less than infinite is shorter than that. Anything with a known length is shorter than that. This makes them generic. Same type of constraint as you see in many algorithms.

•

jackbucks 4 days ago

It was definitely an interesting way to allocate pointers. I did once have a very large project where devs didn't understand this, and resolved hundreds or more off-by-one errors and memory overwrites in C due to this feature.

But at the same time, I think blaming the software was kind of a cop-out. Devs were in a hurry and simply didn't respect the rules. Given today's software engineers at large, nerfing programming languages so they can't destroy things might not be a bad idea. But AI will nerf everything.

•

fragmede 4 days ago

why is AI gonna nerf everything? sure it could be used as the easy button, but I just spent two hours this morning learning about the neuroscience of how memory works in the brain that I didn't mean to and now I want to run studies on how memory works.

Why do you assume that AI is gonna nerf everything?

•

AnimalMuppet 4 days ago

AGI might. AI? No way.

See, AI was trained on existing data - on all that existing C code out there (sure, and also on all the papers and articles saying what was wrong with that C code). Those bugs are in the training data, and often not marked as bugs. So when AI generates C code, is it going to avoid making the mistakes that human code made? No, it's going to generate the kind of code it was trained on. How could it be otherwise?

That's not going to nerf anything.

•

deathanatos 4 days ago

> See, AI was trained on existing data - on all that existing C code out there (sure, and also on all the papers and articles saying what was wrong with that C code). Those bugs are in the training data, and often not marked as bugs. So when AI generates C code, is it going to avoid making the mistakes that human code made? No, it's going to generate the kind of code it was trained on. How could it be otherwise?

The generalization of this is why I think all these AI companies writing blog posts where the marketing department is just jer—ranting endlessly about how AI will improve itself into the singularity is just crazy talk. They generate a random statistically likely output, and the most statistically likely output is mid. Exceptional outputs — the ones that wow us or move the needle are exactly that, unlikely. AGI is sci-fi, and LLMs will not change that.

You can see the same effect when AI emits bash, too, and especially so since most bash is terrible, and most users of bash do not put in the effort to learn bash and its foibles. So it outputs what most people write, which is not great.

•

AnimalMuppet 4 days ago

It still could happen, if they had a way to judge the exceptional outputs from the mid and terrible ones. But I'm not sure they have that...

•

ComputerGuru 4 days ago

I'm far from an AI fanatic, but I would argue training it on GitHub PRs and general software patches already provides that. Instead of just seeing the static snapshot it sees “this code was replaced by this (hopefully better) code”.

•

CamperBob2 4 days ago

When's the last time you saw a decent coding model create a buffer-overflow bug while trying to use C strings?

Serious question. Anyone else seen this happen in the last 12-18 months? If so, which model and version were you using?

•

smackeyacky 4 days ago

I had Claude write a bit of stupid C# the other day that had an off by one string truncate. Surprised the hell out of me.

•

macintux 4 days ago

Would you even know? Serious question. The volume of code the models can produce, the subtle ways these bugs can manifest (or even only manifest when under attack), it seems like they would be easy to overlook.

•

CamperBob2 4 days ago

I have a habit of getting GPT 5.5 to review everything Opus writes for me, and vice versa. The model in the reviewer role frequently finds things I overlooked myself. Occasionally in parts of the code I wrote.

No modern LLM has found any buffer overflow bugs in parts of my code that originated from another LLM. Again, though, they have found one or two that were my fault.

•

bigstrat2003 3 days ago

Having one clanker verify the output of another has minimal, almost non-existent, persuasive value as evidence.

•

CamperBob2 3 days ago

Hopefully you're close enough to retirement age that it's a moot point for you.

If not, you're headed for a bad time, a major attitude adjustment, or (most likely) both.

•

smj-edison 4 days ago

I use Zig, which has slices, so so far none. But man, it can't get ref counting right to save its life. There have been remarkably few times it's gotten it right on the first try. My codebase considers OOM recoverable, so it keeps forgetting to clean up memory when OOM is raised. Even in the happy path though it still messes up ref counting. I use Kimi k2.6.

•

krupan 4 days ago

How many people are writing C code with LLMs? I get the impression it's mostly JavaScript web apps

•

CamperBob2 4 days ago

All the time. C, C++, occasionally some VHDL or Verilog.

•

layer8 4 days ago

Almost as bad as newline-terminated lines. ;)

•

sourcegrift 3 days ago

What's bad about them, and what are the alternatives? Genuinely curious, since I've never seen them spoken of.

•

msla 4 days ago

In addition to having to pick a size for the length counter and then, later, having to differentiate between lengths in bytes, codepoints, and glyphs, you can't subdivide a Pascal string using pointer arithmetic. To pass just the end of a string into a function, you have to either copy the tail of one Pascal-style string to another with a smaller size value, or your string has to be a struct with an integer and a pointer to the actual data instead of just an integer stuck on the beginning of the string. The first is a lot of copying in some cases, the second raises the specter of structs with invalid pointers. That's not to mention the potential problems that would cause with caches.
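
For what it's worth, the "struct with an integer and a pointer" option can be sketched in a few lines of C. This is only an illustration: `str_view`, `sv_from_cstr` and `sv_tail` are made-up names, and the view does not own the bytes it points at, so the invalid-pointer concern above still applies if the underlying buffer goes away.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning view: a pointer into existing bytes plus a length. */
struct str_view {
    const char *data;
    size_t len;
};

/* Wrap a NUL-terminated C string; this is the only place that scans for the terminator. */
static struct str_view sv_from_cstr(const char *s)
{
    struct str_view v = { s, strlen(s) };
    return v;
}

/* Drop the first n bytes of the view. n is clamped to the length,
 * so the result never points past the end. No bytes are copied. */
static struct str_view sv_tail(struct str_view v, size_t n)
{
    if (n > v.len)
        n = v.len;
    v.data += n;
    v.len -= n;
    return v;
}
```

Passing "just the end of a string" is then `sv_tail(v, 6)`, which costs one pointer add and one subtraction instead of a copy.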

•

cornholio 4 days ago

You can have a universal variable length field, for example 2 bytes for strings < 32768, then four bytes, 8 bytes etc. On the critical short string path, it costs just a single bit test. The glyph vs byte issues need to be dealt with in both formats.

The subdivision issue is a good perspective, but I would argue the performance impact of cloning substrings is dwarfed by the redundant full-string reads needed to find the length.

•

lelanthran 4 days ago

> You can have a universal variable length field, for example 2 bytes for strings < 32768, then four bytes, 8 bytes etc.

To hold the length of a string, I'd do something similar to unicode:

7-bits for size + 1-bit for continuation, then 15 bits for size + 1 bit for continuation, then 23-bits for size + 1 bit for continuation, etc.

Or maybe even do it exactly the same as unicode:

    0XXX XXXX -> length of string is in those 7 bits
    1XXX XXXX  XXXX XXXX -> length of string is in those 7+8 bits
    11XX XXXX  XXXX XXXX  XXXX XXXX-> length of string is in those 6+8+8 bits
    ...

> On the critical short string path, it costs just a single bit test.

A few more clock cycles compared to NULL-termination, although my alternatives above require even more clock cycles.

If the hardware had instructions for sentinel values, things would be easier (Like how DOS calls used '$' termination for strings) and safer.

Load a sentinel byte into a register and have dedicated copy and compare instructions that take each two addresses (src and dst) and copies (or compares) src/dst until the terminator is reached (with copy copying the sentinel as well).

Considering that sentinel values are needed so often, and are so useful, it's surprising that this is not in any ISA. What we have now is kludgy workarounds in the HLL for this. It's hard to blame the HLL, because some workaround has to be implemented.

•

cornholio 4 days ago

Personally, I would avoid UTF-8 levels of complexity because you only pay the size cost once per string. A simple 2-bytes + optional 4 bytes continuation scheme handles strings up to 140TB and increases the size of the average string by just 2 bytes (compared to 1 byte for nul termination).
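
A rough sketch of such a prefix, under my own assumptions about the layout: the top bit of the first 16-bit word is the continuation flag, the low 15 bits are the low bits of the length, and the optional 4 bytes carry the upper 32 bits, all little-endian. That gives 15 + 32 = 47 bits, i.e. lengths below 2^47 bytes (about 140 TB). `len_encode` and `len_decode` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the length prefix for len into out (which must hold 6 bytes).
 * Returns the number of bytes written (2 or 6), or 0 if len needs more than 47 bits. */
static size_t len_encode(uint64_t len, unsigned char *out)
{
    if (len >> 47)
        return 0;
    uint16_t head = (uint16_t)(len & 0x7FFF);   /* low 15 bits */
    uint32_t rest = (uint32_t)(len >> 15);      /* upper 32 bits */
    if (rest != 0)
        head |= 0x8000;                         /* continuation flag */
    out[0] = (unsigned char)(head & 0xFF);
    out[1] = (unsigned char)(head >> 8);
    if (rest == 0)
        return 2;
    for (int i = 0; i < 4; i++)
        out[2 + i] = (unsigned char)((rest >> (8 * i)) & 0xFF);
    return 6;
}

/* Read a length prefix from in. Returns the number of bytes consumed (2 or 6). */
static size_t len_decode(const unsigned char *in, uint64_t *len)
{
    uint16_t head = (uint16_t)(in[0] | (in[1] << 8));
    *len = head & 0x7FFF;
    if (!(head & 0x8000))                       /* short-string path: one bit test */
        return 2;
    uint32_t rest = 0;
    for (int i = 0; i < 4; i++)
        rest |= (uint32_t)in[2 + i] << (8 * i);
    *len |= (uint64_t)rest << 15;
    return 6;
}
```

Strings shorter than 32768 bytes take the early return in `len_decode`, which is the single bit test mentioned earlier.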

•

rswail 4 days ago

So what exactly is the NUL/0 in the code below other than a sentinel value?

    while (*d++ = *s++)
        ;

•

lelanthran 4 days ago

I am not sure what this is in response to. Can you explain which point of mine you are responding to?

•

rswail 4 days ago

> If the hardware had instructions for sentinel values, things would be easier (Like how DOS calls used '$' termination for strings) and safer.

A zero is a sentinel value and is catered to by all ISAs.

Why would using a "$" be any easier/safer than a NUL?

•

lelanthran 4 days ago

> A zero is a sentinel value and is catered to by all ISAs.

> Why would using a "$" be any easier/safer than a NUL?

I didn't say it had to be '$'; I specifically said that the sentinel would be loaded into a register. In that case it could be anything, including zero (for the snippet you posted), or INT_MAX if the code iterated across an array of integers, etc.

By having rep/mov variants that use sentinels, a lot of the HLL problems go away - Java, C#, Python, etc would all look very different today if the ISAs from the 80s had included sentinel variants of memory instructions.
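
In C terms, the instruction I have in mind would behave roughly like the function below. This is a software sketch only, not any real ISA's semantics; `copy_until` is a made-up name, and the `cap` bound is an extra assumption added here so the sketch cannot run off the end of `dst`.

```c
#include <stddef.h>
#include <limits.h>

/* Copy ints from src to dst up to and including the first element equal to
 * sentinel, writing at most cap elements.
 * Returns the number of elements written, or 0 if no sentinel was found within cap. */
static size_t copy_until(int *dst, const int *src, int sentinel, size_t cap)
{
    for (size_t i = 0; i < cap; i++) {
        dst[i] = src[i];
        if (src[i] == sentinel)
            return i + 1;   /* the sentinel itself is copied too */
    }
    return 0;
}
```

With `sentinel` set to 0 this is a bounded version of the `while (*d++ = *s++)` loop; with `INT_MAX` it handles an integer array the same way.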

•

rswail 4 days ago

Except that nearly all ISAs treat zero as a special value, with a Z-flag or equivalent for the last ALU result, and conditional branches around that result.

PDP-11s, 68Ks, nearly all ISAs that I know about treat zero as special.

It falls naturally out of the ALU operations.

So why would people writing assembler code use another value unless they had to?

•

lelanthran 4 days ago

That's my point - they didn't, and used the zero as a sentinel when designing their HLL.

If, OTOH, the ISA had additional variants of those instructions that allowed usage of anything as a sentinel, HLL implementations of array would never have needed a fat pointer (length + memory).

•

rswail 3 days ago

Except that the ISA has a perfectly good ALU there that can detect zero really easily, so no one was going to waste silicon on an instruction that required comparison to yet another value (which essentially would be an additional subtraction or OR or equivalent) added to the loop.

The fat pointers are much more efficient in that you don't need to scan memory to get the length or find the end to append or take slices etc.

Especially for vectors that don't have any value that can be used as a sentinel.

•

Parodper 3 days ago

You could do 0xffff as a special case, and put another length+string/pointer to after the 255th byte.

•

estebank 4 days ago

The third option is to have a variable width length: the top most bit signals whether the next byte corresponds to the length or to the start of the string.

•

pjc50 4 days ago

.. which is why you need a second type, the one dotnet calls "Span". A substring.

•

themafia 4 days ago

> Pascal style strings were much safer.

The limitations were brutal. Initially you could only have 255 bytes in a string. The length of a string and the size of the allocation are now separate and you may need to think about that unused memory in your design. The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.

If you want to create an array of strings you either need to specify the length of all strings and accept the memory overhead or have an array of pointers to strings. If you use an array of pointers you may end up choosing to use the 'nil' value as a sentinel that means "end of list." So we're right back where we started.

Because someone decided to downvote this HN has limited the speed at which I can reply. This site is tragic and I'm fully done with it now. You can spread propaganda and poorly sourced zeitgeist and be among friends but if you try to have a genuine conversation about programming languages you are made to be unwelcome immediately. Screw this.

> No other data structure works like this.

The linked list.

> You can't mess this up in an array

C happily decomposes arrays into pointers. You can erase your length information from the type. This was an intentional decision.

> Strings are the only data structure that assume there will be a NULL at end.

Which is why almost every string API has a version that allows you to specify the maximum length. The fact that you can use a NUL doesn't mean you have to. Which is why the concept of "sentinel values" is broadly used in many types of applications you haven't considered here.

•

dare944 4 days ago

> You can spread propaganda and poorly sourced zeitgeist and be among friends but if you try to have a genuine conversation about programming languages you are made to be unwelcome immediately.

Indeed. And the ignorance of computing history in this discussion is particularly disturbing.

The context of this particular thread is "zero terminated string is ... computing's biggest mistake". This completely ignores the situation on the ground when C was developed. At the time, people were striving for a system programming language that sat above the level of assembly but was compact enough to run within the limited resources of the then emerging mini-computer systems. The PDP-11 on which C was developed was certainly not the first mini-computer, but it was among the earliest to have a regular enough instruction set and addressing model to make a general purpose, high-level systems language possible. These systems were extremely limited in memory; the PDP-11's instruction set is limited to directly addressing at most 64KiB (code and data) and many systems of the era were hardware limited to less than that. (Indeed, I regularly run an early version of Unix, including an early C compiler, on my PDP-11/05 which is maxed out at 56KiB [of actual core]). There was no way that even a brilliant engineer like Dennis Ritchie was going to be able to shoe-horn in "optional" types, or the mechanics of length-value strings into a compiler that has to run in such limited space, and produce code (e.g. the Unix kernel) that has to run in even less. The fact that strings and arrays are thin abstractions on top of pointers is both a brilliant compromise in design as well as a nod to then-prevalent assembly practice. It was exactly the kind of pragmatic decision that was needed to move computing along at the time. Of course the designs from this era are antiquated now. But they were not mistakes.

•

pwg 3 days ago

> This completely ignores the situation on the ground when C was developed.

A great many of those replying are many years short of having experienced anything like "the situation on the ground when C was developed". They simply have never known a day without hundreds of gigabytes or more of disk storage and 8G or more of RAM available for user processes after the OS consumes what it needs for its own work. They are "ignoring" because they simply have no basis for understanding.

•

rswail 4 days ago

The C code for strcpy is:

    while (*d++ = *s++)
         ;

On a PDP-11 that is:

    L:  MOVB (R1)+, (R2)+
        BNE L

•

pjmlp 4 days ago

All those limitations were sorted out in 1978 with Modula-2 and open arrays, aka spans.

What about the UNIX and C folks propaganda of C being the first systems language, or always focusing on the original Pascal used for teaching and not everything else that followed up with Mesa, Modula-2, Ada, Object Pascal and friends, none of them with said limitations.

•

rswail 4 days ago

C was specifically developed to allow Unix to be ported.

It was a systems programming language and the first well known/successful one.

There was BCPL and then B before that, which is why the language is called "C".

Pascal was considered a teaching language, along with "Algorithms + Data Structures = Programs" by Wirth etc.

The UCSD P-system was one of the first "IDEs" and used Pascal and a bytecode interpreter of the compiled code.

Modula-2 was barely available in the early 1980s.

Ada was mired in MIL-SPEC and expensive compilers etc.

People used FORTRAN for scientific programming, C for most everything else in the non-IBM mainframe world.

•

pjmlp 4 days ago

Successful systems languages trace back all the way to JOVIAL in 1958.

You missed quite a few between JOVIAL, and C being adopted outside Bell Labs.

Modula-2 was as widely available as C was outside UNIX and universities with access to UNIX source code.

It took a while for proper C to actually be "used for everything else", until the early 1990s actually, and by then anyone sensible would be much better with Typescript for C, aka C++.

•

rswail 4 days ago

I dunno, I was starting my career around 1980-81, and the choices were 6502/6509/Z80/8088 asm, C, UCSD-Pascal, and BASIC at the micro level, C/asm was the rule for RT-11/RSX-11 and then the VAX OSs at the "minicomputer" level.

I had a friend that tried to get everyone using Modula-2 but the "ecosystem" wasn't as great around the uni/ex-uni environments where I was.

C was pretty entrenched by the end of the 1980s, although I did use a weird embedded Pascal that was on HP-UX cross-compiling for Z80/8086 at the end of the decade, but they were the exception rather than the rule.

C++ was just a preprocessor for C and a "better C" at the time, people were still bitching about header files with function type signatures of "ANSI C" vs "good old-fashioned K&R".

We also tied onions to our belts...

•

pjmlp 3 days ago

The fact alone that you mention RT-11/RSX-11 and VAX OSs shows we were not in the same bubble.

Anyway on VMS most folks would be found using the VMS BASIC compiler, VMS Pascal or Bliss, until Open VMS made it yet another UNIX clone.

My first C compiler used the RatC dialect, let alone having access to a proper K&R C compiler.

By 1992, I already had access to a proper C++ compiler on MS-DOS, and C was history to me, other than scenarios where work was expected to be done in C, like some university assignments. Even there we were blessed with a plethora of languages: Lisp, Prolog, Smalltalk, ML, C, C++,....

•

rswail 3 days ago

My bubble was obviously technically superior to your bubble. :)

And maybe it was 5/10 years earlier? Not sure. My uni days were the very early 1980s. Our university literally still made 1st years use marked sense (not even punch) cards.

I never ever liked C++, it always seemed to be tacked on to the side of C (literally at the start).

I liked the "better C" bits, but the "++" bits and the magic under the covers and then later the added layer of templates just seemed ridiculously complicated especially because we were still in the days of inheritance and "is-a" instead of "has-a" objects.

I loathed all the overloading that suddenly << meant something completely different when doing I/O and weird multiple function definitions to provide the generics.

Much preferred the Objective C idea of messages, was much more what I understood OOP to be after Smalltalk.

But by then I'd made the leap to being an "architect" and got to pontificate from on high and languages became semi-irrelevant.

•

jll29 4 days ago

> Typescript for C

You mean "C with Classes", later to be replaced by "C++" (Stroustrup picked this as his favorite from a list of candidate names he crowdsourced) as implemented by Cfront.

•

rswail 4 days ago

I preferred "P" because of the BCPL ancestor.

If BCPL begat "B" and "B" begat "C", then "C" should have begatten "P".

Not sure if begatten is a word :)

•

pjmlp 3 days ago

BCPL is short for Bootstrap CPL, given its initial purpose; the fact that it took off on its own was not planned.

•

rswail 3 days ago

Most good ideas never are.

•

a96 3 days ago

It's "begotten" or "begot". :)

•

pjmlp 3 days ago

Of course, although Typescript for C++ is a more modern way to put it for younger generations.

•

pjc50 4 days ago

> You can erase your length information from the type. This was an intentional decision.

Well yes, but given the number of security issues the argument is that it was in retrospect the wrong decision.

•

BigTTYGothGF 4 days ago

> Your string size is in bytes and you need to track characters separately

No worse than C strings then.

•

AlienRobot 4 days ago

>The problem now doubles with the introduction of UTF-8. Your string size is in bytes and you need to track characters separately.

That isn't really a problem.

The problem with null-terminated strings is specifically what happens when you reach the end of the allocated array and there ISN'T a NULL character.

Every string function is designed to keep going until it finds the NULL character, so if a hacker gets rid of the NULL character, he can exploit pretty much any standard string manipulation function being used elsewhere in the program to manipulate whatever memory comes AFTER the string data structure.

No other data structure works like this. You can't mess this up in an array, because no function that manipulates arrays is just going to keep going until there is a null. That would be stupid because it would require users of the function to add a NULL to the end of their arrays before passing it to the function, so instead we just pass the size of the array to everything. Strings are the only data structure that assume there will be a NULL at end.

By the way, I read once that if you use UTF-32 every code point will be 4 bytes, constantly, but even then a single code point isn't necessarily a single character. Text is just complicated.

•

tredre3 4 days ago

> No other data structure works like this.

In C most data structures work like this, you keep going until you find NUL (character) or NULL (pointer). E.g. Strings, array of pointers, linked lists, etc. Of course you can add length to most of those, but it isn't the canonical/traditional way of doing things.

•

AlienRobot 4 days ago

That can't be true. If you have an array of pointers it can be terminated in NULL. But an array of integers can't have a NULL value, since NULL would probably be just 0 which is a normal integer.

The null in a linked list is the null in the .next field, right? That's the way you would implement linked lists independent of language. It's not the .value that is null.

A string is an array of characters (well, for characters representable in one byte at least) that has a specific value to represent the end of string.

It would be like if Int::MAX was reduced by 1 to make space for an Int::NUL constant that represented the end of an integer array. Or if you were creating your own ENUM, let's say for NORTH, SOUTH, EAST, WEST, and you added a fifth enumeration called Direction.NUL for use in arrays.

•

jkrejcha 4 days ago

With a variable-length array of structs, you can set all the fields to 0, at the cost of an extra member at the end. In the cases where this is done, the structures are such that (either intentionally or by consequence) something with all fields zero is outside of the function's domain.

•

dare944 3 days ago

> No other data structure works like this. You can't mess this up in an array, because no function that manipulates arrays is just going to keep going until there is a null.

This is patently false. Sentinel markers are used widely in array types. Consider GNU's getopt_long() function, a mainstay in GNU tools:

The argument longopts must be an array of [struct option] structures, one for each long option. Terminate the array with an element containing all zeros.
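
The same pattern in miniature, as a sketch rather than the real `getopt_long` interface: `struct cmd`, `commands` and `cmd_lookup` are made-up names, and the table ends with an element whose fields are all zero.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry; an element with name == NULL marks the end. */
struct cmd {
    const char *name;
    int id;
};

static const struct cmd commands[] = {
    { "start", 1 },
    { "stop",  2 },
    { NULL,    0 }   /* terminator: all fields zero */
};

/* Walk the table until the terminator; return the id, or -1 if name is not present. */
static int cmd_lookup(const struct cmd *table, const char *name)
{
    for (; table->name != NULL; table++)
        if (strcmp(table->name, name) == 0)
            return table->id;
    return -1;
}
```

No length is passed anywhere; forgetting the terminating element makes `cmd_lookup` read past the end of the array, which is exactly the failure mode being discussed for strings.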

•

lelanthran 4 days ago

> Every string function is designed to keep going until it finds the NULL character, so if a hacker gets rid of the NULL character,

What sort of situation are you envisioning where a hacker can remove the sentinel (in the case of nul-termination) but not modify the length bytes (in the case of fat pointers)?

•

AlienRobot 3 days ago

A situation in which a string is manipulated with buggy code that can remove the sentinel, e.g. the program uses strncpy, there is a bug in how it uses it, the hacker exploits the bug.

By contrast it's pretty unlikely for buggy code to mess the length. Add an element? +1. Remove an element? -1. Number of elements larger than capacity? Allocate a new array. Not much room for error.
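
To make the strncpy case concrete, here is a minimal sketch (helper names are made up). Standard `strncpy` writes exactly `n` bytes and adds no terminator when the source is `n` bytes or longer, so the destination can end up with no NUL at all.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if buf[0..size) contains a NUL terminator, 0 otherwise. */
static int is_terminated(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Copy src into dst (dst_size bytes) with strncpy, then report
 * whether the result is a valid NUL-terminated string. */
static int copy_with_strncpy(char *dst, size_t dst_size, const char *src)
{
    strncpy(dst, src, dst_size);   /* no terminator if strlen(src) >= dst_size */
    return is_terminated(dst, dst_size);
}
```

With a 4-byte destination, copying "abc" leaves a terminated string, while copying "abcdef" fills all four bytes with "abcd" and leaves no NUL, so any later `strlen` or `strcpy` on that buffer runs into whatever memory follows.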

•

imtringued 4 days ago

If I zero out the destination buffer of a strcpy, and the string is longer than the destination buffer I will run into a buffer overflow problem despite every byte being a zero byte. The absence or presence of the zero byte doesn't seem to be the deciding factor.

•

sourcegrift 3 days ago

Rust has "pascal style strings" (quotes because the concept is slightly different) so it's not a done deal

•

lelanthran 4 days ago

> the zero terminated string is I think is computing's biggest mistake.

No. They had trade-offs to make, and sentinel-based sequences are a needed thing, even outside of strings.

The mistake was that ISAs never looked at what HLL needed, then add the necessary instructions (I posted more about this below).

Even NULL is not a big mistake, when looked at in context of the time in which it was developed.

•

fragmede 4 days ago

compared to Von Newman versus Harvard architecture for LLMs? I think that's a far bigger mistake.

•

pjc50 4 days ago

Neumann, and .. what? In what way?

•

fragmede 4 days ago

Prompt injection only works because there isn't two streams of input to give to the LLM. Von Neumann being the architecture with a single shared memory for both data and instructions. If there were a clean way for the LLM model to distinguish between system messages vs user messages, we wouldn't have that problem.

•

amomchilov 3 days ago

I don’t know how you could keep the two isolated without drastically dropping the utility of LLMs.

Part of their wonder is how they can behave differently depending on the data they’re working with. We like that feature when the data is the “good stuff” (docs, compiler messages, etc.), but how do you tell that apart from “bad stuff” (prompt injection on official-seeming pages)?

We basically expose LLMs to the same social-engineering vulnerabilities that humans have.

•

dietr1ch 4 days ago

I think it was NULL itself. It was a long way until we realised we don't want invalid values and could use the type system to help us use special values safely.

•

jkrejcha 4 days ago

The problem here is that null kinda is consequential of intentional design of the type system itself. In this way, I do think that null was discovered, rather than invented. Remember, C is a kinda "portable assembler" so the constructs in it are based relatively closely to how low level data structures are mapped out in memory.

This is, and continues to be, an incredibly useful feature that makes C and C structs immensely useful concepts. Part of that does need an invalid value[1]. NULL is convenient for this and although there are some very weird JavaScript-trinity-meme-style consequences for this[2], it's such a useful concept that basically all languages that have the ability to construct pointers have a null pointer[3].

The alternative world looks like everyone inventing their own invalid values. Invalid, non-null, pointers are typically MUCH worse than null pointers for debuggability and security. If you unintentionally read/write/execute memory at 0x0 (by far the most common value for NULL), most operating systems will trap this, whereas may not necessarily if 0x12345678 is your invalid value.

[1]: Stuff like IA64 had NaT bits which were effectively an extra bit for what I assume to be this sorta thing. The problem with this is that it costs an extra bit. I don't really know much about IA64, but presumably [NaT 1] + [don't care] would be your null pointers here. I think?

[2]: Really what the standard, in my opinion, should have done is probably not make use of the null pointer UB for many different functions. A lot of compilers took the UB surrounding that to make incredibly dubious "optimizations" that broke stuff with zero actual performance benefit whatsoever

[3]: Yes, even Rust. Although some (again in my opinion) unfortunate design decisions made it so that C-Rust FFI isn't zero cost because of how it treats spans/slices

•

imtringued 4 days ago

>[3]: Yes, even Rust. Although some (again in my opinion) unfortunate design decisions made it so that C-Rust FFI isn't zero cost because of how it treats spans/slices

If Rust slices already make you sad, then the thing I'm cooking up will make you cry for days.

•

bellowsgulch 4 days ago

Compared to scripting languages with actual tagged types, C doesn't really have a type system, and that's readily apparent to anyone who has written C in the last 43 years and debugged a program written in it.

C pretends types exist with you, but once bytes hit the road, it's all real-life and segmentation faults.

•

AlotOfReading 4 days ago

C actually does have a type system and it's one of the bigger issues with the language. If it didn't, unaligned pointers and signed overflow would be totally fine.

•

Gibbon1 4 days ago

Problems with unaligned pointers are basically a hardware defect. Signed overflow is an issue because academics are unhappy computers only can do finite math.

Issue with types and C is while the compiler knows about them the standards committees don't want you to be able to. If C had first class types more people would abandon C++ and that can't be allowed to happen.

•

imtringued 3 days ago

The concept of alignment isn't a hardware defect, maybe a limitation, but the reason why alignment is a thing has to do with the fact that on-chip interconnects transfer blocks. You cannot perform misaligned memory accesses against RAM.

A similar limitation exists when performing accesses against the cache, but at a much finer granularity.

For bytes, the alignment restriction obviously exists in the 8 bit level. You have one output byte and 64 multiplexer inputs.

If you scale this up to 8 bytes, you will need a lot of 64 Input multiplexers.

But even if you can take the silicon area hit, there is the problem of crossing cache lines and pages.

In the end, you cannot divide memory into blocks and allow primitives to cross those blocks without requesting both blocks at the same time. That's an inefficient waste of resources so why support the wasteful usecase in the first place?

•

Gibbon1 3 days ago

With modern CPU's unaligned accesses only matter when it straddles a cache line address not a register address.

•

AlotOfReading 3 days ago

Unaligned pointers are undefined behavior even when the hardware fully supports unaligned access, because you're violating the type's rules.

To be honest, I've never seen much indication that the C and C++ committees are particularly fond of each other. They sometimes coordinate, but they're mostly content letting each other evolve in different directions. C is the way it is only after a long process of evolution away from the bits and bytes of BCPL into the strictly typed language we got from ANSI.
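
On the unaligned-access point, the usual way to stay inside the type rules is to copy the bytes rather than cast the pointer. A minimal sketch (`load_u32` is a made-up name):

```c
#include <stdint.h>
#include <string.h>

/* Read a uint32_t from any byte address, aligned or not.
 * Dereferencing (const uint32_t *)p would be undefined behavior if p is
 * misaligned for uint32_t; copying the bytes into a properly typed object
 * with memcpy is well defined. */
static uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

In my experience mainstream compilers tend to compile this down to a plain load on hardware that allows unaligned access, so the cost of following the rule is usually small, though that is an optimization and not something the standard promises.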

•

Gibbon1 3 days ago

Undefined behavior according to WG14 but perfectly fine on most architectures. Even on architectures that don't support it (what the fuck ARM cortex) the compilers do support it.

•

AlotOfReading 3 days ago

Yes, that's exactly the point I've been making. It's undefined today because C has a type system and that type system has this arbitrary rule, not because there are implementation constraints necessitating it (where it could just be implementation defined instead).

•

DarkUranium 4 days ago

By that logic, no natively-compiled language has a type system.

Though I should note that in a way, even some ISAs have one, what with e.g. separate float vs integer registers.

•

atherton94027 4 days ago

Genuinely curious, how would you handle cases where a value is unset without NULL? This is a legitimate case that happens a lot in eg data modeling

•

pdimitar 4 days ago

Sum types, of course.

•

lelanthran 4 days ago

How do you expect to use sum types in assembly? Remember where C came from and why it was designed the way it was.

•

imtringued 4 days ago

A naive sum type is just a tag plus a payload. There is no problem here. If you have enums you could have had sum types.

The historical argument and appeal to assembly is illogical here. The only real argument is that niche value optimization is too complex or too clever for the time so even if sum types were in C, nullable pointers would still exist either way.
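
A sketch of what "a tag plus a payload" means in plain C, for the unset-value case asked about upthread. All names here (`opt_int`, `opt_some`, `opt_none`, `opt_get_or`) are made up, and nothing stops a caller from reading `value` without checking the tag, which is the part a language-level sum type would enforce.

```c
/* Hypothetical "optional int": the tag says whether the payload is meaningful. */
enum opt_tag { OPT_NONE, OPT_SOME };

struct opt_int {
    enum opt_tag tag;
    int value;          /* valid only when tag == OPT_SOME */
};

/* Construct a present value. */
static struct opt_int opt_some(int v)
{
    struct opt_int o = { OPT_SOME, v };
    return o;
}

/* Construct the "unset" case without reserving any int value as a sentinel. */
static struct opt_int opt_none(void)
{
    struct opt_int o = { OPT_NONE, 0 };
    return o;
}

/* Return the payload if present, otherwise the caller's fallback. */
static int opt_get_or(struct opt_int o, int fallback)
{
    return o.tag == OPT_SOME ? o.value : fallback;
}
```

The cost is the extra tag word per value, which is what niche value optimization removes for pointers by reusing the null bit pattern as the tag.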

•

pdimitar 4 days ago

I remember why C stayed what it is at least: elitism and gatekeeping. And YAGNI, repeated millions of times, of which only the first few were correct.

You're telling me OCaml / Rust / Haskell compile to fairy pixie dust? Obviously their compilers figured it out and it works.

•

lelanthran 4 days ago

> I remember why C stayed what it is at least: elitism and gatekeeping.

If that was the goal, it failed horribly - the gatekeeping didn't work because the popularity exploded.

> You're telling me OCaml / Rust / Haskell compile to fairy pixie dust? Obviously their compilers figured it out and it works.

I said nothing of the sort.

•

pdimitar 4 days ago

You asked how sum types work in assembly. I'm telling you that at least 3 compilers figured that part out.

•

lelanthran 4 days ago

> You asked how sum types work in assembly.

No, I didn't - I asked how sum types were supposed to work in an era of 64KB memory systems.

•

imtringued 4 days ago

They don't need extra memory in Rust for the case of nullable pointers.

The boring cases require an enum tag in C too.

By bringing up the one thing that doesn't matter, your argument becomes purely ideological.

•

lelanthran 4 days ago

You're missing the point - give me a Rust compiler that can run and compile in 64KB memory, then you'll understand that the language C was constrained not just by what the output is running on, but by what the machines of the time could actually handle during compilation.

•

Parodper 3 days ago

Borland's PASCAL did it on the IBM PC.

And which modern C compiler fits into 64KB? Even TCC needs 100KB. But that's beside the point. No machine of the last 36 (I'll push my chances, 40) years needs to fit a compiler in 64KB.

•

lelanthran 3 days ago

> Borland's PASCAL did it on the IBM PC.

That's famously a single-pass compiler. Rust is famously unable to compile in a single pass.

It is not possible to make a borrow-checking language that compiles in a single pass.

> No machine of the last 36 (I'll push my chances, 40) years needs to fit a compiler in 64KB.

Exactly - that's why C is what it is: it wasn't a mistake, they were working under the constraints of the time. My original comment (that you appeared to disagree with) said specifically "Remember where C came from and why it was designed the way it was."

Let me ELI5 it for you: It was specifically designed to emit assembly in a single pass because of the constraints of the time.

WTF does "Hur Dur Rust Goodest!" comments mean in this context?

•

Parodper 3 days ago

> That's famously a single-pass compiler. Rust is famously unable to compile in a single pass.

I probably should have replied under the other comment. I was also referring to your

> No, I didn't - I asked how sum types were supposed to work in an era of 64KB memory systems.

But context got lost between replies.

> that's why C is what it is

C famously had a big redesign in 1990. The language of today isn't the same K&R printed.

•

atherton94027 3 days ago

Pascal had pointers? They could be `nil` too https://www.freepascal.org/docs-html/ref/refse15.html

•

Parodper 3 days ago

The thread talked about sum types, which apparently appeared in ALGOL, although I don't know how much memory an ALGOL compiler needed.

•

atherton94027 4 days ago

How are you going to build sum types in a way where you can interact with assembly or machine code? The CPU doesn't know about that stuff

•

pdimitar 4 days ago

OCaml / Rust / Haskell.

Apparently they found a way to have the CPU know... about "this stuff".

•

imtringued 3 days ago

Sum types map down to reading a tag and doing a comparison against fixed values.

I don't know what to tell you, but you're clearly not cut out to be a software developer in either machine code, assembly or C or any other language if you don't understand something this basic.

•

atherton94027 3 days ago

Please tone it down. I'm arguing politely with you, but apparently you're so wrapped up in this that you're resorting to ad hominems.

Sum types aren't the be all end all to all issues, for example you cannot represent pointer values efficiently with sum types. Even Rust does not wrap up pointers with sum types. Now try to go back 37 years to C89 and ask yourself if they were going to require compilers to have stringent checks like the Rust compiler does.

•

pdimitar 3 days ago

Nobody claimed that "sum types are be all end all". I originally responded to "how would you handle cases where a value is unset without NULL" with "sum types" which are trivially representable with bit masks if memory usage is of big concern (and nowadays in 99.9999% of the cases it genuinely is not).

And, tagged unions are a thing and were a thing for a long time.

Of course it's too late to change all this today; it would have been too late even 20 years ago. But outside of f.ex. Linux kernel and some other super hardcore C libraries, a lot can be done for the world to migrate to safer constructs and away from sentinel values. And that's what languages like Rust do.

Super memory constrained environments have not been the mainstream programming work for decades and now remain limited to embedded / IoT. Not sure what the reservation against sum types is these days.

•

atherton94027 3 days ago

Yeah and I was trying to explain that sum types don't work for pointers, without a significant performance hit.

No one here is saying C is a great design, but in the context of 60 years ago, it worked out pretty well, and all the languages which had additional runtime complexity (Pascal of course, but also Ada and FORTH) struggled because they didn't dial in the right level of complexity.

•

bigstrat2003 3 days ago

That's unnecessarily rude, and untrue in any case. Everyone has to learn stuff sometime, and most people won't naturally run into the implementation details of how higher level languages get translated into machine code.

•

pdimitar 3 days ago

I agree his tone was not productive but the comment he responded to seemed like a disingenuous argument as well. "The CPU doesn't know about that stuff" is not true -- or it's arguing in bad faith. I mean, hello, tagged unions, all of us with some experience can write a C program that works with those. It's 100% false to say what he said.

•

clnhlzmn 4 days ago

The way we do it in modern languages with things like std::optional and even that is not the best example.

•

MBCook 4 days ago

And in higher level languages that works. But what do you do when you get down to low level C or assembly?

You basically end up with null/0 don’t you?

•

paavohtl 4 days ago

Rust is a significantly higher level language than C, but it can be used in almost all environments where C is used, provided there's a supported compiler target for it. In (safe) Rust, null is basically a guaranteed compiler optimization. Optional / nullable values are represented via Option<T>, which is a sum type of Some(T) and None. When a reference or other pointer-like value (e.g. Box<T>, an owned heap allocation) is wrapped in Option, the compiler can use the invalid bit patterns of T (such as null) to represent the None variant. This is called niche optimization.

So yes, it's nulls underneath, but the developer never has to think about them.

•

dietr1ch 4 days ago

Eventually you end up with registers that probably allow for 2^N values. But the point is not thinking about the machine executing the instructions, but the construction on top of it that has a safer design.

Seeking performance we've been very prone to avoid abstractions and over and over again have shown why we need the safe abstractions.

•

jibal 4 days ago

They already said:

> use the type system to help us use special values safely

... but this is not the place to explain what a type system is or what sum types/maybe/optional/etc. are.

•

jkercher 4 days ago

Meh, I think NULL is fine in C. It's an extra, valid state to represent pointers at no cost. Unlike the more hand holdy languages, it's quite rare for a pointer in C to have the ability to be NULL since, more often than not, it's pointing at something known. It's actually quite rare to see NULL checks unless it's API code or something like that. I can see this being more of a problem in a managed language where anything can be NULL at any time.

•

bvrmn 4 days ago

NULL as a concept is fine. Inability to declare something as non-null is not.

There is a huge gap between developer expectation "it's pointing at something known" and hard reality confirmed by zillions of CVEs. That's the reason optionality is prevalent in modern languages and type checkers (Python, TypeScript), nowadays even Java has sane non-nullable types.

•

kelnos 4 days ago

> to represent pointers at no cost

I wouldn't call "cause of bugs and security issues" "no cost".

> it's quite rare for a pointer in C to have the ability to be NULL

As a C programmer for more than 25 years, that is the exact opposite of my experience.

•

none_to_remain 4 days ago

Struct foo has various members, including a bar*. But a foo may or may not be associated with a bar. If there's no associated bar, the bar* pointer is NULL. Seen and done this all the time

•

XorNot 4 days ago

The problem with let's get rid of NULL is that it's a real, required state. The vast majority of computing is actually not binary: any real input generally has at least 3 possible states: not set, true and false.

In practice really 4 because "indeterminate" is a reasonable error condition you'd like to know about.

And it keeps increasing anyway: e.g. not set has subcategories: not set due to lack of user input, not set because we're loading state from the backend etc.

NULL is the first expression of that basic problem: it's definitely not enough to eliminate NULL because the first thing which happens is your non-pointer default value takes its place.

•

lambdaone 4 days ago

What you are describing is option types, which are an entirely valid and very useful construct that helps make programs more rather than less reliable. But you need proper language type system support and compile-time enforcement to make it work, and C does neither of those.

•

bnolsen 4 days ago

C++ and rust make these optionals ugly. Zig does it right. Zig also forbids null pointers and requires use of optionals.

•

cm2187 4 days ago

A lot of pain and suffering to avoid having a string datatype.

•

teo_zero 4 days ago

> A lot of pain and suffering to avoid having a string datatype

No, a lot of pain and suffering to work around the lack of a string datatype in C.

•

edoceo 4 days ago

What's a way they could get a string data type here? Wouldn't that also require a large refactor of the code around strncpy to use the type and its functions?

•

cm2187 4 days ago

Today yes, but 40 years ago someone made the decision that a string was a char array and that every string manipulation going forward would require manipulating arrays. Talking about costly decisions.

It’s actually interesting to compare the pain and suffering of switching to a string datatype in the 80s (refactoring the limited code base then) vs the next 40 years of unnecessary boiler plate syntax and bugs for not having this type in key APIs.

•

tialaramex 4 days ago

Linux doesn't exist in the 1980s, Linus started this work in the 1990s.

But yes, the string slice type should have existed in C89 and it's very obvious from here that not having something of this sort - maybe what Rust would call &[u8] the reference to a slice of bytes - was a big problem for C.

The correct way to represent this is what's called a "fat pointer". A pair of values, one is a conventional "thin" pointer to the start of the slice, and the other is a count. Your register pressure increases in the compiler backend but problems are significantly reduced because you have fewer bounds misses.
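
A rough sketch of what such a pair looks like if you hand-roll it in C today. `byte_slice`, `slice_get` and `slice_sub` are hypothetical names, not an existing API, and the checks happen at run time rather than in the type system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slice type: a "thin" pointer plus an element count. */
struct byte_slice {
    const unsigned char *ptr;
    size_t len;
};

/* Bounds-checked read: reports failure instead of reading past the end. */
bool slice_get(struct byte_slice s, size_t i, unsigned char *out)
{
    if (i >= s.len)
        return false;
    *out = s.ptr[i];
    return true;
}

/* Sub-slice [start, end); an invalid range yields an empty slice. */
struct byte_slice slice_sub(struct byte_slice s, size_t start, size_t end)
{
    struct byte_slice empty = { NULL, 0 };
    if (start > end || end > s.len)
        return empty;
    return (struct byte_slice){ s.ptr + start, end - start };
}

void slice_example(void)
{
    static const unsigned char data[] = { 'h', 'e', 'l', 'l', 'o' };
    struct byte_slice s = { data, sizeof data };
    unsigned char c = 0;

    assert(slice_get(s, 4, &c) && c == 'o');
    assert(!slice_get(s, 5, &c));           /* one past the end is rejected */
    assert(slice_sub(s, 1, 3).len == 2);    /* views "el" without copying */
}
```

Passing the struct by value costs two registers instead of one, which is the register pressure mentioned above.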

•

mirsadm 4 days ago

I'd be curious to see how much CPU time is wasted on looking for a null every time strlen is called. The extra length integer is probably insignificant compared to that.

•

tialaramex 4 days ago

It is very expensive if you repeatedly measure and forget the length, this is presumably some of the price in Google's problem where some engineers wanted to use 0-terminated char* as the type of a string but others wanted C++ std::string and so the software ends up measuring how long the string is, allocating and copying, then immediately forgetting that length, only to once again measure how long it is, allocate and copy again.

That's a language design defect, C++ got its string slice reference (named std::string_view) only in 2017, years after Rust 1.0 shipped this as a core language feature, even though C++ is decades older.

On the other hand I can well believe on a 1970s computer where you'd be lucky to have 64kB of RAM the trade looks very different, I just think that by C89 it should have been fixed.

•

1313ed01 3 days ago

The problem with C++ string slices is that after many years of C++ becoming increasingly memory-safe with std::string and smart pointers, now we reverted to something barely more safe than a C string.

I guess Rust can keep slices safe using some borrow checker magic or something, but C++ can't.

•

tialaramex 3 days ago

It's true that the borrowck is what checks you didn't screw this up in Rust, but "it's your job to never make mistakes" is just how C++ always works

If you keep a const pointer to a std::string in C++ in 2016 that's exactly the same danger (of dangling pointers) as for a std::string_view in 2026, there's no change to that part.

•

nuc1e0n 4 days ago

That's what pascal did back in the day, but 255 byte strings were all that was needed back then so only a byte was needed to store the length. Does that still sound maintainable? Anyhow, some developers put data into strings when they shouldn't, and require doing that in the APIs they publish. Strings, whether NUL terminated or with stored length, aren't always the best choice architecturally so making them easy to use isn't necessarily a good idea.

•

tialaramex 4 days ago

Are you muddling the Pascal strings with the string slice?

•

BoingBoomTschak 4 days ago

> Today yes, but 40 years ago someone made the decision that a string was a char array and that every string manipulation going forward would require manipulating arrays

That's not a bad thing, Common Lisp does the same and it Just Werks. The real problem is the more general "array to pointer decay", not arrays, really.

•

layer8 4 days ago

The decision was made almost 55 years ago for C and Unix: https://en.wikipedia.org/wiki/Null-terminated_string#History

•

wolfi1 4 days ago

around the same time Wirth decided to have a length prefix in his Pascal strings (that's why string addressing began at 1, because 0 would be the length of the string)

•

Tempest1981 4 days ago

I wonder if an early built-in string class would have been 8-bit or 16-bit UCS-2? Would have been hard to stomach 16-bit storage and performance.

•

jibal 4 days ago

The purpose of strncpy, which was originally part of the UNIX kernel code, was to copy file names to and from directory entries that consisted of a 2 byte inode number and a 14 byte zero-padded but not zero-terminated name field.
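
To make that concrete, a small sketch of the fixed-width-field use case. The struct is a hypothetical model of the 16-byte entry described above (it assumes `unsigned short` is 16 bits); `set_name` and `same_name` are illustrative helpers, not historical code.

```c
#include <assert.h>
#include <string.h>

#define NAME_LEN 14   /* width of the name field described above */

/* Hypothetical model of the directory entry; not the historical header. */
struct dir_entry {
    unsigned short ino;      /* inode number, assumed to be 2 bytes */
    char name[NAME_LEN];     /* zero-padded, not necessarily terminated */
};

void set_name(struct dir_entry *e, const char *name)
{
    /* strncpy writes exactly NAME_LEN bytes: it pads short names with
       zero bytes and writes no terminator for names of NAME_LEN or more. */
    strncpy(e->name, name, NAME_LEN);
}

int same_name(const struct dir_entry *a, const struct dir_entry *b)
{
    /* Because of the padding, the whole field can be compared at once. */
    return memcmp(a->name, b->name, NAME_LEN) == 0;
}

void dir_entry_example(void)
{
    struct dir_entry a, b;
    memset(&a, 0xff, sizeof a);              /* start from different garbage */
    memset(&b, 0xaa, sizeof b);

    set_name(&a, "passwd");
    set_name(&b, "passwd");
    assert(same_name(&a, &b));               /* padding overwrote the garbage */

    set_name(&a, "exactly14chars");          /* 14 characters: fills the field */
    assert(memchr(a.name, '\0', NAME_LEN) == NULL);   /* no terminator at all */
}
```

Modern compilers may warn about the `strncpy` call, precisely because this pattern is a bug whenever the destination is later treated as a C string.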

I started warning my colleagues against using it the moment I saw it for the first time about 50 years ago.

•

dare944 4 days ago

strncpy appears somewhere around the Unix v7 time frame, however only as function in the standard C library. It is not used in the v7 kernel itself.

•

JdeBP 4 days ago

In Unix Seventh Edition, ls and others read directory entries with fread() and parsed the struct direct themselves in application-mode code. The C library and application mode matter, here.

On the gripping hand, there is no strncpy in the Spinellis 7th Edition source code; 4.2BSD was using strncpy() inside readdir() in 1982, though.

•

jibal 4 days ago

Here it is in 2.9BSD: https://www.tuhs.org/cgi-bin/utree.pl?file=2.9BSD/usr/src/uc...

•

jibal 4 days ago

https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/lo...

The code, but not the function, occurred in multiple places in the V6 kernel and userland.

•

dare944 3 days ago

> The code, but not the function, occurred in multiple places in the V6 kernel and userland.

Yep. The code is essential given the design of the direct structure, which harkens back to the fixed-width data fields of punched cards.

•

jibal 3 days ago

It doesn't have anything to do with punch cards, it's to pack as many elements as possible into the very small amounts of memory on PDP-11s. A 16 byte directory structure (which divides evenly into a disk sector) with a 2 byte inode number and an up to 14 byte name is a memory-optimized structure, and memory optimization drove everything on UNIX. (I've been programming since 1965, used punch cards for a decade, was a UNIX V6/V7/PWB kernel and userland developer for a different decade).

•

dare944 3 days ago

I said "harkens". Of course that structure never appeared on a punched card, and was designed with the unix block size in mind.

•

jibal 2 days ago

I know the guy said "harkens", and I pointed out that it's wrong (but maybe he doesn't understand what it means, and thinks it's the same as "reminds me of"). Fixed size fields both precede and follow punch cards ... they are still used today in every struct.

•

jibal 4 days ago

The code for strncpy was in the UNIX kernel since at least V6. It was eventually added to the C library under the name strncpy. Sometimes those entries were processed in userland, e.g., by fsck. The utility of strncpy is noted in the C89 rationale (FWIW I was once a member of X3J11, the C89 standards committee):

"strncpy was initially introduced into the C library to deal with fixed-length name fields in structures such as directory entries. Such fields are not used in the same way as strings: the trailing null is unnecessary for a maximum-length field, and setting trailing bytes for shorter names to null assures efficient field-wise comparisons. strncpy is not by origin a "bounded strcpy," and the Committee has preferred to recognize existing practice rather than alter the function to better suit it to such use."

And I just found this comment from John Mashey (I never met John but he and I both worked under Ted Dolotta, John at Bell Labs and me at ISC in Santa Monica):

https://softwareengineering.stackexchange.com/questions/4380...

"I can answer definitively, since I wrote the originals ~1977, having moved from BTL Piscataway to Murray Hill. They were first named str*n, but were later renamed strn*, as there was some system in BTL that needed first 6 letters of external names to be unique.

I was working on kernel & user code that supported rudimentary per-process accounting, which started with someone else, but needed extensions due to big increase in UNIX systems in computer centers, who wanted more performance analysis. I.e. this was supported by commands like accton(1), acctcms(1),acctcom(1), acctmerge(1) (all in UNIX/TS 1.0, Nov 1978, which was ~Research V7 with first steps of PWB/UNIX influence. Think of that as 1.0, then PWB/UNIX 2.0, then UNIX System III...

The records described in acct(5) held the last 8 characters of the command pathname,truncated if necessary and thus possibly not null-terminated. I found multiple instances of inline code to manipulate these, which seemed a bad idea, so I wrote the str*n functions and replaced the inline code, and also used them in the various commands.

I also thought it was a good idea for better code safety.:-) Sigh."

•

PlunderBunny 4 days ago

I worked on a Win32 app that used space-padded strings, i.e. the destination string was padded with spaces, but there was still a null on the last byte. You had to use special versions of the string functions for length, copy etc.

I’m not sure why this was - the source base was so old it might have had its origins in Pascal struct behaviour.

•

jkfkfkj 4 days ago

It can perhaps be due to the string originating from a sql database ”char” field, I.e. not ”varchar”. Char fields in databases are space padded.

•

egorfine 4 days ago

I think this behavior has its roots in COBOL, not pascal.

•

kps 4 days ago

Which has its roots in punch cards, where pre-computer hardware operated on fixed-sized fields and an unpunched column is equivalent to a space.

•

bebe83939 4 days ago

Perhaps to prevent reallocation when the string size changes? Or to align with CPU cache lines?

•

senfiaj 4 days ago

I wonder, why not use a string buffer paired with its length? For example, maybe use struct that has char pointer, and 2 ints (occupied length + total buffer length). Almost like c++'s std::string. This null terminator thing really sucks, it's potentially insecure and often unperformant.
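
A minimal sketch of that layout, assuming `size_t` rather than `int` for the two lengths. `strbuf`, `strbuf_append` and `strbuf_free` are hypothetical names chosen for this example, not a standard API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string: pointer + used length + capacity. */
struct strbuf {
    char *data;   /* kept NUL-terminated so it can be passed to C APIs */
    size_t len;   /* bytes in use, excluding the terminator */
    size_t cap;   /* bytes allocated, including room for the terminator */
};

/* Appends n bytes from src; returns false on overflow or allocation failure. */
bool strbuf_append(struct strbuf *sb, const char *src, size_t n)
{
    if (n > SIZE_MAX - sb->len - 1)
        return false;
    size_t need = sb->len + n + 1;
    if (need > sb->cap) {
        size_t new_cap = sb->cap ? sb->cap : 16;
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) {    /* doubling would overflow */
                new_cap = need;
                break;
            }
            new_cap *= 2;
        }
        char *p = realloc(sb->data, new_cap);
        if (p == NULL)
            return false;
        sb->data = p;
        sb->cap = new_cap;
    }
    memcpy(sb->data + sb->len, src, n);      /* no scan for the old end */
    sb->len += n;
    sb->data[sb->len] = '\0';
    return true;
}

void strbuf_free(struct strbuf *sb)
{
    free(sb->data);
    sb->data = NULL;
    sb->len = sb->cap = 0;
}

void strbuf_example(void)
{
    struct strbuf sb = { NULL, 0, 0 };
    assert(strbuf_append(&sb, "hello", 5));
    assert(strbuf_append(&sb, ", world", 7));
    assert(sb.len == 12 && strcmp(sb.data, "hello, world") == 0);
    strbuf_free(&sb);
}
```

Appending never needs to search for the end of the string, and the terminator is kept only for compatibility with functions that expect one.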

•

WalterBright 4 days ago

Wonder no longer!

https://dlang.org/spec/arrays.html#dynamic-arrays

and

https://dlang.org/spec/arrays.html#strings

and for C:

https://digitalmars.com/articles/C-biggest-mistake.html

•

maxlybbert 4 days ago

It's definitely possible. And common, at least in some projects. The only real drawback is that sloppiness will lead to multiple slightly different nonstandard string types in the same project.

•

GalaxyNova 4 days ago

Yes I have seen it happen a few times with `strlen` being called in a loop silently causing O(N) to turn to O(N^2)
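
The pattern usually looks something like the sketch below (function names are made up). Because the loop body writes to the string, the compiler generally cannot prove the length is unchanged, so `strlen` really is re-run on every iteration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Quadratic: strlen(s) rescans the whole string on each iteration. */
void upcase_slow(char *s)
{
    for (size_t i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* Linear: measure once and remember the length. */
void upcase_fast(char *s)
{
    size_t n = strlen(s);
    for (size_t i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

void upcase_example(void)
{
    char a[] = "hello";
    char b[] = "hello";
    upcase_slow(a);
    upcase_fast(b);
    assert(strcmp(a, b) == 0);   /* same result, very different cost on long input */
}
```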

•

jkrejcha 4 days ago

Reminds me of an article[1] that described how he cut GTA Online loading times by 70% because strlen was getting called for effectively every character in a string

[1]: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

•

sweetjuly 4 days ago

I remember reading this blog post when it was first published, but the subsequent updates are better than I would've ever expected this to turn out. Worth checking it out again if you've seen it before :)

•

senfiaj 4 days ago

Exactly, you can't write clean concise code when working with C strings. Almost every C string manipulation requires cognitive load: "Is the buffer size enough (including null terminator), should I reallocate it?", "I need to have the offset from the last concat, to make next concats performant", "Umm, should I put null terminator at i or i + 1?"... It really sucks, it's akin to death by thousands of cuts.

•

sgerenser 4 days ago

Joel Spolsky coined the term “Shlemiel the Painter’s Algorithm” for this type of thing back in 2001: https://www.joelonsoftware.com/2001/12/11/back-to-basics/

•

bnolsen 4 days ago

That's called a fat pointer. Null-terminated C strings account for the majority of memory errors out there.

•

none_to_remain 4 days ago

The size overhead of that is 2*sizeof(int) while the overhead of null termination is sizeof(char). If I remember the standard right, the former is worse by at least sizeof(char), and usually more in practice. This used to matter, sometimes still does.

•

kgeist 4 days ago

I would assume the difference is mostly negligible in practice due to the allocator rounding up the allocated memory size at least by the word size anyway (for alignment and simpler bookkeeping). You can also use variable-length encoding in the header to use 1 byte for most cases, similar to how UTF-8 does it: if the most significant bit is not set, we assume a 7-bit encoding, which can represent string lengths up to 127 using 1 byte, which is probably 99% of strings.

•

senfiaj 4 days ago

Well, not saying to always use it, but if the string size is big enough, the overhead of 2 ints becomes relatively vanishing. For generic dynamically sized strings it probably has more advantages than disadvantages. But in any case, sure, if every single byte matters or some structure requires specific memory layout, then fine. I just don't think these things are the majority of use cases. Keep in mind that the cached lengths can increase performance, since you don't have to recalculate string lengths.

•

lelanthran 4 days ago

> Well, not saying to always use it, but if the string size is big enough, the overhead of 2 ints becomes relatively vanishing.

In that case, the fix is not to change C strings (breaking a lot of existing code), but to introduce a stringbuilder type.

•

senfiaj 4 days ago

You can still use null terminator for compatibility (std::string does use this), but just not rely on that in your own code.

•

ekaryotic 4 days ago

I am a terrible hobby c programmer that doesn't understand pointers but surely a symmetric approach doesn't have the overhead or the bug. that is to say that if the language was designed to work in single bit pairs of a string character in conjunction of a string length character assuming a fail safe design of one dummy string character then if a bug happens in the code then there's no overflow because the length can never be shorter than the character.

•

chiph 4 days ago

Pascal did/does this, but eventually someone wants a string longer than the size portion can handle. Or wants the number of characters not the number of bytes.

•

jerf 4 days ago

I wasn't a programmer in these days, so I don't know if there's some other major concern that would kill this, but I sometimes wonder about whether we could have / should have used variable-length integers. That is, something like, 0-127 byte strings get their length prefixed, 128 - 16383 get two bytes of prefix, and the probably-rare 16384 - 2097151 strings would end up with three, though proportionally by that point it's hardly anything. Or you could use the UTF-8 mechanism for packing the bytes, though that costs more and probably doesn't get anything we'd care about in the 1980s or 1990s.

It's a bit of extra code, yes. Not necessarily all that much, but some. On average it is only slightly more expensive than null termination, and considered as a proportion of the size of the strings themselves it's hardly anything. It's probably better than the strings getting hard-limited to 0-255, though, which was quite frequently a user-visible quirk.
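
For a sense of how much code that is: the ranges above (1 byte up to 127, 2 bytes up to 16383, 3 bytes up to 2097151) fall out of storing 7 bits per byte with the high bit meaning "more bytes follow", the same idea as unsigned LEB128. A sketch, with hypothetical function names and an assumed 32-bit maximum length:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes len as 7 bits per byte, low bits first. A set high bit means
   another byte follows. Returns bytes written (at most 5 for 32 bits). */
size_t varlen_encode(uint32_t len, unsigned char out[5])
{
    size_t n = 0;
    while (len >= 0x80) {
        out[n++] = (unsigned char)((len & 0x7f) | 0x80);
        len >>= 7;
    }
    out[n++] = (unsigned char)len;
    return n;
}

/* Decodes a length prefix. Returns bytes consumed, or 0 if the input is
   truncated or longer than 5 bytes. */
size_t varlen_decode(const unsigned char *in, size_t avail, uint32_t *len)
{
    uint32_t v = 0;
    for (size_t i = 0; i < avail && i < 5; i++) {
        v |= (uint32_t)(in[i] & 0x7f) << (7 * i);
        if (!(in[i] & 0x80)) {
            *len = v;
            return i + 1;
        }
    }
    return 0;
}

void varlen_example(void)
{
    unsigned char buf[5];
    uint32_t v = 0;

    assert(varlen_encode(127, buf) == 1);
    assert(varlen_encode(128, buf) == 2);
    assert(varlen_encode(16384, buf) == 3);
    assert(varlen_decode(buf, 3, &v) == 3 && v == 16384);
}
```

The decoder does not reject over-long encodings or bits that do not fit in 32 bits; a real format would have to decide how to handle those.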

•

Parodper 3 days ago

You could start the encoding with two bytes, so that if the most significant bit of the first byte is 0, the length is that byte plus another. That gives you 32KiB strings with just a byte more. Short strings might suffer, but I think the overhead is reasonable.

The next level (110x xxxx) would give you 8MiB strings, which are going to be fine for most things.

•

senfiaj 3 days ago

32-bit int isn't too much overhead. Just 3 additional bytes. I bet it's almost always better than c style strings. In the vast majority of situations the price isn't that bad, considering you make strings much more secure and potentially faster in string manipulations.

•

jerf 2 days ago

32-bit is so little overhead that we don't blink at adding 64 to our strings nowadays, because of the benefits we get from alignment.

But remember the first Macintosh shipped with 128KB of RAM, 131,072 bytes. Three more bytes per string hurts a lot more there...

... although, that said, even in that era given the number of errors that null-terminated strings caused, even completely ignoring security, I do still wonder if at least defaulting to 2 bytes of length and doing something special for strings over 64K still wouldn't have been the right tradeoff, even in the case of short strings. Today we mostly focus on security, but null-terminated strings also caused a lot of just plain-old bugs. But so did 1-byte length strings... it's way too easy to run out of 256 characters even on those old systems.

•

Johanx64 4 days ago

Dude, every sane language out there does this. Just generally with 4byte prefix. Null-terminated stuff has always been backwards compat stuff.

Pascal strings - historically and why people even remember this being an issue - were up to 255 chars in size, if not you had to use different string type.

You might still want raw pointers for all sorts of low level stuff, but you almost never want to have null-terminated strings for anything but back-compat, one of the worst things ever, even on memory constrained systems.

•

pjmlp 4 days ago

And then anyone that isn't stuck in 1976 will use open arrays.

•

MBCook 4 days ago

A lot of them are strings coming from or going to user space right? So wouldn’t you have to do constant conversions?

•

D-Coder 4 days ago

Note that "360 Patches" is 360 uses of strncpy that have been removed, not necessarily bugs.

•

dpark 4 days ago

I would imagine 360 patches removed way more than 360 uses of strncpy. But yeah, it’s not a given that each of these patches addressed a bug. (Also not a given that there were only 360 bugs fixed.)

•

rswail 4 days ago

In all the comments in this thread it's interesting how people confuse:

* NUL: An ASCII non-printing character with the byte value of 0

* NULL: A pointer that does not point to usable memory with the value that compiles in C to be equal to ((void *) 0).

•

layer8 4 days ago

NUL was always just an abbreviation for null: https://www.rfc-editor.org/rfc/rfc20.html#section-4

I don’t think anyone in this thread is confusing the null character with the null pointer.

•

rswail 4 days ago

I've seen a lot of confusion, where people are talking about checking for a NULL at the end of a list of pointers which is very different to a NUL at the end of a string.

Yes it was an abbreviation in ASCII, as are all the non-printable first 32 codes.
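
To spell out the difference in code, a small sketch with made-up function names: one loop stops at a NULL pointer, the other at a NUL byte.

```c
#include <assert.h>
#include <stddef.h>

/* Counts entries in a NULL-terminated array of pointers (argv style). */
size_t count_ptrs(const char *const *list)
{
    size_t n = 0;
    while (list[n] != NULL)    /* NULL: a null pointer ends the list */
        n++;
    return n;
}

/* Counts bytes in a NUL-terminated string (what strlen does). */
size_t count_chars(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')       /* NUL: a zero byte ends the string */
        n++;
    return n;
}

void nul_vs_null_example(void)
{
    const char *const args[] = { "ls", "-la", NULL };
    assert(count_ptrs(args) == 2);
    assert(count_chars(args[1]) == 3);
}
```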

•

stcg 4 days ago

I wonder what is the difficulty in rewriting strncpy uses that makes it take six years? Was it widespread? Or was it more of a long going effort, where it was only changed if there were some changes in the same file? Or is there some other thing that makes it difficult?

•

kstenerud 4 days ago

strncpy is 99.999% of the time NOT the correct function to call, so this is a huge win.

It's just a shame that such a confusing name was chosen for such a niche use case (fixed width records that require null padding).

•

DerSaidin 4 days ago

strtomem_pad seems redundant with memcpy_and_pad, and also it requires the preprocessor: https://github.com/torvalds/linux/blob/1a3746ccbb0a97bed3c06...

I was curious: Why have it, instead of just using memcpy_and_pad?

AI's answer (paraphrased) was:

* Avoids possible bugs from manually writing sizeof(dest)

* Enforces the __nonstring attribute

* Signals "I am converting an actual C-string into a fixed-width legacy memory field" vs. copying binary data and padding it.

Interesting to learn about the __nonstring attribute:

https://github.com/torvalds/linux/blob/1a3746ccbb0a97bed3c06... https://github.com/search?q=repo%3Atorvalds%2Flinux+__nonstr...

•

GTP 3 days ago

I always thought that strncpy was the safe alternative to strcpy. Now that I think of it, I'm unsure if the NUL terminator is counted into strncpy's size or not, which would be a likely source of errors. But, could someone explain better what the problems were? And also, is having to pick the right function from the list of given alternatives really much better?

•

GabrielTFS 3 days ago

The issue with strncpy is that it doesn't actually necessarily terminate - in fact in any case where the source is larger than the destination it will just leave it unterminated (like, it will copy the last character it can from the source instead of terminating the destination string with a NUL)
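
A short sketch of that failure mode, plus the manual workaround people usually reach for. `is_terminated` and `copy_truncating` are hypothetical helpers written for this example.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if buf[0..size) contains a NUL, i.e. holds a valid C string. */
int is_terminated(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Common workaround: copy at most size - 1 bytes, then terminate by hand.
   Truncation is silent, so the caller cannot tell that data was lost. */
void copy_truncating(char *dst, size_t size, const char *src)
{
    if (size == 0)
        return;
    strncpy(dst, src, size - 1);
    dst[size - 1] = '\0';
}

void strncpy_example(void)
{
    char dst[4];

    strncpy(dst, "abcdef", sizeof dst);       /* copies 'a','b','c','d' */
    assert(!is_terminated(dst, sizeof dst));  /* no room was left for a NUL */

    copy_truncating(dst, sizeof dst, "abcdef");
    assert(strcmp(dst, "abc") == 0);
}
```

So the size argument is the total number of bytes written, and a terminator only appears if the source is shorter than that.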

•

rurban 3 days ago

No, the safe alternatives end with _s. They do check matching buffer sizes, and enforce zero-termination. Unfortunately WG14 hates them also, because Microsoft. Microsoft did indeed break some of them, but you can use better alternatives, like my safeclib.

•

GTP 3 days ago

> No, the safe alternatives end with _s.

Could you please elaborate on this? Both `man strncpy_s` and `man strcpy_s` didn't return any manual page on my Linux system.

•

rurban 3 days ago

Search more. It's in safeclib and in the C standard

•

devsda 4 days ago

Did anybody else misunderstand the title as removing strncpy func for linux users ?

For a moment, I misunderstood it as (g)libc removing strncpy and was worried about the trouble it's going to cause.

•

henrypoydar 3 days ago

No code is faster than no code.

•

naturalmovement 4 days ago

A reminder that we've had strlcpy[1] for ~ 30 years but it was never accepted into the Linux world because of typical petty open source bullshit. This is why we can't have nice things.

[1] https://man.openbsd.org/strlcpy

•

ericbarrett 4 days ago

The Linux kernel had strlcpy over 20 years ago. It was removed in favor of strscpy because the latter was judged a better interface. Here's a 2022 article: https://lwn.net/Articles/905777/
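
As I understand the two interfaces, strlcpy returns the full length of the source (so it has to read all of it), while strscpy returns the number of characters copied or -E2BIG when it had to truncate. A userland sketch of the second style, assuming a POSIX system for `ssize_t` and `E2BIG`; `sketch_strscpy` is a made-up name and this is not the kernel's implementation.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t (POSIX) */

/* Sketch of a strscpy-style copy: always terminates dst when size > 0,
   never reads more than size bytes of src, and reports truncation. */
ssize_t sketch_strscpy(char *dst, const char *src, size_t size)
{
    if (size == 0)
        return -E2BIG;
    for (size_t i = 0; i < size; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return (ssize_t)i;       /* characters copied, excluding the NUL */
    }
    dst[size - 1] = '\0';            /* ran out of room: truncate */
    return -E2BIG;
}

void strscpy_example(void)
{
    char dst[4];

    assert(sketch_strscpy(dst, "abc", sizeof dst) == 3);
    assert(sketch_strscpy(dst, "abcdef", sizeof dst) == -E2BIG);
    assert(strcmp(dst, "abc") == 0); /* truncated but still terminated */
}
```

A caller can check for a negative result instead of comparing a returned length against the buffer size.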

•

avadodin 4 days ago

Returning an error is better but you're using ssize_t which is a tradeoff.

The race conditions appear to be a result of the Linux kernel implementation but UNIX style syscalls introduce these races by default. It is not an inherent flaw of the API or even the implementation Linux was using.

The only useable C string API has always been memcpy anyways.

•

BoingBoomTschak 4 days ago

Actually, glibc 2.38 has it.

•

naturalmovement 4 days ago

Wow it only took them 26 years to import a 30 line C function, a third of which is comments?

I should have sent them a nice fruit basket to commemorate the occasion.

•

pjmlp 4 days ago

Now let's put that work into money, to assess what the cost impact of replacing strncpy() was.

•

qarl 4 days ago

Am I going to be the first person to ask this after five hours? Really?

Wouldn't this work be extremely easy to implement with an LLM coder?

•

qustio 4 days ago

I don't think the bottleneck was that it took six years to Ctrl-F strncpy and type in new code for each file.

•

qarl2 3 days ago

I looked at the git history. The first three years were wasted waiting for a human to pick it up. He then very slowly submitted patches over 2 years.

Claude Code doesn't need to be interested to work.

•

qarl 4 days ago

It's a shame you're misrepresenting what is actually going on.

In another comment here I explained that I have run a test: asking Claude Code to add a substantial feature to 270 different C programs.

Despite your beliefs - it went extremely well.

•

qustio 4 days ago

Huh, are you confusing me with someone else? I don't doubt Claude Code did that, I do the same for refactors all the time.

But xscreensaver theme tweaks for personal use have a much lower standard for quality control, regression testing, side effects, etc than a kernel used by billions of devices with thousands of interconnected drivers and subsystems.

Not to mention the coordination problem to get every maintainer on board and patches approved for each specific area when working on a project of that scale, even for a relatively narrow change.

Claude Code doesn't really help with that so I don't see why the expectation would be a significant speed up (and doing it all in a single patch would definitely be rejected).

•

qarl 4 days ago

Yes, I understand the difference in rigor.

I refuse to believe the six year delay here was getting people to test a patch.

Which, actually, Claude Code will also do quite well.

•

qustio 4 days ago

Not sure why you'd refuse to believe that when a single, simple patch in Linux can take months to make it into a kernel release. Here we're looking at 300 patches scattered throughout a kernel with millions of LoC. That's going to translate to a lot of mailing list back and forth even if every change was accepted on the first try without a fuss.

•

qarl 4 days ago

The lag there is not due to the review time. How many maintainers were involved? 300? Because I'm still finding it hard to understand how the work of 300 people handling 300 commits cannot be parallelized into months (per your own stat.)

•

qustio 4 days ago

To be clear my original statement was that the bottleneck was most likely not mechanical code changes (where CC would have the most direct speedup) but everything else involved in the process (testing, discussion/approval, inclination towards caution, deliberately narrowly scoped changes, etc).

Not that the Linux kernel approval procedures couldn't be streamlined, work couldn't be parallelized, or anything else like that, which would be a different discussion entirely.

You stated that Claude Code could have significantly sped up the process, so the burden of evidence here should be on how specifically these patches would have benefited/time saved from using LLMs. Hand wavingly saying "LLMs = faster" is too vague/broad of a claim without providing any evidence (and also unfalsifiable).

•

qarl 4 days ago

Right.

And what I'm saying is I refuse to believe the Linux kernel approval procedures are that inefficient. Therefore, your belief "bottleneck was most likely not mechanical code changes" is most likely incorrect.

It would be interesting to get the actual answer to this question.

EDIT: Substantially changing your argument after posting isn't nice. But to answer your charge - no - I never made that claim.

•

qustio 4 days ago

Sorry, I didn't feel like this thread needed to be dragged out any longer since it's going in circles at this point and expanded my comment, but I didn't realize you had already replied.

•

qarl 4 days ago

Well - not really a circle. I keep saying the same thing over and over and you keep throwing arguments at it, unsuccessfully.

•

NetMageSCW 4 days ago

I don’t think it’s the arguments that are unsuccessful.

•

qarl 4 days ago

If you have an actual criticism of my argument you'll need to be more clear to be understood.

•

lelanthran 4 days ago

> In another comment here I explained that I have run a test: asking Claude Code to add a substantial feature to 270 different C programs.

That's a different scenario, though.

Would Claude have performed adequately if it had to add a specific feature to 270 programs buried in a set of 270m programs, each of which may or may not have a dependency on one or more of the others, with virtually unbounded results to test?

In terms of tokens alone, that would have been cost-prohibitive. But let's assume that you had the money to do this: it still might not even be possible.

You're confusing "I have these 270 independent programs and want to make this change to all of them" with "I have these 270m lines of code, of which only 270 need to be changed".

•

qarl2 3 days ago

HackerNews is now censoring my replies. I did the math - all of these patches would have cost around $100.

Let's see if they'll let this account through.

•

lelanthran 3 days ago

It's like you are not even reading what is being said to you. You can't find the downstream effects using grep!

You can find the "strncpy"s with grep, but you cannot find all the downstream effects of those changes, especially if something downstream is relying on the broken behaviour!
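
One concrete example of such a dependency, as a sketch rather than anything taken from the actual patches: strncpy zero-pads the rest of the destination, and a replacement that only NUL-terminates does not. If the whole fixed-size field is later copied somewhere else, the stale tail goes with it. `NAME_LEN` and both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 8

/* strncpy: writes all NAME_LEN bytes, padding with '\0' after a short source. */
static void fill_with_strncpy(char field[NAME_LEN], const char *src)
{
    assert(field != NULL && src != NULL);
    strncpy(field, src, NAME_LEN);
}

/* Hypothetical naive replacement: always terminates, but bytes after the
   terminator keep whatever was in the buffer before. */
static void fill_terminated_only(char field[NAME_LEN], const char *src)
{
    assert(field != NULL && src != NULL);
    size_t n = 0;
    while (n < NAME_LEN - 1 && src[n] != '\0')
        n++;
    memcpy(field, src, n);
    field[n] = '\0';
}
```

Neither call site looks wrong in isolation, which is why grep does not help; as far as I know the kernel has strscpy_pad() for the cases where the padding matters.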

•

qarl2 3 days ago

Right. I am not claiming Claude Code creates perfect software. I am refuting your claim that using it would be cost prohibitive.

I took the 10 most difficult patches from the git history - the ones that took the most back-and-forth to fix. I asked Claude to write them. Would you like to see the work?

If you believe a human performs better at finding downstream effects - you need to prove that. I see no reason why it should be true.

•

lelanthran 3 days ago

> If you believe a human performs better at finding downstream effects

Once again, you are not reading what is being said - no one made that claim!

No claim was made in fact: it was a refutation. Specifically, the refutation is "this is why it took so many years".

•

qarl2 3 days ago

> no one made that claim!

You did not literally make that claim but your cost argument hinges on it.

Without it, then Claude does about the same as a human and only costs $100.

Apparently I'm reading your comments more thoroughly than you are.

•

Animats 4 days ago

This is a job for Claude!

What happens if you turn a job like that over to Claude Code? A mess? Good results? Code bloat? Worth trying on existing C programs.

•

qarl 4 days ago

I ran a test where I added a "light" mode to xscreensaver: unique changes to over 270 different C programs.

It mostly did an amazing job in a short period of time.

EDIT: Of course I get downvoted for saying this. HN isn't interested in reality any more.

•

krupan 4 days ago

These stories with no details and no proof are not interesting or helpful.

•

qarl 4 days ago

My apologies. I did not want to be downvoted for promoting my own material.

https://github.com/qarl/qscreensaver

•

Animats 4 days ago

A diff.[1]

[1] https://github.com/qarl/qscreensaver/commit/2843caba683495d5...

•

ninjin 4 days ago

> Of course I get downvoted for saying this. HN isn't interested in reality any more.

I suspect that rather many of us are simply just tired of Claude and friends getting shoehorned into any conversation about programming at this point. It is about as fun as the Rust Brigade entering any discussion about C. It adds nothing new to the discussion and it is frankly tiring since we pretty much at any time have a handful of conversations on the front page already covering "AI" topics anyway (counting four at the time of writing this).

•

qarl 4 days ago

Well - except in this conversation it's incredibly relevant. It took six years to do this work when the work is likely mostly mechanical and could have been done much more quickly and safely with an automated system.

I thought automation would be interesting to HN - given the context and the fact it was not used.

•

krupan 4 days ago

An LLM is not a mechanical automated system. A deterministic search and replace would be a mechanical automated system. Clearly it wasn't that simple of a problem though.

•

qarl 4 days ago

> An LLM is not a mechanical automated system.

Pretty sure that's exactly what LLMs in coding harnesses are.

•

krupan 3 days ago

Still not deterministic

•

qarl2 2 days ago

You understand that computer programs are deterministic unless someone explicitly injects non-determinism in them, right?

Even when they implement LLMs.

•

qarl2 21 hours ago

To the people downvoting me - I'm sorry, but it's true.

I know ChatGPT seems like it's non-deterministic - but that's just a user preference.

The only real issue is if you have many people on the same server, the GPU contention can be non-deterministic in rare cases.

LLMs can be deterministic if you need them to be - it's just that most people prefer the human-like interface.

•

Animats 4 days ago

Right. Six years of work on a grunt job. That's what automation is for.

•

larodi 4 days ago

Wonder when someone is going to be brave and fork the Linux kernel and try to ffwd it with automatic programming.

•

fragmede 4 days ago

why would you start there instead of creating something from scratch? if you can port drivers just as easily, meaning you don't especially give a shit about the hardware you're running on in the first place, why even deal with linux? The battle-tested LRU cache system?

•

convolvatron 4 days ago

I've seen several workalike kernels in various stages of completion. at least one of them was able to run some pretty substantial applications (Postgres, nginx, that kind of thing), and that is still I guess around 250kloc. but it only really has drivers to support hypervisor devices.

unfortunately as time goes by, the linux api surface gets larger and more convoluted. so there's going to be some coverage you're just never going to get.

but in the abstract, definitely. linux is so bloated at this point that it's not clear that it can ever be 'made safe'.

•

larodi 3 days ago

Some if not most coverage will be off, indeed, but then the important stuff can get you lots of benefits. This makes sense even today for selectively patching the kernel. I’m sure many people have been awed by the complexity of it, while now it is doable, albeit with agents…

•

literalAardvark 4 days ago

It's much easier to use something with all the edge cases already handled as a starting point.

•

larodi 3 days ago

Well in reality if you want a custom OS perhaps scavenging parts is a thing to do indeed. I just speculated whether Linux can be further improved by automatic programming and still keep the handmade parts.

•

bigstrat2003 3 days ago

Going fast is only good if you're going the right direction. And with LLMs, more often than not you're going off a cliff.