Suppose you have an array of chars (ending, of course, with the null character), and just after that, in the immediately following position in memory, you want to store 0 as an unsigned int. How does the computer differentiate between these two?
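For concreteness, here is a minimal sketch of the layout I mean (the buffer and its contents are hypothetical, just for illustration):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "hi" occupies 3 bytes: 'h', 'i', '\0'. On a typical platform an
       unsigned int is 4 more bytes, all 0x00 here. */
    unsigned char memory[3 + sizeof(unsigned int)];
    memcpy(memory, "hi", 3);                 /* chars plus the terminator */
    unsigned int zero = 0;
    memcpy(memory + 3, &zero, sizeof zero);  /* the unsigned int right after */

    /* Dumping the raw bytes shows that nothing marks where the string's
       terminator ends and the integer's zero bytes begin. */
    for (size_t i = 0; i < sizeof memory; i++)
        printf("%02x ", memory[i]);
    printf("\n");  /* typically prints: 68 69 00 00 00 00 00 */
    return 0;
}
```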
- You're asking about typical computers, about which the answers are completely right. However, there used to be _some_ architectures which used [tagged memory](https://en.wikipedia.org/wiki/Tagged_architecture) to distinguish between data types. – u1686_grawity Oct 01 '18 at 12:09
- The same way the computer cannot differentiate a 4-byte float from a 4-byte integer (representing a very different number). – Hagen von Eitzen Oct 01 '18 at 14:47
- While ending a string with 0x00 is common, there are languages which use length-prefixed strings. The first byte or two contain the number of bytes in the string, so a 0x00 at the end is not needed. I seem to recall Pascal and BASIC doing that; perhaps COBOL as well. – lit Oct 02 '18 at 13:57
- @lit Also header formats in many communication protocols: "Hello, I am this kind of message and I am this many bytes long." Often because you need to store complex data types inside, and then null termination becomes much more troublesome to parse. – mathreadler Oct 03 '18 at 18:11
- @lit: Most variants of Pascal and BASIC, yes, and PL/I and Ada -- and Java, since substring sharing was dropped in 7u6, effectively uses the array length prefix -- but COBOL only sort of: you can _read_ data from `pic X occurs m to n depending on v` (and the count can be anywhere, not just immediately before), but _storing_ it is more complicated. – dave_thompson_085 Oct 03 '18 at 22:05
6 Answers
It doesn't.
The string terminator is a byte containing all 0 bits.
The unsigned int is two or four bytes (depending on your environment) each containing all 0 bits.
The two items are stored at different addresses. Your compiled code performs operations suitable for strings on the former location, and operations suitable for unsigned binary numbers on the latter. (Unless you have either a bug in your code, or some dangerously clever code!)
But all of these bytes look the same to the CPU. Data in memory (in most currently-common instruction set architectures) doesn't have any type associated with it. That's an abstraction that exists only in the source code and means something only to the compiler.
An example, added in an edit: it is perfectly possible, even common, to perform arithmetic on the bytes that make up a string. If you have a string of ASCII characters stored in 8-bit bytes, you can convert the letters in the string between upper and lower case by adding or subtracting 32 (decimal). Or, if you are translating to another character code, you can use their values as indices into an array whose elements provide the equivalent bit coding in the other code.
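A minimal sketch of that case-flipping arithmetic in C (the sample string is an assumption for illustration):

```c
#include <stdio.h>

int main(void) {
    char s[] = "Hello";
    /* In ASCII, 'a' - 'A' == 32, so flipping case is plain byte arithmetic. */
    for (char *p = s; *p != '\0'; p++) {
        if (*p >= 'a' && *p <= 'z')
            *p -= 32;          /* lower -> upper */
        else if (*p >= 'A' && *p <= 'Z')
            *p += 32;          /* upper -> lower */
    }
    printf("%s\n", s);          /* prints: hELLO */
    return 0;
}
```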
To the CPU, the chars are really extra-short integers (eight bits each instead of 16, 32, or 64). To us humans their values happen to be associated with readable characters, but the CPU has no idea of that. It doesn't know anything about the "C" convention of "a null byte ends a string", either (and, as many have noted in other answers and comments, there are programming environments in which that convention isn't used at all).
To be sure, there are some instructions in x86/x64 that tend to be used a lot with strings - the REP prefix, for example - but you can just as well use them on an array of integers, if they achieve the desired result.
- That's why developers have to be careful with strings. If you have, say, 100 consecutive bytes, you can fit at most 99 one-byte characters in there, plus the terminator in the last byte. If you write a 100-byte string in there, the program won't be able to figure out that the string ends there and will continue reading consecutive bytes until a coincidental zero byte. If the string is more than 100 bytes long, it will overwrite some adjacent data. High-level programming languages (Java, C#, JS etc.) take care of this themselves, but in low-level languages such as C, C++ and assembly it's the dev's responsibility. – gronostaj Oct 01 '18 at 10:30
- @gronostaj Your comment is slightly confusing: unlike in C, C++ strings also take care of this automatically. C++ is also not generally classified as a low-level language (and even C sometimes isn't). – Konrad Rudolph Oct 01 '18 at 13:46
- There are (old) CPU architectures that have type markers on data values, so dereferencing an integer as a pointer will give an exception. – Simon Richter Oct 01 '18 at 14:17
- @SimonRichter That sounds interesting. Can you mention some examples? – Jamie Hanrahan Oct 01 '18 at 15:14
- Re "The string terminator is a byte containing all 0 bits": that's not universally true. It depends on the language: the '\0' terminator is inherited from C. In Fortran, though, a string is essentially a structure with a separate length field. Newer languages seem to use basically one or the other - or sometimes both, as in C++, where you can have normal C "array of char ending in '\0'" strings, or C++ string classes that contain hidden data like the length. Other languages may use something different from either... – jamesqf Oct 01 '18 at 17:07
- @JamieHanrahan The IA64 processor [has a bit called NaT](https://blogs.msdn.microsoft.com/oldnewthing/20040119-00/?p=41003) (or "Not a Thing") that can throw an exception if a value has it set. – ErikF Oct 01 '18 at 18:35
- @JamieHanrahan It's called a [tagged architecture](https://en.wikipedia.org/wiki/Tagged_architecture); the article lists the Lisp machine and one of the Burroughs mainframes. – Simon Richter Oct 01 '18 at 19:25
- @KonradRudolph "Automatic" doesn't mean "foolproof", certainly not in C++. – rackandboneman Oct 01 '18 at 20:06
- @jamesqf It's true in the context of the OQ, which is what I was answering. But yes, I'm familiar with environments that use descriptors. E.g. I worked for many years around VMS, in which the common calling standard used descriptors for things like strings: address, current length, and allocated length. The RTL made it work transparently across all languages. Many times I have wished that C had copied that... – Jamie Hanrahan Oct 01 '18 at 21:08
- @KonradRudolph: C++ can use implicit-length C strings. Some of the constructors for `std::string` take a `const char*` arg. If you completely filled a previously-zeroed buffer with a `read()` or `recv()` system call and then foolishly used the `basic_string(const CharT* s)` constructor instead of the pointer/length or `basic_string(InputIt first, InputIt last)` constructors using the length return value from `read`, you'd have a problem. Many C++ programs/libraries don't live in a C++ utopia where they only interact with idiomatic C++ functions/data using `std::string`. – Peter Cordes Oct 01 '18 at 21:33
- @KonradRudolph The low-level/high-level distinction is blurry. Compared to asm, even C is high-level. Nowadays, when the majority of software is built on web tech or other garbage-collected platforms, anything that lets you accidentally overwrite an adjacent memory region is arguably lower-level than average. That's what I meant when writing that comment. And in 10 years we may consider Java low-level just because of all the boilerplate it often requires. Loosely related fun fact: https://superuser.com/questions/638675/why-does-ram-have-to-be-volatile/638694#comment798457_638694 – gronostaj Oct 02 '18 at 09:37
- @gronostaj I totally agree about the blurry distinction (and hinted at that in the original comment; at any point I'd agree that C, C++ and even sometimes Java are hopelessly low-level compared to what's possible). But the point wasn't about that; it was about manual zero-termination, and this is only necessary in C, not in C++ (nor other languages, unless you deal with legacy C APIs). You *can* accidentally overwrite buffers in some other languages (including C++), but grouping C++ with C in this regard seems contrived. – Konrad Rudolph Oct 02 '18 at 09:43
- @SimonRichter There is also at least one modern architecture that uses a form of tagging, particularly the Power CPU under the IBM i OS. I guess it is mostly the OS, or more accurately the technology-independent machine interface, that does it, but a pointer cannot be anything but a pointer, and nothing but a pointer can be used as a valid pointer. But I think these protections go away if you are running AIX or Linux on a Power CPU. – jmarkmurphy Oct 02 '18 at 11:23
- "*That's an abstraction that […] means something only to the compiler.*" - and the programmer, I hope. – Bergi Oct 02 '18 at 19:35
- @ErikF: Only registers have the 65th bit, not memory. So it doesn't help with tagging stored values. – rici Oct 03 '18 at 05:41
In short, there is no difference (except that an int is 2 or 4 bytes wide and a char just 1).
The thing is that all modern libraries either use the null-terminator technique or store the length of the string. In both cases the program knows it has reached the end of the string when it either reads a null character or has read as many characters as the stored size says.
Issues start when the null terminator is missing or the length is wrong, because then the program starts reading from memory it isn't supposed to.
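A rough sketch of the two conventions in C; the `pstring` layout below is an illustrative assumption, not a standard type:

```c
#include <stdio.h>

/* Convention 1: null-terminated -- the end is marked by a 0 byte. */
const char c_string[] = { 'h', 'i', '\0' };

/* Convention 2: length-prefixed -- the length is stored up front,
   so no terminator is needed. */
struct pstring {
    unsigned char len;
    char data[255];
};

int main(void) {
    /* Walk the C string until the 0 byte. */
    for (const char *p = c_string; *p != '\0'; p++)
        putchar(*p);
    putchar('\n');

    /* Walk the length-prefixed string for exactly len bytes. */
    struct pstring ps = { 2, "hi" };
    for (unsigned char i = 0; i < ps.len; i++)
        putchar(ps.data[i]);
    putchar('\n');
    return 0;
}
```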
- Oh, there is a difference in short - actually, `short` is kind of notorious for being a very machine-dependent data type :) – rackandboneman Oct 01 '18 at 20:07
There is no difference. Machine code (assembler) does not have variable types; instead, the type of the data is determined by the instruction.
A better example would be int and float: if you have 4 bytes in memory, there is no information about whether it's an int or a float (or something else entirely). However, there are two different instructions for integer addition and float addition, so if the integer-addition instruction is used on the data, then it's an integer, and vice versa.
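Here is a small sketch of that in C, using `memcpy` to view the same four bytes both ways (assuming, as on most current platforms, that `float` and `unsigned int` are both 4 bytes):

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 1.0f;
    unsigned int u;

    /* Copy the same 4 bytes into an unsigned int. */
    memcpy(&u, &f, sizeof u);

    /* Identical bits, very different values depending on interpretation. */
    printf("as float: %f\n", f);   /* 1.000000 */
    printf("as int:   %u\n", u);   /* 1065353216 (0x3F800000, IEEE 754) */
    return 0;
}
```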
Same with strings: if you have code that, say, looks at an address and counts bytes until it reaches a \0 byte, you can think of it as a function computing a string's length.
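That byte-counting function is essentially the standard `strlen`; a hand-rolled sketch:

```c
#include <stdio.h>
#include <stddef.h>

/* Counts bytes starting at s until a 0 byte is found. Nothing here
   "knows" it is looking at text -- it just compares bytes to zero. */
size_t my_strlen(const char *s) {
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

int main(void) {
    printf("%zu\n", my_strlen("hello"));  /* prints: 5 */
    return 0;
}
```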
Of course, programming like this would be complete madness, which is why we have higher-level languages that compile to machine code, and almost no one programs in assembler directly.
The scientific single-word answer would be: metadata.
The metadata tells the computer whether some data at a certain location is an int, a string, program code or whatever. This metadata can be part of the program code (as Jamie Hanrahan mentioned) or it can be explicitly stored somewhere.
Modern CPUs can often distinguish between memory regions assigned to program code and data regions, for example via the [NX bit](https://en.wikipedia.org/wiki/NX_bit). Some exotic hardware can also distinguish between strings and numbers, yes. But the usual case is that the software takes care of this issue, either through implicit metadata (in the code) or explicit metadata; object-oriented VMs often store the metadata (type/class information) as part of the data, i.e. the object.
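One common way to store such explicit metadata by hand in C is a tagged union; the type and field names below are illustrative assumptions:

```c
#include <stdio.h>

/* Explicit metadata: a tag stored alongside the raw value says
   how the bytes in the union should be interpreted. */
enum kind { KIND_INT, KIND_STRING };

struct value {
    enum kind tag;        /* the metadata */
    union {
        unsigned int i;
        const char *s;
    } as;                 /* the actual data */
};

static void print_value(const struct value *v) {
    if (v->tag == KIND_INT)
        printf("int: %u\n", v->as.i);
    else
        printf("string: %s\n", v->as.s);
}

int main(void) {
    struct value a = { KIND_INT,    { .i = 0 } };
    struct value b = { KIND_STRING, { .s = "hello" } };
    print_value(&a);   /* int: 0 */
    print_value(&b);   /* string: hello */
    return 0;
}
```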
An advantage of not distinguishing between different kinds of data is that some operations become very simple. The I/O subsystem does not necessarily need to know whether the data it just reads from or writes to disk is actually program code, human readable text or numbers. It's all just bits which get transported through the machine. Let the program code deal with the fancy typing issues.
It doesn't. You do it!
Or your compiler/interpreter.
If the instructions tell the computer to add the 0 as a number, it'll do it. If they tell the computer to stop printing data after reaching the 0, treating it as a '\0' char, it'll do that too.
Languages have mechanisms to ensure data is treated the right way. In C, variables have types, like int, float and char, and the compiler generates the right instructions for each data type. But C also lets you cast data from one type to another; even a pointer can be used as a number. To the computer it's all bits like any other.
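A small sketch of such casts in C (the sample data is an assumption for illustration):

```c
#include <stdio.h>
#include <inttypes.h>

int main(void) {
    char text[] = "A";

    /* The same byte, viewed two ways: as a character and as a number. */
    printf("as char: %c\n", text[0]);       /* A  */
    printf("as int:  %d\n", (int)text[0]);  /* 65 in ASCII */

    /* Even a pointer is just a number to the machine. */
    uintptr_t addr = (uintptr_t)&text[0];
    printf("pointer as number: %" PRIuPTR "\n", addr);
    return 0;
}
```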