
The Joomla .ini files need to be saved as UTF-8.

After editing them, I'm not sure whether the files are still UTF-8.

Is there a Linux command, such as `file`, or a short series of commands, that will tell me whether a file is indeed UTF-8 or not?

Edward
  • You cannot tell the encoding of a file. You can only make a smart guess. You might mostly guess right, but sometimes guesses fail. `file` is an example of a program doing smart guesses. – Marco Sep 24 '13 at 21:17
  • @Marco: It is possible to verify whether it is valid UTF-8 or not, however. There are *some* encodings which can mistakenly pass as valid UTF-8, but it almost never happens with ISO-8859-* or Windows-125* encodings/charsets. – u1686_grawity Sep 24 '13 at 21:40

4 Answers


You can determine the file encoding with the following command:

file -bi filename
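For example, on a UTF-8 file the output typically looks like the line below (the filename here is only an illustration, and the exact output can vary between systems):

$ file -bi configuration.ini
text/plain; charset=utf-8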
Rik
  • This answer should be accepted. The explanation for the -bi options is in the [man file](http://linuxcommand.org/man_pages/file1.html). – Jérôme Jan 13 '16 at 14:04
  • Is it supposed to work on macOS as well? I get `regular file` on a file I thought was UTF-8. – nicolas Apr 24 '16 at 15:49
  • @nicolas For MacOS you could try `file -I filename` (-I is a capital i). – Rik Apr 24 '16 at 16:07
  • @Rik I can confirm – nicolas Apr 24 '16 at 16:08
  • Does this read the whole file? – ctrl-alt-delor Mar 30 '18 at 15:17
  • @ctrl-alt-delor What do you mean, read the whole file? It shouldn't have to, as the file encoding is probably placed in the header of the file. – kojow7 Apr 20 '18 at 15:33
  • @kojow7 UTF-8 has no header. Pure ASCII (7-bit only) is indistinguishable from UTF-8 (that is the point of it; a header would cause all sorts of problems). So if you have a file that is ASCII for the first MB and then has a single multi-byte UTF-8 character, you will not know unless you read the whole file. – ctrl-alt-delor Apr 21 '18 at 16:41
  • @kojow7 because if you only read a few bytes (3 are enough for the UTF-8 BOM) then the rest of the file can be, say, a PNG and thus not a valid UTF-8 file. – Alexis Wilke Dec 28 '18 at 10:11
  • This should not be accepted as the answer. The `file` command does not do that; it reads only part of the file and uses magic numbers to take a best guess. On occasion `file` can and will give you the incorrect answer. To verify that a file passes an encoding such as ascii, iso-8859-1, utf-8 or whatever, a good solution is to use the `iconv` command. – Tim Mar 09 '20 at 14:45
  • @Tim And it isn't the accepted answer (yet). And according to [wiki](https://en.m.wikipedia.org/wiki/File_(command)) the entire file **is** read, **if** necessary. `5. the entire file is considered and file is to use context-sensitive tests`. Haven't tested it myself though. – Rik Mar 09 '20 at 18:54
  • I have tested it, and it can and does fail. – Tim Mar 10 '20 at 17:18
  • I should add, to verify if a file passes an encoding scheme, be it UTF8, 8859-1, ASCII, or whatever, it is *always* necessary to read the whole file, every last byte; otherwise you are guessing. – Tim Mar 10 '20 at 20:30
  • I prefer `file -i *` to find out which files are UTF-8 encoded, and which are not. But that's just a slight variant on how to use the `file` command … :-) – Henke Mar 11 '23 at 17:37
  • If `file -bi myfile.txt` says `text/plain; charset=us-ascii`, that is UTF-8 too, since ASCII is a subset of UTF-8. – Gabriel Staples Aug 04 '23 at 02:28

There is: use the `isutf8` command from the moreutils package.

Source: How can you tell if a file is UTF-8 encoded or not?
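A minimal usage sketch (the filename is only an illustration): `isutf8` exits with status 0 when its input is valid UTF-8 and reports the position of the first invalid byte sequence otherwise, so it can be used directly in a shell test:

if isutf8 configuration.ini; then
    echo "Valid UTF-8"
else
    echo "NOT valid UTF-8"
fi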


DavidPostill
  • @davidpostill I'm curious, is it bad practice to cite the author in the reference? – Pablo Olmos de Aguilera C. Aug 28 '16 at 20:26
  • No. However, it is *good* practice to make the link say where it leads me. Assume I'm reading only the blue text. After the edit, I can tell why and when I should click that. Before, I could not. (It wasn't me who made the edit but I'm like 94% sure that this is what it was about.) – Hermann Döppes Dec 31 '18 at 00:00
  • Nice, and it works well with `find -type f -exec isutf8 {} +`, because it also quotes the filename. (And using `find ... -exec ... +` is also fast.) – Tomasz Gandor Mar 22 '19 at 13:28

Do not use the `file` command. It does not inspect the whole file, and it basically guesses. It sometimes gives incorrect answers.

You can verify whether a file passes as valid UTF-8 like this:

$ iconv -f utf8 <filename> -t utf8 -o /dev/null

A return code of zero means the file is valid UTF-8; a non-zero return code means it is not.

It is not possible to know for certain which encoding scheme a file was written with, as some encoding schemes overlap. Determining that would require metadata to be embedded in the file, and even then you would be placing trust in whoever generated that file rather than validating it yourself... and you should always validate it yourself.
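A minimal sketch of using that return code in a script (the `$FILE` variable is illustrative; `2>/dev/null` just hides iconv's diagnostic output):

if iconv -f utf8 -t utf8 -o /dev/null "$FILE" 2>/dev/null; then
    echo "Valid utf8 : $FILE"
else
    echo "NOT valid utf8: $FILE"
fi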

Tim

Yet another way is to use `recode`, which will exit with an error if it tries to decode UTF-8 and encounters invalid characters.

if recode utf8/..UCS < "$FILE" >/dev/null 2>&1; then
    echo "Valid utf8 : $FILE"
else
    echo "NOT valid utf8: $FILE"
fi
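As a usage sketch, the same test can be wrapped in a loop to check a whole set of files at once; the glob pattern below is only an illustration of where Joomla language .ini files might live, not part of the original answer:

for FILE in language/en-GB/*.ini; do
    if recode utf8/..UCS < "$FILE" >/dev/null 2>&1; then
        echo "Valid utf8 : $FILE"
    else
        echo "NOT valid utf8: $FILE"
    fi
done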
mivk