92

I have many plain text files that were encoded in various charsets.

I want to convert them all to UTF-8, but before running iconv, I need to know their original encoding. Most browsers have an Auto Detect option for encodings, but I can't check these text files one by one because there are too many.

Only once I know the original encoding can I convert the texts with iconv -f DETECTED_CHARSET -t utf-8.

Is there any utility to detect the encoding of plain text files? It does NOT have to be 100% perfect; I don't mind if 100 files out of 1,000,000 are misconverted.

Lenik

11 Answers

80

Try the chardet Python module, which is available on PyPI:

pip install chardet

Then run chardetect myfile.txt.

Chardet is based on the detection code used by Mozilla, so it should give reasonable results, provided that the input text is long enough for statistical analysis. Do read the project documentation.

As mentioned in the comments, it is quite slow, but some distributions also ship the original C++ version, as @Xavier found in https://superuser.com/a/609056. There is also a Java version somewhere.
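
If you want to go straight from detection to conversion, a rough batch sketch along these lines may work (assumptions: chardetect prints one `FILE: ENCODING with confidence N` line per file, the file names contain no spaces, and the converted copies go into a separate directory):

mkdir -p converted
for f in *.txt; do
  enc=$(chardetect "$f" | awk '{print $2}')       # second field is the detected charset
  iconv -f "$enc" -t UTF-8 "$f" > "converted/$f"  # write a converted copy
done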

u1686_grawity
  • Yes, and it's already packaged as `python-chardet` in the Ubuntu universe repo. – Lenik Jun 25 '11 at 06:21
  • Even when the guess isn't perfect, `chardet` still gives its most likely guess, like `./a.txt: GB2312 (confidence: 0.99)`. Compare Enca, which just failed and reported 'Unrecognized encoding'. Sadly, though, `chardet` runs very slowly. – Lenik Jun 25 '11 at 06:48
  • @谢继雷: Have it run overnight or something like that. Charset detection *is* a [complicated process](http://chardet.feedparser.org/docs/faq.html). You could also try the Java-based jChardet, or ... the original *chardet* is [part of Mozilla](http://www-archive.mozilla.org/projects/intl/chardet.html), but only C++ source is available, no command-line tool. – u1686_grawity Jun 25 '11 at 12:13
  • Regarding speed: running `chardet <(head -c4000 filename.txt)` was much faster and equally successful for my use case. (In case it's not clear, this bash syntax sends only the first 4000 bytes to chardet.) – ndemou Dec 26 '15 at 19:32
  • @ndemou I have `chardet==3.0.4`, and the command-line tool's actual executable name is `chardetect`, not `chardet`. – Devy Mar 26 '18 at 14:26
  • It's not very precise, unfortunately. I tested it with several subtitle files in different languages, but it's not reliable. It gets many encodings wrong... – gignu Mar 14 '21 at 23:03
41

I would use this simple command:

encoding=$(file -bi myfile.txt)

Or if you want just the actual character set (like utf-8):

encoding=$(file -b --mime-encoding myfile.txt)
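
Building on that, a rough sketch for converting everything that is not already UTF-8 or plain ASCII (note that file may also report values like unknown-8bit or binary, which would need separate handling):

for f in *.txt; do
  enc=$(file -b --mime-encoding "$f")
  case "$enc" in
    utf-8|us-ascii) ;;                                # already fine, skip
    *) iconv -f "$enc" -t UTF-8 "$f" > "$f.utf8" ;;   # write a converted copy
  esac
done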
Kankaristo
  • Unfortunately, `file` only detects encodings with specific properties, such as UTF-8 or UTF-16. The rest -- oldish ISO 8859 or their MS-DOS and Windows counterparts -- are listed as "unknown-8bit" or something similar, even for files which `chardet` detects with 99% confidence. – u1686_grawity Oct 28 '11 at 19:09
  • file showed me iso-8859-1 – cweiske Mar 30 '12 at 07:22
  • What if the extension is lying? – james.garriss Oct 03 '14 at 13:24
  • @james.garriss: A file's extension has nothing to do with its (text) content encoding. – MestreLion Nov 28 '14 at 12:18
  • In case some noob like me gets confused as to how to retrieve the value stored in "encoding", simply type "echo $encoding". Alternatively, you can use the part inside the parentheses by itself, without storing any variable: "file -bi myfile.txt". – gignu Mar 14 '21 at 22:30
34

On Debian-based Linux, the uchardet package (Debian / Ubuntu) provides a command-line tool. See the package description below:

 universal charset detection library - cli utility
 .
 uchardet is a C language binding of the original C++ implementation
 of the universal charset detection library by Mozilla.
 .
 uchardet is a encoding detector library, which takes a sequence of
 bytes in an unknown character encoding without any additional
 information, and attempts to determine the encoding of the text.
 .
 The original code of universalchardet is available at
 http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet
 .
 Techniques used by universalchardet are described at
 http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
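
Usage is simple: uchardet FILE prints the detected charset name, so it can be fed directly to iconv, for example (the output file name is just an illustration):

iconv -f "$(uchardet myfile.txt)" -t UTF-8 myfile.txt > myfile.utf8.txt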
Xavier
  • Thanks! From the project's homepage it wasn't obvious to me that there was a CLI included. It's also available on OS X when installing `uchardet` via Homebrew. – Stefan Schmidt Jul 06 '13 at 14:47
  • I was a little confused at first because an ISO 8859-1 document was falsely identified as Windows-1252, but in the printable range Windows-1252 is a superset of ISO 8859-1, so conversion with `iconv` works fine. – Stefan Schmidt Jul 06 '13 at 14:56
  • Hi @StefanSchmidt, *charset detection* is based on *corpus* statistics. In countries like Brazil the old official standard encoding (before UTF-8) was ISO-8859-1... But it is a "Microsoft country", so the most popular encoding in any statistics is Microsoft's distortion of ISO (against competition), that is Windows-1252. It is a side effect of technological colonialism in a country: weak sovereignty, weak official standards, and strong Microsoft lobbying. Even now (2021 is [1 decade after universal adoption](https://en.wikipedia.org/wiki/UTF-8#Adoption)!) it is dominant in TXT and CSV files. – Peter Krauss Apr 09 '21 at 10:42
17

For Linux, there is enca and for Solaris you can use auto_ef.
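
Enca can also detect and convert in one step via its -x option; a sketch for Chinese text, using the same language option as the comment below (note that -x rewrites the files in place, so work on copies):

enca -L zh -x UTF-8 *.txt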

cularis
  • Enca seems too strict for me: `enca -d -L zh ./a.txt` failed with the message `./a.txt: Unrecognized encoding. Failure reason: No clear winner.` As @grawity mentioned, `chardet` is more lax; however, it's still too slow. – Lenik Jun 25 '11 at 07:06
  • Enca completely fails the "actually does something" test. – Michael Wolf Mar 01 '12 at 18:59
  • uchardet failed (detected CP1252 instead of the actual CP1250), but enca worked fine. (A single example, hard to generalize...) – Palo Nov 16 '15 at 20:52
3

Those who regularly use Emacs might find the following useful (it lets you inspect and manually validate the transformation).

Moreover, I often find that Emacs's charset auto-detection is much more effective than other charset auto-detection tools (such as chardet).

;; List the files to convert (edit these paths).
(setq paths (mapcar 'file-truename '(
 "path/to/file1"
 "path/to/file2"
 "path/to/file3"
)))

;; Visit each file (Emacs auto-detects its current coding system)
;; and mark it to be saved as UTF-8 with Unix line endings.
(dolist (path paths)
  (find-file path)
  (set-buffer-file-coding-system 'utf-8-unix))

Then, a simple call to Emacs with this script as an argument (see the "-l" option) does the job.
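
For example, assuming the snippet above is saved as reencode.el (the name is arbitrary); Emacs then opens each file with its detected coding system so you can inspect it and save it as UTF-8. For an unattended run you would also add (save-buffer) at the end of the loop and pass --batch.

emacs -l reencode.el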

2

Mozilla has a nice codebase for auto-detection in web pages:
http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/

Detailed description of the algorithm:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

1

If, like me, you're unhappy with chardet because it doesn't properly recognize some encodings, try out detect-file-encoding-and-language. I've found it to be a lot more reliable than chardet.

1. Make sure you have Node.js and NPM installed. You can install them like this:

$ sudo apt install nodejs npm

2. Install detect-file-encoding-and-language:

$ npm install -g detect-file-encoding-and-language

3. Now you can use it to detect the encoding:

$ dfeal "/home/user name/Documents/subtitle file.srt"

It'll return an object with the detected encoding, language, and a confidence score.

Falaen
1

UTFCast is worth a try. Didn't work for me (maybe because my files are terrible) but it looks good.

http://www.addictivetips.com/windows-tips/how-to-batch-convert-text-files-to-utf-8-encoding/

Sameer Alibhai
1

Getting back to chardet (Python 2), this call might be enough:

python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())' < file
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}

Though it's far from perfect....

echo "öasd" | iconv -t ISO-8859-1 | python -c 'import chardet,sys; print chardet.detect(sys.stdin.read())'
{'confidence': 0.5, 'encoding': 'windows-1252'}
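
For Python 3, an equivalent sketch needs print() as a function and the binary stdin buffer, since chardet expects bytes:

python3 -c 'import chardet,sys; print(chardet.detect(sys.stdin.buffer.read()))' < file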
estani
1

isutf8 (from the moreutils package) did the job
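
Since isutf8 reports validity mainly through its exit status (see the comment below), a quick sketch for listing the files that still need converting might be:

for f in *.txt; do
  isutf8 "$f" > /dev/null 2>&1 || echo "$f is not valid UTF-8"
done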

Ronan
  • How? This answer isn't really helpful. –  Oct 28 '15 at 19:02
  • It's not exactly what was asked, but it is a useful tool. If the file is valid UTF-8, the exit status is zero. If the file is not valid UTF-8, or there is some error, the exit status is non-zero. – ton Feb 16 '16 at 17:34
0

Also, in case file -i gives you unknown, you can use the following PHP commands to guess the charset.

Specifying the encoding list explicitly:

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), 'UTF-8, ASCII, JIS, EUC-JP, SJIS, iso-8859-1') . PHP_EOL;"

More accurate "mb_list_encodings":

php -r "echo 'probably : ' . mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings()) . PHP_EOL;"

In the first example above, I pass a list of encodings (the detection order) that might match. For a more accurate result, you can pass all possible encodings via mb_list_encodings().
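
Once the charset has been guessed, PHP can also do the conversion itself with mb_convert_encoding(); a sketch (the output file name is just an example, and it assumes mb_detect_encoding() finds a match):

php -r "file_put_contents('myfile.utf8.txt', mb_convert_encoding(file_get_contents('myfile.txt'), 'UTF-8', mb_detect_encoding(file_get_contents('myfile.txt'), mb_list_encodings())));"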

Note: the mb_* functions require php-mbstring:

apt-get install php-mbstring 

See this answer: https://stackoverflow.com/a/57010566/3382822