How do I grep for Arabic characters with diacritical marks?

Question

I have large text files in Arabic with Tashkil (diacritical marks), and I'm trying to find lines that contain a specific vocalized (mashkula) pattern with َ ً ُ ٌ ّ ْ ٍ . I've tried the following grep command:

cat file.txt | grep "اهلا"

This returns nothing until I insert Tashkil marks:

cat file.txt | grep "أهْلاً"

Then I get the correct output:

أهْلاً

I also tried

grep -P "[ُ\ ّ\ َ\ ً\ ِ\ ٍ\ ٌ\ ْ\ \~]|[اهلا]" file.txt

And this returns all matching characters in different patterns:

أهْلاً أ ... هْ.. لًا أنْتَ لَيْلاً ..

How can I match Arabic diacritical marks with grep? Is it possible to remove the Tashkil marks from the text before using grep? My OS is Ubuntu 18.04.

UPDATE: At the moment, I remove the Tashkil marks from the text with: sed "s/[ُ ّ َ ً ِ ٍ ٌ ْ]//g", and then I can grep for what I want. But with this approach, the sed command also removes the spaces from the whole text!

I wonder if you can do it by replacing each character with its collation class? See for example [How to do an accent insensitive grep?](https://stackoverflow.com/a/27980272/4440445) – steeldriver, Apr 13 '22 at 20:58
Unfortunately, this didn't solve the issue. The pattern without diacritical marks still doesn't match, with either "grep -i" or perl. – s3idani, Apr 13 '22 at 21:23
Try the following, based on https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters/ : grep -P -n "أهْلاً" file.txt – Grishon, Apr 13 '22 at 20:35
The pattern with diacritical marks (Tashkil) already matches. The same pattern without Tashkil returns nothing with grep -i! – s3idani, Apr 13 '22 at 20:43

Pablo Bianchi · Accepted Answer · 2022-04-26T19:04:21.347

5

Assuming a UTF-8 source and locale, you can remove the U+064B-U+065B range using Perl:

$ echo "أَهْلاً وَ سَهْلاً" | perl -CSAD -pe 's/[\x{064B}-\x{065B}]//g'

أهلا و سهلا

Source: This works because vowel diacritics in Arabic are combining characters, meaning that a simple search and remove of these should be enough.
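As a rough sketch, the Perl one-liner above can be wrapped in a small shell function so that it acts as a filter in front of grep. The function name `strip_tashkil` and the file name `file.txt` are placeholders chosen for illustration, and the range is the same U+064B-U+065B assumed above, which may not cover every mark in a given text.

```shell
# Hypothetical helper: delete the combining marks U+064B..U+065B
# from standard input, or from the files given as arguments.
# -CSD makes Perl treat the standard streams and opened files as UTF-8.
strip_tashkil() {
    perl -CSD -pe 's/[\x{064B}-\x{065B}]//g' "$@"
}

# Example: search the stripped text for an unvocalized word.
# strip_tashkil file.txt | grep -n 'أهلا'
```

Note that grep only sees the stripped copy here, so the lines it prints no longer carry their diacritics.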

GNU sed also seems to work (note that based on those answers there are other diacritics):

$ echo "أَهْلاً وَ سَهْلاً" | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g'

أهلا و سهلا

uconv might also work.

Check the comments area of this and s3idani's post for more info.


edited Apr 26 '22 at 19:04

answered Apr 16 '22 at 21:04

Pablo Bianchi


Nice answer :) ... that worked directly like so `sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g' file | grep -n أهلا` and in a function containing this `grep -n "$1" <(sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g' "$2")` where `$1` is the first argument: word and `$2` is the second argument: filename like so `argrep أهلا file`... I wonder if such function can be improved to return the original matching lines with diacritics in them by for example reading lines by numbers that `grep -n` returns from the original file again? – Raffa Apr 16 '22 at 22:58
@Pablo Bianchi The perl example requires the hex code points of all 11 Arabic diacritical characters, one by one: x064B x064C x064D x064E x064F x0650 x0651 x0652 x0653 x0654 x0655 ً ٌ ٍ َ َُ ّ ْ ˜ ٓ ٕ ٔ – s3idani Apr 16 '22 at 23:24
@Pablo Bianchi And yes, maybe I should accept your answer, since the `grep` command can't match diacritics directly, except with your `sed` and `perl` examples. – s3idani Apr 16 '22 at 23:42
Well done! ... both methods now give the same expected result ... they are both working and are identical. – Raffa Apr 17 '22 at 17:17
@Raffa I don't know anything about Arabic, but after [reading this](https://unicode-table.com/en/blocks/arabic/), shouldn't everything up to U+065F be removed, including _Other combining marks_? – Pablo Bianchi Apr 18 '22 at 05:54
I like the way you do your research ... but not all characters in your linked document are used in the Arabic language (We only use 28 letters and around 11 diacritics) ... Diacritics, however, are used lightly in everyday writing [example newspaper article](https://www.alriyadh.com/1863861) ... Arabic script is [the 3rd most used script](https://en.wikipedia.org/wiki/List_of_writing_systems#List_of_writing_systems_by_adoption) ... It is [widely used by non-Arabs](https://en.wikipedia.org/wiki/Arabic_script) and sometimes with a few modifications/additions including letters/diacritics. ... – Raffa Apr 18 '22 at 18:16
... Thus your answer will help computer users in all those Languages/countries ... Arabic alphabet has undergone many changes throughout its [history](https://en.wikipedia.org/wiki/History_of_the_Arabic_alphabet) ... IMO, the set of diacritics in your answer "especially the perl one" are enough for general use everyday Arabic script for Arabic language speakers ... That said, I think if you chose to enrich your answer with some of the information in my comments, that would make it Global and address a much wider audience. This is only a suggestion but your answer is very useful as it is. – Raffa Apr 18 '22 at 18:16

s3idani · Answer 2 · 2022-04-25T21:23:49.640

1

Based on Pablo Bianchi's answer, here's the workaround:

Text: أَهْلاً وَ سَهْلاً

Command: cat Text | sed -e 's/َ//g;s/ُ//g;s/ِ//g;s/ّ//g;s/ً//g;s/ٌ//g;s/ٍ//g;s/ْ//g;s/أ/ا/g;s/آ/ا/g;s/إ/ا/g' | grep -o "اهلا"

Output: اهلا
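The letters أ, إ and آ are single precomposed code points (U+0623, U+0625, U+0622), so deleting combining marks alone leaves them unchanged. That is why the command above replaces them with bare alef (U+0627). A rough equivalent that avoids typing non-ASCII characters, assuming the same U+064B-U+065B range for the marks, might look like this; the name `normalize_arabic` is hypothetical.

```shell
# Hypothetical helper: strip the marks U+064B..U+065B, then fold
# alef with madda (U+0622), alef with hamza above (U+0623) and
# alef with hamza below (U+0625) into bare alef (U+0627).
normalize_arabic() {
    perl -CSD -pe '
        s/[\x{064B}-\x{065B}]//g;
        s/[\x{0622}\x{0623}\x{0625}]/\x{0627}/g;
    ' "$@"
}

# Example (Text is the file name used above):
# normalize_arabic Text | grep -o 'اهلا'
```

Folding the alef forms changes the spelling of the text, so it is best treated as a search convenience and not as a way to clean the source files.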

edited Apr 25 '22 at 21:23

answered Apr 17 '22 at 00:17

s3idani


[I wouldn't use `tr`](https://unix.stackexchange.com/a/228570/209677) with non-ASCII characters. – Pablo Bianchi Apr 17 '22 at 01:26
@Pablo Bianchi I replaced `tr` command with full `sed` syntax and now all diacritical marks are removed. – s3idani Apr 25 '22 at 21:32
I don't know why you replaced those three characters at the end, or why removing vowel diacritics (combining characters) isn't enough. – Pablo Bianchi Apr 25 '22 at 22:02
@Pablo Bianchi In Arabic, `أ إ آ ؤ ء ئ` are called "Hamza", not Tashkil marks. – s3idani Apr 25 '22 at 22:43
I assume removing [maddah and hamza characters (U+0653 U+0654 U+0655)](https://unicode-table.com/en/blocks/arabic/) is not enough (BTW, that is three symbols, not six; is that list wrong?). Is [this wrong](https://stackoverflow.com/a/25563250/4970442), and are not all vowel diacritics in Arabic combining characters? Maybe that's why there are more ranges [here](https://stackoverflow.com/a/56328226/4970442). – Pablo Bianchi Apr 26 '22 at 01:52
Im talking about arabic Tashkil simple text from ISO 8859-6, here is a Quran example: `أَتَى أَمْرُ اللَّهِ فَلَا تَسْتَعْجِلُوهُ سُبْحَانَهُ وَتَعَالَى عَمَّا يُشْرِكُو` in this case `sed` command responds to my need; `اتى امر الله فلا تستعجلوه سبحانه وتعالى عما يشركو` and yes, in other Quranic annotation it does not. – s3idani Apr 26 '22 at 18:51
Here's other example with `أ إ آ` in `إِنْ أَحْسَنْتُمْ أَحْسَنْتُمْ لِأَنْفُسِكُمْ وَإِنْ أَسَأْتُمْ فَلَهَا فَإِذَا جَاءَ وَعْدُ الْآخِرَةِ لِيَسُوءُوا وُجُوهَكُمْ وَلِيَدْخُلُوا الْمَسْجِدَ كَمَا دَخَلُوهُ أَوَّلَ مَرَّةٍ وَلِيُتَبِّرُوا مَا عَلَوْا تَتْبِيرًا` and this is `sed` output: `ان احسنتم احسنتم لانفسكم وان اساتم فلها فاذا جاء وعد الاخرة ليسوءوا وجوهكم وليدخلوا المسجد كما دخلوه اول مرة وليتبروا ما علوا تتبيرا – s3idani Apr 26 '22 at 19:21
Sorry, but the examples don't help me much. I only wonder whether there are other ranges of Unicode chars I should remove with the Perl command (which has the advantage that it's not necessary to input non-ASCII chars on the terminal), and whether it's absolutely necessary to replace certain combining chars instead of deleting them. – Pablo Bianchi Apr 26 '22 at 22:18
If you mean the combining hamza `U+0655` ` ٕ` `U+0654` ` ٔ` and maddah `U+0653` ` ٓ`, I don't think it's possible to replace them with other chars, but you can remove them (and all diacritics), as in handwriting; the text is still readable for Arabic-speaking people. Full Tashkil/diacritics are used strictly in Quranic readings. – s3idani Apr 26 '22 at 23:42
