Arabic script in Unicode
from Wikipedia

Many scripts in Unicode, such as Arabic, have special orthographic rules that require certain combinations of letterforms to be combined into special ligature forms. In English, the common ampersand (&) developed from a ligature in which the handwritten Latin letters e and t (spelling et, Latin for and) were combined.[1] The rules governing ligature formation in Arabic can be quite complex, requiring special script-shaping technologies such as the Arabic Calligraphic Engine by Thomas Milo's DecoType.[2]

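As of Unicode 17.0, the Arabic script is contained in the following blocks:[3]

  • Arabic (U+0600–U+06FF)
  • Arabic Supplement (U+0750–U+077F)
  • Arabic Extended-B (U+0870–U+089F)
  • Arabic Extended-A (U+08A0–U+08FF)
  • Arabic Presentation Forms-A (U+FB50–U+FDFF)
  • Arabic Presentation Forms-B (U+FE70–U+FEFF)
  • Rumi Numeral Symbols (U+10E60–U+10E7F)
  • Arabic Extended-C (U+10EC0–U+10EFF)
  • Indic Siyaq Numbers (U+1EC70–U+1ECBF)
  • Ottoman Siyaq Numbers (U+1ED00–U+1ED4F)
  • Arabic Mathematical Alphabetic Symbols (U+1EE00–U+1EEFF)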

The basic Arabic range encodes the standard letters, the most common diacritics, and the Arabic-Indic digits, but does not encode contextual forms; the range U+0621–U+0652 is based directly on ISO 8859-6. The Arabic Supplement range encodes letter variants mostly used for writing African (non-Arabic) languages. The Arabic Extended-B and Arabic Extended-A ranges encode additional Qur'anic annotations and letter variants used for various non-Arabic languages. The Arabic Presentation Forms-A range encodes contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The Arabic Presentation Forms-B range encodes spacing forms of Arabic diacritics and more contextual letter forms. The presentation forms are present only for compatibility with older standards, and are not currently needed for coding text.[4] The Arabic Mathematical Alphabetic Symbols block encodes characters used in Arabic mathematical expressions. The Indic Siyaq Numbers block contains a specialized subset of Arabic script that was used for accounting in India under the Mughal Empire from the 17th century through the middle of the 20th century.[5][6] The Ottoman Siyaq Numbers block contains a specialized subset of Arabic script, also known as Siyakat numbers, used for accounting in Ottoman Turkish documents.[6]
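As a rough illustration of how these blocks partition the code space, the sketch below maps a character to its Arabic-script block. The ranges are the ones shown in the code charts later in this article; the function name `arabic_block` is hypothetical and not part of any library.

```python
from bisect import bisect_right
from typing import Optional

# (first, last, name) for the eleven blocks charted in this article,
# sorted by first code point so a binary search can find the candidate.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10E60, 0x10E7F, "Rumi Numeral Symbols"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1ED00, 0x1ED4F, "Ottoman Siyaq Numbers"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]
_STARTS = [first for first, _, _ in ARABIC_BLOCKS]


def arabic_block(ch: str) -> Optional[str]:
    """Return the block name for a single character, or None if it is outside these blocks."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1      # last block starting at or before cp
    if i >= 0 and cp <= ARABIC_BLOCKS[i][1]:
        return ARABIC_BLOCKS[i][2]
    return None


if __name__ == "__main__":
    for sample in ("\u0628", "\u0750", "\uFE91", "A"):
        print(f"U+{ord(sample):04X}", arabic_block(sample))
```

Block membership says nothing about whether a code point is assigned; several of these blocks contain unassigned positions.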

Contextual forms

Below is a demonstration for the basic alphabet used in Modern Standard Arabic illustrating how Arabic letters are expected to appear in different contexts. Codepoints listed as contextual forms "should not be used in general interchange".[4] Unicode has other methods of encoding the difference if necessary, such as the zero-width joiner.

General Unicode   Isolated   Final (End)   Medial (Middle)   Initial (Beginning)   Name
0627 ا   FE8D   FE8E   -   -   ʾalif
0628 ب   FE8F   FE90   FE92   FE91   bāʾ
062A ت   FE95   FE96   FE98   FE97   tāʾ
062B ث   FE99   FE9A   FE9C   FE9B   ṯāʾ
062C ج   FE9D   FE9E   FEA0   FE9F   ǧīm
062D ح   FEA1   FEA2   FEA4   FEA3   ḥāʾ
062E خ   FEA5   FEA6   FEA8   FEA7   ḫāʾ
062F د   FEA9   FEAA   -   -   dāl
0630 ذ   FEAB   FEAC   -   -   ḏāl
0631 ر   FEAD   FEAE   -   -   rāʾ
0632 ز   FEAF   FEB0   -   -   zayn/zāy
0633 س   FEB1   FEB2   FEB4   FEB3   sīn
0634 ش   FEB5   FEB6   FEB8   FEB7   šīn
0635 ص   FEB9   FEBA   FEBC   FEBB   ṣād
0636 ض   FEBD   FEBE   FEC0   FEBF   ḍād
0637 ط   FEC1   FEC2   FEC4   FEC3   ṭāʾ
0638 ظ   FEC5   FEC6   FEC8   FEC7   ẓāʾ
0639 ع   FEC9   FECA   FECC   FECB   ʿayn
063A غ   FECD   FECE   FED0   FECF   ġayn
0641 ف   FED1   FED2   FED4   FED3   fāʾ
0642 ق   FED5   FED6   FED8   FED7   qāf
0643 ك   FED9   FEDA   FEDC   FEDB   kāf
0644 ل   FEDD   FEDE   FEE0   FEDF   lām
0645 م   FEE1   FEE2   FEE4   FEE3   mīm
0646 ن   FEE5   FEE6   FEE8   FEE7   nūn
0647 ه   FEE9   FEEA   FEEC   FEEB   hāʾ
0648 و   FEED   FEEE   -   -   wāw
064A ي   FEF1   FEF2   FEF4   FEF3   yāʾ
0622 آ   FE81   FE82   -   -   ʾalif maddah
0629 ة   FE93   FE94   -   -   Tāʾ marbūṭah
0649 ى   FEEF   FEF0   -   -   ʾalif maqṣūrah
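The table can be read as a lookup rule: a letter takes its final or medial form when the preceding letter connects to it, and its initial or medial form when it connects to the following letter. The sketch below is a deliberately naive illustration of that rule, assuming only the letters in the table, no diacritics, and no lām-ʾalif ligature. The names `FORMS` and `shape` are hypothetical. Real text should stay in the general code points and be shaped by the font engine; the presentation-form output here is only for demonstration.

```python
import unicodedata

# base letter -> (isolated presentation form, connects to the following letter?)
# Code points copied from the contextual-forms table. In this block the
# final form is isolated+1 and, for dual-joining letters, initial is +2, medial +3.
FORMS = {
    0x0622: (0xFE81, False), 0x0627: (0xFE8D, False), 0x0628: (0xFE8F, True),
    0x0629: (0xFE93, False), 0x062A: (0xFE95, True),  0x062B: (0xFE99, True),
    0x062C: (0xFE9D, True),  0x062D: (0xFEA1, True),  0x062E: (0xFEA5, True),
    0x062F: (0xFEA9, False), 0x0630: (0xFEAB, False), 0x0631: (0xFEAD, False),
    0x0632: (0xFEAF, False), 0x0633: (0xFEB1, True),  0x0634: (0xFEB5, True),
    0x0635: (0xFEB9, True),  0x0636: (0xFEBD, True),  0x0637: (0xFEC1, True),
    0x0638: (0xFEC5, True),  0x0639: (0xFEC9, True),  0x063A: (0xFECD, True),
    0x0641: (0xFED1, True),  0x0642: (0xFED5, True),  0x0643: (0xFED9, True),
    0x0644: (0xFEDD, True),  0x0645: (0xFEE1, True),  0x0646: (0xFEE5, True),
    0x0647: (0xFEE9, True),  0x0648: (0xFEED, False), 0x0649: (0xFEEF, False),
    0x064A: (0xFEF1, True),
}

# (joined to previous, joined to next) -> offset from the isolated form
OFFSET = {(False, False): 0, (True, False): 1, (False, True): 2, (True, True): 3}


def shape(word: str) -> str:
    """Naively pick a Presentation Forms-B code point for each letter of `word`."""
    out = []
    for i, ch in enumerate(word):
        entry = FORMS.get(ord(ch))
        if entry is None:                  # not in the table: pass through unchanged
            out.append(ch)
            continue
        isolated, dual = entry
        prev = FORMS.get(ord(word[i - 1])) if i > 0 else None
        nxt = FORMS.get(ord(word[i + 1])) if i + 1 < len(word) else None
        joins_prev = prev is not None and prev[1]   # previous letter must connect forward
        joins_next = dual and nxt is not None       # every table letter accepts a join from before
        out.append(chr(isolated + OFFSET[(joins_prev, joins_next)]))
    return "".join(out)


if __name__ == "__main__":
    word = "\u0643\u062A\u0628"            # kāf, tāʾ, bāʾ
    shaped = shape(word)
    print([f"{ord(c):04X}" for c in shaped])
    # Compatibility normalization folds the presentation forms back to the base letters.
    print(unicodedata.normalize("NFKC", shaped) == word)
```

Letters with only two forms in the table (such as ʾalif or dāl) never connect to the following letter, which is why a word like dāl-ʾalif-rāʾ comes out entirely in isolated forms.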

Punctuation and ornaments

Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic-script typing. The Arabic comma is often replaced by the Latin-script comma ⟨,⟩, which also serves as the decimal separator when Eastern Arabic numerals are used (e.g. ⟨100.6⟩ compared to ⟨١٠٠,٦⟩).

  • U+060C ، ARABIC COMMA
  • U+060D ؍ ARABIC DATE SEPARATOR
  • U+060E ؎ ARABIC POETIC VERSE SIGN
  • U+060F ؏ ARABIC SIGN MISRA
  • U+061B ؛ ARABIC SEMICOLON
  • U+061E ؞ ARABIC TRIPLE DOT PUNCTUATION MARK
  • U+061F ؟ ARABIC QUESTION MARK
  • U+066D ٭ ARABIC FIVE POINTED STAR
  • U+06D4 ۔ ARABIC FULL STOP
  • U+06DD ۝ ARABIC END OF AYAH
  • U+06DE ۞ ARABIC START OF RUB EL HIZB
  • U+06E9 ۩ ARABIC PLACE OF SAJDAH
  • U+06FD ۽ ARABIC SIGN SINDHI AMPERSAND
  • U+FD3E ﴾ ORNATE LEFT PARENTHESIS
  • U+FD3F ﴿ ORNATE RIGHT PARENTHESIS
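Because digits and separators have their own Arabic code points, software that parses numbers usually converts them first. The following is a minimal sketch, assuming the dedicated separators U+066B (decimal) and U+066C (thousands) are used. A plain comma is left untouched because its role is ambiguous, as noted above. The function name `to_ascii_number` is hypothetical.

```python
import unicodedata


def to_ascii_number(text: str) -> str:
    """Replace Arabic-Indic and extended Arabic-Indic digits with ASCII digits.

    U+066B ARABIC DECIMAL SEPARATOR becomes '.', and
    U+066C ARABIC THOUSANDS SEPARATOR is dropped.
    Every other character is copied unchanged.
    """
    out = []
    for ch in text:
        if ch == "\u066B":
            out.append(".")
        elif ch == "\u066C":
            continue
        else:
            value = unicodedata.decimal(ch, None)   # None for non-digits
            out.append(str(value) if value is not None else ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "\u0661\u0660\u0660\u066B\u0666"       # Arabic-Indic digits 1 0 0, separator, 6
    print(to_ascii_number(sample), float(to_ascii_number(sample)))
```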

Word ligatures

Arabic Presentation Forms-A defines a few characters as "word ligatures" for terms frequently used in formulaic expressions in Arabic. They are rarely used outside professional liturgical typesetting; likewise, the word rial is normally written out in full rather than with the U+FDFC ligature.

  • U+FDF0 ARABIC LIGATURE SALLA USED AS KORANIC STOP SIGN ISOLATED FORM (صلى, stylized as صلے)
  • U+FDF1 ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM (قلى, stylized as قلے)
  • U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM (اللّٰه)
  • U+FDF3 ARABIC LIGATURE AKBAR ISOLATED FORM (اكبر), as in the phrase الله اكبر Allāhu akbar
  • U+FDF4 ARABIC LIGATURE MOHAMMAD ISOLATED FORM (محمد)
  • U+FDF5 ARABIC LIGATURE SALAM ISOLATED FORM (صلعم, the abbreviation for صلى الله عليه وسلم "peace be upon him")
  • U+FDF6 ARABIC LIGATURE RASOUL ISOLATED FORM (رسول)
  • U+FDF7 ARABIC LIGATURE ALAYHE ISOLATED FORM (عليه)
  • U+FDF8 ARABIC LIGATURE WASALLAM ISOLATED FORM (وسلم)
  • U+FDF9 ARABIC LIGATURE SALLA ISOLATED FORM (صلى)
  • U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM (صلى الله عليه وسلم "peace be upon him")
  • U+FDFB ARABIC LIGATURE JALLAJALALOUHOU (جل جلاله)
  • U+FDFC RIAL SIGN (ريال)
  • U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM (بسم الله الرحمن الرحيم bism-i llāh-i r-raḥmān-i r-raḥīm)
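Most of these word ligatures carry a compatibility decomposition, so text containing them can be expanded into ordinary letters, for example before searching or indexing. The sketch below uses Python's standard `unicodedata` module; the helper name `expand_ligatures` is hypothetical, and the behaviour depends on the Unicode data shipped with the interpreter.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC, which replaces compatibility characters by their expansions."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for cp in (0xFDF2, 0xFDFA, 0xFDFC):
        ch = chr(cp)
        expanded = expand_ligatures(ch)
        print(f"U+{cp:04X} {unicodedata.name(ch)}: {len(expanded)} code point(s) after NFKC")
```

NFKC also folds other compatibility characters (such as the contextual presentation forms), so it is a lossy step and is usually applied to a copy of the text rather than to the stored original.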

Code blocks

Arabic

Character table

Code Result Unicode name
U+0600 ؀   Arabic Number Sign
U+0601 ؁   Arabic Sign Sanah
U+0602 ؂   Arabic Footnote Marker
U+0603 ؃   Arabic Sign Safha
U+0604 ؄   Arabic Sign Samvat

used for writing Samvat era dates in Urdu

U+0605 ؅   Arabic Number Mark Above

may be used with Coptic Epact numbers

U+0606 ؆   Arabic-Indic Cube Root

→ U+221B ∛ Cube Root

U+0607 ؇   Arabic-Indic Fourth Root

→ U+221C ∜ Fourth Root

U+0608 ؈   Arabic Ray
U+0609 ؉   Arabic-Indic Per Mille Sign

→ U+2030 ‰ Per Mille Sign

U+060A ؊   Arabic-Indic Per Ten Thousand Sign

→ U+2031 ‱ Per Ten Thousand Sign

U+060B ؋   Afghani Sign
U+060C ،   Arabic Comma

also used with Thaana and Syriac in modern text

→ U+002C , Comma

→ U+2E32 ⸲ Turned Comma

→ U+2E41 ⹁ Reversed Comma

U+060D ؍   Arabic Date Separator
U+060E ؎   Arabic Poetic Verse Sign
U+060F ؏   Arabic Sign Misra
U+0610   ؐ   Arabic Sign Sallallahou Alayhe Wassallam

represents sallallahu alayhe wasallam "may God's peace and blessings be upon him"

U+0611   ؑ   Arabic Sign Alayhe Assallam

represents alayhe assalam "upon him be peace"

U+0612   ؒ   Arabic Sign Rahmatullah Alayhe

represents rahmatullah alayhe "may God have mercy upon him"

U+0613   ؓ   Arabic Sign Radi Allahou Anhu

represents radi allahu 'anhu "may God be pleased with him"

U+0614   ؔ   Arabic Sign Takhallus

sign placed over the name or nom-de-plume of a poet, or in some writings used to mark all proper names

U+0615 ؕ   Arabic Small High Tah

marks a recommended pause position in some Qurans published in Iran and Pakistan; should not be confused with the small TAH sign used as a diacritic for some letters such as 0679

U+0616   ؖ   Arabic Small High Ligature Alef With Lam With Yeh

= Arabic small high ligature alef with yeh barree; early Persian

U+0617 ؗ   Arabic Small High Zain
U+0618   ؘ   Arabic Small Fatha

should not be confused with 064E Fatha

U+0619   ؙ   Arabic Small Damma

should not be confused with 064F Damma

U+061A   ؚ   Arabic Small Kasra

should not be confused with 0650 Kasra

U+061B ؛   Arabic Semicolon

also used with Thaana and Syriac in modern text → U+003B ; Semicolon → U+204F ⁏ Reversed Semicolon → U+2E35 ⸵ Turned Semicolon

U+061C   ؜   Arabic Letter Mark (Alm)
U+061D   ؝   Arabic End Of Text Mark
U+061E ؞   Arabic Triple Dot Punctuation Mark
U+061F ؟   Arabic Question Mark

also used with Thaana and Syriac in modern text → U+003F ? Question Mark → U+2E2E ⸮ Reversed Question Mark

U+0620 ؠ   Arabic Letter Kashmiri Yeh
U+0621 ء   Arabic Letter Hamza

→ U+02BE ʾ Modifier Letter Right Half Ring

U+0622 آ   Arabic Letter Alef With Madda Above

≡ آ U+0627 U+0653

U+0623 أ   Arabic Letter Alef With Hamza Above

≡ أ U+0627 U+0654

U+0624 ؤ   Arabic Letter Waw With Hamza Above

≡ ؤ U+0648 U+0654

U+0625 إ   Arabic Letter Alef With Hamza Below

≡ إ U+0627 U+0655

U+0626 ئ   Arabic Letter Yeh With Hamza Above

in Kyrgyz the hamza is consistently positioned to the top right in isolate and final forms ≡ ئ U+064A U+0654

U+0627 ا   Arabic Letter Alef
U+0628 ب   Arabic Letter Beh
U+0629 ة   Arabic Letter Teh Marbuta
U+062A ت   Arabic Letter Teh
U+062B ث   Arabic Letter Theh
U+062C ج   Arabic Letter Jeem
U+062D ح   Arabic Letter Hah
U+062E خ   Arabic Letter Khah
U+062F د   Arabic Letter Dal
U+0630 ذ   Arabic Letter Thal
U+0631 ر   Arabic Letter Reh
U+0632 ز   Arabic Letter Zain
U+0633 س   Arabic Letter Seen
U+0634 ش   Arabic Letter Sheen
U+0635 ص   Arabic Letter Sad
U+0636 ض   Arabic Letter Dad
U+0637 ط   Arabic Letter Tah
U+0638 ظ   Arabic Letter Zah
U+0639 ع   Arabic Letter Ain

→ U+01B9 ƹ Latin Small Letter Ezh Reversed → U+02BF ʿ MODIFIER LETTER LEFT HALF RING

U+063A غ   Arabic Letter Ghain
U+063B ػ   Arabic Letter Keheh With Two Dots Above
U+063C ؼ   Arabic Letter Keheh With Three Dots Below
U+063D ؽ   Arabic Letter Farsi Yeh With Inverted V

Azerbaijani

U+063E ؾ   Arabic Letter Farsi Yeh With Two Dots Above
U+063F ؿ   Arabic Letter Farsi Yeh With Three Dots Above
U+0640 ـ   Arabic Tatweel

= kashida; inserted to stretch characters or to carry tashkil with no base letter; also used with Adlam, Hanifi Rohingya, Mandaic, Manichaean, Psalter Pahlavi, Sogdian, and Syriac

U+0641 ف   Arabic Letter Feh
U+0642 ق   Arabic Letter Qaf
U+0643 ك   Arabic Letter Kaf
U+0644 ل   Arabic Letter Lam
U+0645 م   Arabic Letter Meem

Sindhi uses a shape with a short tail

U+0646 ن   Arabic Letter Noon
U+0647 ه   Arabic Letter Heh
U+0648 و   Arabic Letter Waw
U+0649 ى   Arabic Letter Alef Maksura

represents YEH-shaped dual-joining letter with no dots in any positional form; not intended for use in combination with 0654 → U+0626 ئ Arabic Letter Yeh With Hamza Above

U+064A ي   Arabic Letter Yeh

loses its dots when used in combination with 0654; retains its dots when used in combination with other combining marks → U+08A8 ࢨ Arabic Letter Yeh With Two Dots Below And Hamza Above

U+064B   ً   Arabic Fathatan
U+064C   ٌ   Arabic Dammatan

a common alternative form is written as two intertwined dammas, one of which is turned 180 degrees

U+064D   ٍ   Arabic Kasratan
U+064E   َ   Arabic Fatha
U+064F   ُ   Arabic Damma
U+0650   ِ   Arabic Kasra
U+0651   ّ   Arabic Shadda
U+0652   ْ   Arabic Sukun

marks absence of a vowel after the base consonant; used in some Qurans to mark a long vowel as ignored; can have a variety of shapes, including a circular one and a shape that looks like 06E1 → U+06E1 ۡ Arabic Small High Dotless Head Of Khah

U+0653   ٓ   Arabic Maddah Above

used for madd jaa'iz in South Asian and Indonesian orthographies → U+089C ࢜ Arabic Madda Waajib → U+089E ࢞ Arabic Doubled Madda → U+089F ࢟ Arabic Half Madda Over Madda

U+0654   ٔ   Arabic Hamza Above

restricted to hamza and ezafe semantics; is not used as a diacritic to form new letters

U+0655   ٕ   Arabic Hamza Below
U+0656   ٖ   Arabic Subscript Alef
U+0657   ٗ   Arabic Inverted Damma

Kashmiri, Urdu, Swahili, Somali

U+0658   ٘   Arabic Mark Noon Ghunna

Baluchi; indicates nasalization in Urdu

U+0659   ٙ   Arabic Zwarakay

Pashto

U+065A   ٚ   Arabic Vowel Sign Small V Above

African languages

U+065B   ٛ   Arabic Vowel Sign Inverted Small V Above

African languages

U+065C   ٜ   Arabic Vowel Sign Dot Below

African languages; also used in Quranic text in African and other orthographies

U+065D   ٝ   Arabic Reversed Damma

African languages

U+065E   ٞ   Arabic Fatha With Two Dots

Kalami

U+065F   ٟ   Arabic Wavy Hamza Below

Kashmiri

U+0660 ٠   Arabic-Indic Digit Zero
U+0661 ١   Arabic-Indic Digit One
U+0662 ٢   Arabic-Indic Digit Two
U+0663 ٣   Arabic-Indic Digit Three
U+0664 ٤   Arabic-Indic Digit Four
U+0665 ٥   Arabic-Indic Digit Five
U+0666 ٦   Arabic-Indic Digit Six
U+0667 ٧   Arabic-Indic Digit Seven
U+0668 ٨   Arabic-Indic Digit Eight
U+0669 ٩   Arabic-Indic Digit Nine
U+066A ٪   Arabic Percent Sign

→ U+0025 % Percent Sign

U+066B ٫   Arabic Decimal Separator

the ordinary comma is most commonly used instead

→ U+002C , Comma

U+066C ٬   Arabic Thousands Separator

the Arabic comma is most commonly used instead

→ U+060C ، Arabic Comma

→ U+0027 ' Apostrophe

→ U+2019 ’ Right Single Quotation Mark

U+066D   ٭   Arabic Five Pointed Star

appearance rather variable

→ U+002A * Asterisk

U+066E ٮ   Arabic Letter Dotless Beh
U+066F ٯ   Arabic Letter Dotless Qaf
U+0670 ٰ   Arabic Letter Superscript Alef
U+0671 ٱ   Arabic Letter Alef Wasla

Quranic Arabic

U+0672 ٲ   Arabic Letter Alef With Wavy Hamza Above

Baluchi, Kashmiri

U+0673 ٳ   Arabic Letter Alef With Wavy Hamza Below (deprecated)[7]

Kashmiri; this character is deprecated and its use is strongly discouraged; the sequence 0627 065F is the preferred way of encoding this character.

U+0674 ٴ   Arabic Letter High Hamza

Kazakh, Jawi; forms digraphs

U+0675 ٵ   Arabic Letter High Hamza Alef

preferred spelling is ‏ٴا‎‏ U+0674 U+0627

U+0676 ٶ   Arabic Letter High Hamza Waw

preferred spelling is ‏ٴو‎‏ U+0674 U+0648

U+0677 ٷ   Arabic Letter U With Hamza Above

preferred spelling is ‏ٴۇ‎‏ U+0674 U+06C7

U+0678 ٸ   Arabic Letter High Hamza Yeh

preferred spelling is ٴی U+0674 U+06CC

U+0679 ٹ   Arabic Letter Tteh

Urdu

U+067A ٺ   Arabic Letter Tteheh

Sindhi

U+067B ٻ   Arabic Letter Beeh

Sindhi

U+067C ټ   Arabic Letter Teh With Ring

Pashto

U+067D ٽ   Arabic Letter Teh With Three Dots Above Downwards

Sindhi

U+067E پ   Arabic Letter Peh

Persian, Urdu, ...

U+067F ٿ   Arabic Letter Teheh

Sindhi

U+0680 ڀ   Arabic Letter Beheh

Sindhi

U+0681 ځ   Arabic Letter Hah With Hamza Above

Pashto, Sarikoli; represents the phoneme /dz/

U+0682 ڂ   Arabic Letter Hah With Two Dots Vertical Above

not used in modern Pashto

U+0683 ڃ   Arabic Letter Nyeh

Sindhi

U+0684 ڄ   Arabic Letter Dyeh

Sindhi, historically Bosnian

U+0685 څ   Arabic Letter Hah With Three Dots Above

Pashto, Khwarazmian, Sarikoli; represents the phoneme /ts/ in Pashto

U+0686 چ   Arabic Letter Tcheh

Persian, Urdu, ...

U+0687 ڇ   Arabic Letter Tcheheh

Sindhi

U+0688 ڈ   Arabic Letter Ddal

Urdu

U+0689 ډ   Arabic Letter Dal With Ring

Pashto

U+068A ڊ   Arabic Letter Dal With Dot Below

Sindhi, early Persian, Pegon, Malagasy

U+068B ڋ   Arabic Letter Dal With Dot Below And Small Tah

Lahnda

U+068C ڌ   Arabic Letter Dahal

Sindhi

U+068D ڍ   Arabic Letter Ddahal

Sindhi

U+068E ڎ   Arabic Letter Dul

older shape for DUL, now obsolete in Sindhi; Burushaski

U+068F ڏ   Arabic Letter Dal With Three Dots Above Downwards

Sindhi; current shape used for DUL

U+0690 ڐ   Arabic Letter Dal With Four Dots Above

Old Urdu, not in current use

U+0691 ڑ   Arabic Letter Rreh

Urdu

U+0692 ڒ   Arabic Letter Reh With Small V

Kurdish

U+0693 ړ   Arabic Letter Reh With Ring

Pashto

U+0694 ڔ   Arabic Letter Reh With Dot Below

Kurdish, early Persian

U+0695 ڕ   Arabic Letter Reh With Small V Below

Kurdish

U+0696 ږ   Arabic Letter Reh With Dot Below And Dot Above

Pashto

U+0697 ڗ   Arabic Letter Reh With Two Dots Above

Dargwa

U+0698 ژ   Arabic Letter Jeh

Persian, Urdu, ...

U+0699 ڙ   Arabic Letter Reh With Four Dots Above

Sindhi

U+069A ښ   Arabic Letter Seen With Dot Below And Dot Above

Pashto

U+069B ڛ   Arabic Letter Seen With Three Dots Below

early Persian

U+069C ڜ   Arabic Letter Seen With Three Dots Below And Three Dots Above

Moroccan Arabic

U+069D ڝ   Arabic Letter Sad With Two Dots Below

Turkic

U+069E ڞ   Arabic Letter Sad With Three Dots Above

Berber, Burushaski

U+069F ڟ   Arabic Letter Tah With Three Dots Above

Old Hausa

U+06A0 ڠ   Arabic Letter Ain With Three Dots Above

Jawi

U+06A1 ڡ   Arabic Letter Dotless Feh

Adighe

U+06A2 ڢ   Arabic Letter Feh With Dot Moved Below

Maghrib Arabic

U+06A3 ڣ   Arabic Letter Feh With Dot Below

Ingush

U+06A4 ڤ   Arabic Letter Veh

Middle Eastern Arabic for foreign words; Kurdish, Khwarazmian, early Persian, Jawi

U+06A5 ڥ   Arabic Letter Feh With Three Dots Below

North African Arabic for foreign words

U+06A6 ڦ   Arabic Letter Peheh

Sindhi

U+06A7 ڧ   Arabic Letter Qaf With Dot Above

Maghrib Arabic, Uyghur

U+06A8 ڨ   Arabic Letter Qaf With Three Dots Above

Tunisian and Algerian Arabic

U+06A9 ک   Arabic Letter Keheh

= kaf mashkula; Persian, Urdu, Sindhi, ...

U+06AA ڪ   Arabic Letter Swash Kaf

represents a letter distinct from Arabic KAF (0643) in Sindhi

U+06AB ګ   Arabic Letter Kaf With Ring

Pashto; may appear like an Arabic KAF (0643) with a ring below the base

U+06AC ڬ   Arabic Letter Kaf With Dot Above

use for the Jawi gaf is not recommended, although it may be found in some existing text data; recommended character for Jawi gaf is 0762 → U+0762 ݢ Arabic Letter Keheh With Dot Above

U+06AD ڭ   Arabic Letter Ng

Uyghur, Kazakh, Moroccan Arabic, early Jawi, early Persian, ...

U+06AE ڮ   Arabic Letter Kaf With Three Dots Below

Berber, early Persian; Pegon alternative for 08B4

U+06AF گ   Arabic Letter Gaf

Persian, Urdu, ...

U+06B0 ڰ   Arabic Letter Gaf With Ring

Lahnda

U+06B1 ڱ   Arabic Letter Ngoeh

Sindhi

U+06B2 ڲ   Arabic Letter Gaf With Two Dots Below

not used in Sindhi

U+06B3 ڳ   Arabic Letter Gueh

Sindhi, Saraiki

U+06B4 ڴ   Arabic Letter Gaf With Three Dots Above

not used in Sindhi; Karakalpak

U+06B5 ڵ   Arabic Letter Lam With Small V

Kurdish, historically Bosnian

U+06B6 ڶ   Arabic Letter Lam With Dot Above

Kurdish

U+06B7 ڷ   Arabic Letter Lam With Three Dots Above

Kurdish

U+06B8 ڸ   Arabic Letter Lam With Three Dots Below

Avar, Soqotri

U+06B9 ڹ   Arabic Letter Noon With Dot Below
U+06BA ں   Arabic Letter Noon Ghunna

Urdu, archaic Arabic; dotless in all four contextual forms

U+06BB ڻ   Arabic Letter Rnoon

dotless in all four contextual forms; Sindhi

U+06BC ڼ   Arabic Letter Noon With Ring

Pashto

U+06BD ڽ   Arabic Letter Noon With Three Dots Above

Jawi

U+06BE ھ   Arabic Letter Heh Doachashmee

forms aspirate digraphs in Urdu and other languages of South Asia; represents the glottal fricative /h/ in Uyghur

U+06BF ڿ   Arabic Letter Tcheh With Dot Above
U+06C0 ۀ   Arabic Letter Heh With Yeh Above

for ezafe, use 0654 over the language-appropriate base letter; actually a ligature, not an independent letter; = Arabic letter hamzah on ha (1.0) ≡ ۀ U+06D5 U+0654

U+06C1 ہ   Arabic Letter Heh Goal

Urdu

U+06C2 ۂ   Arabic Letter Heh Goal With Hamza Above

Urdu; actually a ligature, not an independent letter ≡ ۂ U+06C1 U+0654

U+06C3 ۃ   Arabic Letter Teh Marbuta Goal

Urdu

U+06C4 ۄ   Arabic Letter Waw With Ring

Kashmiri

U+06C5 ۅ   Arabic Letter Kirghiz Oe

Kyrgyz; a glyph variant occurs which replaces the looped tail with a horizontal bar through the tail

U+06C6 ۆ   Arabic Letter Oe

Uyghur, Kurdish, Kazakh, Azerbaijani, historically Bosnian

U+06C7 ۇ   Arabic Letter U

Azerbaijani, Kazakh, Kyrgyz, Uyghur

U+06C8 ۈ   Arabic Letter Yu

Uyghur

U+06C9 ۉ   Arabic Letter Kirghiz Yu

Kazakh, Kyrgyz, historically Bosnian

U+06CA ۊ   Arabic Letter Waw With Two Dots Above

Kurdish

U+06CB ۋ   Arabic Letter Ve

Uyghur, Kazakh

U+06CC ی   Arabic Letter Farsi Yeh

Arabic, Persian, Urdu, Kashmiri, ...; initial and medial forms of this letter have dots → U+0649 ى Arabic Letter Alef Maksura → U+064A ي Arabic Letter Yeh

U+06CD ۍ   Arabic Letter Yeh With Tail

Pashto, Sindhi

U+06CE ێ   Arabic Letter Yeh With Small V

Kurdish

U+06CF ۏ   Arabic Letter Waw With Dot Above

Jawi

U+06D0 ې   Arabic Letter E

Pashto, Uyghur; used as the letter bbeh in Sindhi

U+06D1 ۑ   Arabic Letter Yeh With Three Dots Below

Mende languages, Hausa

U+06D2 ے   Arabic Letter Yeh Barree

Urdu

U+06D3 ۓ   Arabic Letter Yeh Barree With Hamza Above

Urdu

U+06D4 ۔   Arabic Full Stop

Urdu

U+06D5 ە   Arabic Letter Ae

Uyghur, Kazakh, Kyrgyz

U+06D6 ۖ   Arabic Small High Ligature Sad With Lam With Alef Maksura
U+06D7 ۗ   Arabic Small High Ligature Qaf With Lam With Alef Maksura
U+06D8 ۘ   Arabic Small High Meem Initial Form
U+06D9 ۙ   Arabic Small High Lam Alef
U+06DA ۚ   Arabic Small High Jeem
U+06DB   ۛ   Arabic Small High Three Dots
U+06DC ۜ   Arabic Small High Seen
U+06DD ۝   Arabic End Of Ayah
U+06DE ۞   Arabic Start Of Rub El Hizb
U+06DF   ۟   Arabic Small High Rounded Zero

smaller than the typical circular shape used for 0652

U+06E0   ۠   Arabic Small High Upright Rectangular Zero

the term "rectangular zero" is a translation of the Arabic name of this sign

U+06E1 ۡ   Arabic Small High Dotless Head Of Khah

= Arabic jazm; presentation form of 0652, using font technology to select the variant is preferred; used in some Qurans to mark absence of a vowel → U+0652 ْ Arabic Sukun

U+06E2 ۢ   Arabic Small High Meem Isolated Form
U+06E3 ۣ   Arabic Small Low Seen
U+06E4   ۤ   Arabic Small High Madda

typically used with 06E5, 06E6, 06E7, and 08F3

U+06E5 ۥ   Arabic Small Waw

→ U+08D3 ࣓ Arabic Small Low Waw → U+08F3 ࣳ Arabic Small High Waw

U+06E6 ۦ   Arabic Small Yeh
U+06E7 ۧ   Arabic Small High Yeh
U+06E8 ۨ   Arabic Small High Noon
U+06E9 ۩   Arabic Place Of Sajdah

there is a range of acceptable glyphs for this character

U+06EA   ۪   Arabic Empty Centre Low Stop
U+06EB   ۫   Arabic Empty Centre High Stop
U+06EC   ۬   Arabic Rounded High Stop With Filled Centre

also used in Quranic text in African and other orthographies to represent wasla, ikhtilas, etc.

U+06ED ۭ   Arabic Small Low Meem
U+06EE ۮ   Arabic Letter Dal With Inverted V
U+06EF ۯ   Arabic Letter Reh With Inverted V

also used in early Persian

U+06F0 ۰   Extended Arabic-Indic Digit Zero
U+06F1 ۱   Extended Arabic-Indic Digit One
U+06F2 ۲   Extended Arabic-Indic Digit Two
U+06F3 ۳   Extended Arabic-Indic Digit Three
U+06F4 ۴   Extended Arabic-Indic Digit Four

Persian has a different glyph than Sindhi and Urdu

U+06F5 ۵   Extended Arabic-Indic Digit Five

Persian, Sindhi, and Urdu share glyph different from Arabic

U+06F6 ۶   Extended Arabic-Indic Digit Six

Persian, Sindhi, and Urdu have glyphs different from Arabic

U+06F7 ۷   Extended Arabic-Indic Digit Seven

Urdu and Sindhi have glyphs different from Arabic

U+06F8 ۸   Extended Arabic-Indic Digit Eight
U+06F9 ۹   Extended Arabic-Indic Digit Nine
U+06FA ۺ   Arabic Letter Sheen With Dot Below
U+06FB ۻ   Arabic Letter Dad With Dot Below
U+06FC ۼ   Arabic Letter Ghain With Dot Below
U+06FD ۽   Arabic Sign Sindhi Ampersand
U+06FE ۾   Arabic Sign Sindhi Postposition Men
U+06FF ۿ   Arabic Letter Heh With Inverted V
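The ≡ entries in the table mark canonical equivalences: the precomposed letter and the base-plus-mark sequence are the same text under normalization. The sketch below checks a few of them with Python's `unicodedata` module. It also shows one way to apply the recommended replacement for the deprecated U+0673; since that character has no decomposition, the substitution is a plain string replacement. The helper names are hypothetical.

```python
import unicodedata

# precomposed letter -> equivalent sequence, as listed in the character table
EQUIVALENTS = {
    "\u0622": "\u0627\u0653",   # alef with madda above
    "\u0623": "\u0627\u0654",   # alef with hamza above
    "\u0624": "\u0648\u0654",   # waw with hamza above
    "\u0625": "\u0627\u0655",   # alef with hamza below
    "\u0626": "\u064A\u0654",   # yeh with hamza above
    "\u06C0": "\u06D5\u0654",   # heh with yeh above
    "\u06C2": "\u06C1\u0654",   # heh goal with hamza above
}


def round_trips(precomposed: str, sequence: str) -> bool:
    """True if NFD yields the sequence and NFC restores the precomposed letter."""
    return (unicodedata.normalize("NFD", precomposed) == sequence
            and unicodedata.normalize("NFC", sequence) == precomposed)


def replace_deprecated(text: str) -> str:
    """Swap deprecated U+0673 for the preferred sequence U+0627 U+065F."""
    return text.replace("\u0673", "\u0627\u065F")


if __name__ == "__main__":
    for pre, seq in EQUIVALENTS.items():
        print(f"U+{ord(pre):04X}", round_trips(pre, seq))
```

Comparing strings after normalizing both sides to the same form (NFC or NFD) avoids false mismatches between the two spellings.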

Compact table

Arabic[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+060x  ؀   ؁   ؂   ؃   ؄   ؅  ؆ ؇ ؈ ؉ ؊ ؋ ، ؍ ؎ ؏
U+061x ؐ ؑ ؒ ؓ ؔ ؕ ؖ ؗ ؘ ؙ ؚ ؛  ALM  ؝ ؞ ؟
U+062x ؠ ء آ أ ؤ إ ئ ا ب ة ت ث ج ح خ د
U+063x ذ ر ز س ش ص ض ط ظ ع غ ػ ؼ ؽ ؾ ؿ
U+064x ـ ف ق ك ل م ن ه و ى ي ً ٌ ٍ َ ُ
U+065x ِ ّ ْ ٓ ٔ ٕ ٖ ٗ ٘ ٙ ٚ ٛ ٜ ٝ ٞ ٟ
U+066x ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ٪ ٫ ٬ ٭ ٮ ٯ
U+067x ٰ ٱ ٲ ٳ ٴ ٵ ٶ ٷ ٸ ٹ ٺ ٻ ټ ٽ پ ٿ
U+068x ڀ ځ ڂ ڃ ڄ څ چ ڇ ڈ ډ ڊ ڋ ڌ ڍ ڎ ڏ
U+069x ڐ ڑ ڒ ړ ڔ ڕ ږ ڗ ژ ڙ ښ ڛ ڜ ڝ ڞ ڟ
U+06Ax ڠ ڡ ڢ ڣ ڤ ڥ ڦ ڧ ڨ ک ڪ ګ ڬ ڭ ڮ گ
U+06Bx ڰ ڱ ڲ ڳ ڴ ڵ ڶ ڷ ڸ ڹ ں ڻ ڼ ڽ ھ ڿ
U+06Cx ۀ ہ ۂ ۃ ۄ ۅ ۆ ۇ ۈ ۉ ۊ ۋ ی ۍ ێ ۏ
U+06Dx ې ۑ ے ۓ ۔ ە ۖ ۗ ۘ ۙ ۚ ۛ ۜ  ۝  ۞ ۟
U+06Ex ۠ ۡ ۢ ۣ ۤ ۥ ۦ ۧ ۨ ۩ ۪ ۫ ۬ ۭ ۮ ۯ
U+06Fx ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹ ۺ ۻ ۼ ۽ ۾ ۿ
Notes
1.^ As of Unicode version 17.0
2.^ Unicode code point U+0673 is deprecated as of Unicode version 6.0

Arabic Supplement

Arabic Supplement[1]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+075x ݐ ݑ ݒ ݓ ݔ ݕ ݖ ݗ ݘ ݙ ݚ ݛ ݜ ݝ ݞ ݟ
U+076x ݠ ݡ ݢ ݣ ݤ ݥ ݦ ݧ ݨ ݩ ݪ ݫ ݬ ݭ ݮ ݯ
U+077x ݰ ݱ ݲ ݳ ݴ ݵ ݶ ݷ ݸ ݹ ݺ ݻ ݼ ݽ ݾ ݿ
Notes
1.^ As of Unicode version 17.0

Arabic Extended-B

Arabic Extended-B[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+087x
U+088x
U+089x  ࢐   ࢑ 
Notes
1.^ As of Unicode version 17.0
2.^ Grey areas indicate non-assigned code points

Arabic Extended-A

Arabic Extended-A[1]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+08Ax
U+08Bx
U+08Cx
U+08Dx
U+08Ex  ࣢ 
U+08Fx
Notes
1.^ As of Unicode version 17.0

Arabic Presentation Forms A

This block consists mostly of ligatures that can be composed from characters in the preceding charts, with the exception of the ornate parentheses ﴾ and ﴿; some are ligatures of common liturgical phrases.

Arabic Presentation Forms-A[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+FB5x
U+FB6x
U+FB7x ﭿ
U+FB8x
U+FB9x
U+FBAx
U+FBBx ﮿
U+FBCx
U+FBDx
U+FBEx
U+FBFx ﯿ
U+FC0x
U+FC1x
U+FC2x
U+FC3x ﰿ
U+FC4x
U+FC5x
U+FC6x
U+FC7x ﱿ
U+FC8x
U+FC9x
U+FCAx
U+FCBx ﲿ
U+FCCx
U+FCDx
U+FCEx
U+FCFx ﳿ
U+FD0x
U+FD1x
U+FD2x
U+FD3x ﴿
U+FD4x
U+FD5x
U+FD6x
U+FD7x ﵿ
U+FD8x
U+FD9x
U+FDAx
U+FDBx ﶿ
U+FDCx
U+FDDx
U+FDEx
U+FDFx ﷿
Notes
1.^ As of Unicode version 17.0
2.^ Black areas indicate noncharacters (code points that are guaranteed never to be assigned as encoded characters in the Unicode Standard)

Arabic Presentation Forms B

All of the forms in this block can be composed from characters in the basic Arabic chart.

Arabic Presentation Forms-B[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+FE7x ﹿ
U+FE8x
U+FE9x
U+FEAx
U+FEBx ﺿ
U+FECx
U+FEDx
U+FEEx
U+FEFx ZW NBSP
Notes
1.^ As of Unicode version 17.0
2.^ Grey areas indicate non-assigned code points
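The link between a presentation form and its base characters is recorded in the Unicode Character Database as a tagged compatibility decomposition. The sketch below reads that field through Python's `unicodedata.decomposition`; the helper name `describe` is hypothetical, and the output depends on the Unicode data bundled with the interpreter.

```python
import unicodedata
from typing import Optional, Tuple


def describe(ch: str) -> Optional[Tuple[str, str]]:
    """Return (tag, base characters) for a character with a tagged decomposition.

    For U+FE91 the database field is '<initial> 0628', which this returns
    as ('initial', '\u0628'). Returns None when there is no decomposition
    or the decomposition carries no <tag>.
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields or not fields[0].startswith("<"):
        return None
    tag = fields[0].strip("<>")
    base = "".join(chr(int(cp, 16)) for cp in fields[1:])
    return tag, base


if __name__ == "__main__":
    for cp in (0xFE8F, 0xFE90, 0xFE91, 0xFE92, 0xFEFB):
        print(f"U+{cp:04X}", describe(chr(cp)))
```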

Rumi Numeral Symbols

Rumi Numeral Symbols[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+10E6x 𐹠 𐹡 𐹢 𐹣 𐹤 𐹥 𐹦 𐹧 𐹨 𐹩 𐹪 𐹫 𐹬 𐹭 𐹮 𐹯
U+10E7x 𐹰 𐹱 𐹲 𐹳 𐹴 𐹵 𐹶 𐹷 𐹸 𐹹 𐹺 𐹻 𐹼 𐹽 𐹾
Notes
1.^ As of Unicode version 17.0
2.^ Grey area indicates non-assigned code point

Arabic Extended-C

Arabic Extended-C[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+10ECx 𐻂 𐻃 𐻄 𐻅 𐻆 𐻇
U+10EDx 𐻐 𐻑 𐻒 𐻓 𐻔 𐻕 𐻖 𐻗 𐻘
U+10EEx
U+10EFx 𐻺 𐻻 𐻼 𐻽 𐻾 𐻿
Notes
1.^ As of Unicode version 17.0
2.^ Grey areas indicate non-assigned code points

Indic Siyaq Numbers

Indic Siyaq Numbers[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+1EC7x 𞱱 𞱲 𞱳 𞱴 𞱵 𞱶 𞱷 𞱸 𞱹 𞱺 𞱻 𞱼 𞱽 𞱾 𞱿
U+1EC8x 𞲀 𞲁 𞲂 𞲃 𞲄 𞲅 𞲆 𞲇 𞲈 𞲉 𞲊 𞲋 𞲌 𞲍 𞲎 𞲏
U+1EC9x 𞲐 𞲑 𞲒 𞲓 𞲔 𞲕 𞲖 𞲗 𞲘 𞲙 𞲚 𞲛 𞲜 𞲝 𞲞 𞲟
U+1ECAx 𞲠 𞲡 𞲢 𞲣 𞲤 𞲥 𞲦 𞲧 𞲨 𞲩 𞲪 𞲫 𞲬 𞲭 𞲮 𞲯
U+1ECBx 𞲰 𞲱 𞲲 𞲳 𞲴
Notes
1.^ As of Unicode version 17.0
2.^ Grey areas indicate non-assigned code points

Ottoman Siyaq Numbers

Ottoman Siyaq Numbers[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+1ED0x 𞴁 𞴂 𞴃 𞴄 𞴅 𞴆 𞴇 𞴈 𞴉 𞴊 𞴋 𞴌 𞴍 𞴎 𞴏
U+1ED1x 𞴐 𞴑 𞴒 𞴓 𞴔 𞴕 𞴖 𞴗 𞴘 𞴙 𞴚 𞴛 𞴜 𞴝 𞴞 𞴟
U+1ED2x 𞴠 𞴡 𞴢 𞴣 𞴤 𞴥 𞴦 𞴧 𞴨 𞴩 𞴪 𞴫 𞴬 𞴭 𞴮 𞴯
U+1ED3x 𞴰 𞴱 𞴲 𞴳 𞴴 𞴵 𞴶 𞴷 𞴸 𞴹 𞴺 𞴻 𞴼 𞴽
U+1ED4x
Notes
1.^ As of Unicode version 17.0
2.^ Grey areas indicate non-assigned code points

Arabic Mathematical Alphabetic Symbols

Arabic Mathematical Alphabetic Symbols[1][2]
Official Unicode Consortium code chart (PDF)
  0 1 2 3 4 5 6 7 8 9 A B C D E F
U+1EE0x 𞸀 𞸁 𞸂 𞸃 𞸅 𞸆 𞸇 𞸈 𞸉 𞸊 𞸋 𞸌 𞸍 𞸎 𞸏
U+1EE1x 𞸐 𞸑 𞸒 𞸓 𞸔 𞸕 𞸖 𞸗 𞸘 𞸙 𞸚 𞸛 𞸜 𞸝 𞸞 𞸟
U+1EE2x 𞸡 𞸢 𞸤 𞸧 𞸩 𞸪 𞸫 𞸬 𞸭 𞸮 𞸯
U+1EE3x 𞸰 𞸱 𞸲 𞸴 𞸵 𞸶 𞸷 𞸹 𞸻
U+1EE4x 𞹂 𞹇 𞹉 𞹋 𞹍 𞹎 𞹏
U+1EE5x 𞹑 𞹒 𞹔 𞹗 𞹙 𞹛 𞹝 𞹟
U+1EE6x 𞹡 𞹢 𞹤 𞹧 𞹨 𞹩 𞹪 𞹬 𞹭 𞹮 𞹯
U+1EE7x 𞹰 𞹱 𞹲 𞹴 𞹵 𞹶 𞹷 𞹹 𞹺 𞹻 𞹼 𞹾
U+1EE8x 𞺀 𞺁 𞺂 𞺃 𞺄 𞺅 𞺆 𞺇 𞺈 𞺉 𞺋 𞺌 𞺍 𞺎 𞺏
U+1EE9x 𞺐 𞺑 𞺒 𞺓 𞺔 𞺕 𞺖 𞺗 𞺘 𞺙 𞺚 𞺛
U+1EEAx 𞺡 𞺢 𞺣 𞺥 𞺦 𞺧 𞺨 𞺩 𞺫 𞺬 𞺭 𞺮 𞺯
U+1EEBx 𞺰 𞺱 𞺲 𞺳 𞺴 𞺵 𞺶 𞺷 𞺸 𞺹 𞺺 𞺻
U+1EECx
U+1EEDx
U+1EEEx
U+1EEFx 𞻰 𞻱
Notes
1.^ As of Unicode version 17.0
2.^ Grey areas indicate non-assigned code points

from Grokipedia
The Arabic script in Unicode is encoded across multiple character blocks that support the cursive, right-to-left writing of Arabic and of more than 30 other languages, including Persian, Urdu, Pashto, Sindhi, and African languages such as Hausa and Berber, together with rendering rules for contextual forms and diacritics. Introduced in Unicode 1.0 in 1991 and expanded across subsequent versions up to Unicode 17.0 in 2025, the encoding addresses the script's mandatory cursive joining, in which letters take initial, medial, final, or isolated glyphs depending on their position in a word.
The core encoding resides in the Arabic block (U+0600–U+06FF), whose 256 code points include consonant letters (e.g., U+0627 ا ARABIC LETTER ALEF), diacritics that mark vowels, such as U+064E ◌َ ARABIC FATHA, and letters distinguished by dots (ijam), such as U+0628 ب ARABIC LETTER BEH. This block also incorporates the Arabic-Indic digits (U+0660–U+0669) and the extended Arabic-Indic digits (U+06F0–U+06F9) used for Persian, Sindhi, and Urdu.
Supplementary blocks extend support for regional and historical variants: the Arabic Supplement (U+0750–U+077F) adds letters for African languages and other regional variants; Arabic Extended-A (U+08A0–U+08FF) and Arabic Extended-B (U+0870–U+089F) add Qur'anic annotation marks and letter variants for non-Arabic languages; and Arabic Extended-C (U+10EC0–U+10EFF), introduced in Unicode 15.0, encodes rare manuscript forms from Ottoman and Persian traditions.
Rendering of Arabic text in Unicode relies on the Bidirectional Algorithm for right-to-left layout and on shaping engines that apply the script's Joining_Type and Joining_Group properties to form ligatures and positional variants automatically.
Diacritical marks, including the tashkil that indicate vowels (e.g., U+0650 ◌ِ ARABIC KASRA), stack above or below base letters. Their relative order under normalization is governed by Canonical_Combining_Class values, and the rendering of mark sequences is detailed in Unicode Technical Report #53. Presentation forms in blocks like Arabic Presentation Forms-A (U+FB50–U+FDFF) and -B (U+FE70–U+FEFF) provide precomposed ligatures and contextual forms for legacy compatibility, though modern systems favor logical encoding with dynamic shaping.
Specialized characters accommodate religious and scholarly needs, such as Quranic annotation symbols (e.g., U+06DD ۝ ARABIC END OF AYAH) and honorifics like U+0610 ◌ؐ ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM, supporting fidelity to sacred texts and calligraphic traditions. Overall, Unicode's Arabic script implementation balances universality with orthographic diversity, enabling digital preservation of over 1,400 years of literary heritage across the Islamic world and beyond.
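The combining classes mentioned above can be inspected directly, and they explain why normalization may reorder marks relative to the order in which they were typed. The sketch below uses Python's `unicodedata.combining`; the variable names are illustrative only. It shows canonical reordering, not the display-ordering rules of the technical report itself.

```python
import unicodedata

MARKS = {
    "FATHATAN": "\u064B",
    "FATHA": "\u064E",
    "DAMMA": "\u064F",
    "KASRA": "\u0650",
    "SHADDA": "\u0651",
    "SUKUN": "\u0652",
}

# Canonical_Combining_Class of each mark, read from the bundled Unicode data
CCC = {name: unicodedata.combining(ch) for name, ch in MARKS.items()}

# beh followed by shadda then fatha, a common typing order
typed = "\u0628\u0651\u064E"
# Canonical ordering sorts adjacent marks by ascending combining class,
# so the fatha (lower class) moves in front of the shadda.
normalized = unicodedata.normalize("NFC", typed)

if __name__ == "__main__":
    print(CCC)
    print([f"{ord(c):04X}" for c in typed], "->", [f"{ord(c):04X}" for c in normalized])
```

Both orders are canonically equivalent, which is why comparisons should be made on normalized text.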

Introduction

Overview

The Arabic script is a right-to-left abjad characterized by 28 basic letters that primarily represent consonants, with vowels often implied or indicated by optional diacritics. It accommodates variants for numerous languages beyond Arabic, including Persian (with four additional letters like پ for /p/), Urdu (adding letters like ڑ for retroflex sounds), and African scripts such as those used in Hausa, Fulfulde, and Wolof. This script's cursive nature requires letters to connect in specific ways depending on their position in a word, a feature that Unicode supports through advanced rendering mechanisms rather than precomposed forms for all combinations. Unicode adopts an atomic encoding model for the Arabic script, assigning unique code points to base letters while treating diacritics—such as the short vowel marks fatha, kasra, and damma—as independent combining characters that overlay the base. Contextual glyph shaping, which selects among a letter's four potential forms (initial, medial, final, or isolated), is performed by font rendering engines using the OpenType specification or similar technologies, ensuring flexibility across diverse writing systems. In Unicode 17.0, the Arabic script encompasses over 1,400 characters distributed across 11 blocks, enabling comprehensive coverage for Modern Standard Arabic orthography, intricate Qur'anic annotations (including specialized tatweel and ornamentation), historical numeral systems like Eastern Arabic and Rumi variants, and mathematical symbols tailored for Arabic-based expressions. Bidirectional text processing is integral to this encoding, with the Unicode Bidirectional Algorithm (UBA) resolving the visual order of mixed right-to-left Arabic and left-to-right Latin scripts by analyzing embedding levels and character directions. 
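The character directions that the Bidirectional Algorithm works from are exposed as the Bidi_Class property. The sketch below looks up a few classes and implements a simplified first-strong heuristic for guessing a paragraph's base direction. It is an assumption-laden simplification (it ignores directional isolates and embeddings), and the function name `base_direction` is hypothetical.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess paragraph direction from the first strong character.

    Classes 'R' and 'AL' are strong right-to-left and 'L' is strong
    left-to-right. Digits, marks, and punctuation are skipped.
    Falls back to 'ltr' when no strong character is found.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):
            return "rtl"
        if bidi == "L":
            return "ltr"
    return "ltr"


if __name__ == "__main__":
    samples = {
        "ARABIC LETTER BEH": "\u0628",
        "ARABIC-INDIC DIGIT ZERO": "\u0660",
        "EXTENDED ARABIC-INDIC DIGIT ZERO": "\u06F0",
        "ARABIC FATHA": "\u064E",
        "LATIN CAPITAL LETTER A": "A",
    }
    for name, ch in samples.items():
        print(name, unicodedata.bidirectional(ch))
    print(base_direction("\u0661\u0662 \u0628 abc"))
```

Note that the two digit sets carry different classes, so they can order differently when mixed with Latin text.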
Notably, a few characters have been deprecated to promote consistent usage; for instance, U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW was deprecated in Unicode 6.0 in favor of the sequence U+0627 ARABIC LETTER ALEF followed by U+065F ARABIC WAVY HAMZA BELOW. This approach underscores Unicode's emphasis on stability and interoperability for Arabic digital representation.
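The atomic model described above can be observed with Python's standard `unicodedata` module. The following is a minimal sketch, not a full grapheme-cluster implementation: it assumes that every character of general category `Mn` belongs to the preceding base letter, and the helper names `split_letters_and_marks` and `strip_marks` are hypothetical.

```python
import unicodedata

def split_letters_and_marks(text):
    """Group each base character with the nonspacing marks (Mn) that follow it."""
    clusters = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and clusters:
            clusters[-1][1].append(ch)      # attach mark to the previous base
        else:
            clusters.append((ch, []))       # start a new base character
    return clusters

def strip_marks(text):
    """Remove nonspacing marks, leaving only the consonantal skeleton."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# "kitab" with vowel marks: KAF, KASRA, TEH, FATHA, ALEF, BEH
word = "\u0643\u0650\u062A\u064E\u0627\u0628"

for base, marks in split_letters_and_marks(word):
    mark_names = [unicodedata.name(m) for m in marks]
    print(unicodedata.name(base), mark_names)

print(strip_marks(word) == "\u0643\u062A\u0627\u0628")
```

Stripping marks this way is a common preprocessing step for search, but it discards information, so it is best applied to a copy of the text rather than to the stored form.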

Historical Development

The encoding of the Arabic script in Unicode began with the adoption of the ISO/IEC 8859-6 standard, published in 1987, which defined a single-byte character set for basic Arabic letters, numerals, and diacritics used in the Arabic language. This framework was directly incorporated into Unicode Version 1.0, released in October 1991, establishing the core Arabic block (U+0600–U+06FF) to support essential script elements while aligning with emerging international standards for multilingual text processing. The inclusion addressed immediate needs for digital representation of Arabic in computing environments, drawing on the ISO standard's logical ordering.

Early expansions focused on rendering complexities inherent to the cursive Arabic script. Unicode Version 1.1 (June 1993) introduced the Arabic Presentation Forms-A (U+FB50–U+FDFF) and Arabic Presentation Forms-B (U+FE70–U+FEFF) blocks, encoding precomposed ligatures and contextual glyph variants to facilitate compatibility with legacy systems and initial text layout requirements. These additions mitigated challenges in displaying joined forms without advanced shaping engines, a key hurdle for the script's right-to-left, context-dependent behavior. Further growth came in Unicode 4.1 (March 2005), which added the Arabic Supplement block (U+0750–U+077F) for variant letters in non-Arabic languages, expanding the repertoire beyond classical Arabic orthography.

Subsequent versions emphasized support for regional and historical usages. Unicode 5.2 (October 2009) added the Rumi Numeral Symbols block (U+10E60–U+10E7F) for a numeral system historically used in North Africa and al-Andalus. Unicode 6.1 (January 2012) established the Arabic Extended-A block (U+08A0–U+08FF), with characters for Qur'anic annotations and African language variants, and the Arabic Mathematical Alphabetic Symbols block (U+1EE00–U+1EEFF), which provides tailed, stretched, looped, and double-struck letter forms for mathematical typography while avoiding shaping dependencies. Unicode 11.0 (June 2018) added Indic Siyaq Numbers (U+1EC70–U+1ECBF) for Mughal accounting practices in South Asia, and Unicode 12.0 (March 2019) added the Ottoman Siyaq Numbers block (U+1ED00–U+1ED4F), extending Siyaq support to Ottoman Turkish documents. Both Siyaq blocks address the need to encode non-positional, cursive numeral forms used in archival documents, following proposals that evaluated Siyaq systems for plain-text representation. Unicode 14.0 (September 2021) introduced Arabic Extended-B (U+0870–U+089F) for further Qur'anic marks and letters for African and other languages, and Unicode 15.0 (September 2022) added Arabic Extended-C (U+10EC0–U+10EFF), initially with three Qur'anic annotation characters to refine pronunciation marking in religious texts.

Key contributions came from industry and academic sources. Microsoft played a foundational role in early Unicode development, including Arabic glyph integration during Version 1.0 reviews, and later advanced shaping through OpenType specifications. SIL International submitted influential proposals, such as document L2/10-288R in 2010, advocating 35 characters for African (e.g., Fulah, Hausa) and Asian (e.g., Rohingya) languages, which informed additions in Extended-A and subsequent blocks like Extended-B in 2021.

Unicode 16.0 (September 2024) added four characters to Arabic Extended-C: three Pegon letters for Javanese and other Southeast Asian Islamic texts (e.g., U+10EC2 ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW) and the Qur'anic mark U+10EFC ARABIC COMBINING ALEF OVERLAY (used in Libya). Unicode 17.0 (September 2025) extended the same block further, bringing it to 21 assigned characters with additions such as the honorific ligatures U+10ED1–U+10ED8 and the marks U+10EFA ARABIC DOUBLE VERTICAL BAR BELOW and U+10EFB ARABIC SMALL LOW NOON. As the latest version as of November 2025, Unicode 17.0 encompasses over 1,400 Arabic-script code points across multiple blocks, with ongoing proposals targeting additional African and Southeast Asian variants to complete coverage of living orthographies.

Encoding Fundamentals

Contextual Forms and Glyph Shaping

The Arabic script is cursive: its letters change shape depending on their position within a joined sequence. A letter may be isolated (standalone, or between neighbors that do not join to it), initial (joined only to the following letter), medial (joined to both the preceding and following letters), or final (joined only to the preceding letter). For example, the letter beh (U+0628) appears as isolated ب, final ـب, initial بـ, or medial ـبـ, with the appropriate glyph selected dynamically during rendering rather than encoded as separate Unicode characters. Not every letter has all four forms: alef (ا, U+0627) joins only to the letter that precedes it and therefore has just an isolated and a final form.

Unicode defines shaping through the Joining_Type and Joining_Group properties, with 22 joining groups for standard Arabic letters (such as ALEF, BEH, LAM, and MEEM) that classify letters by shared base shape. The joining types are Dual_Joining (connects on both sides, e.g., beh), Right_Joining (connects only to the preceding letter, which sits on its right in right-to-left text, e.g., alef, dal, reh, and waw), Left_Joining (connects only to the following letter; no letter of the basic Arabic alphabet has this type), Non_Joining (no connections, e.g., the standalone hamza U+0621), Join_Causing (e.g., tatweel U+0640 and the zero width joiner), and Transparent (combining marks, which do not interrupt joining). Joining between compatible neighbors is mandatory, while ligature formation beyond that may be optional and stylistic.

These properties feed seven rules (R1–R7) for determining forms: transparent characters do not affect joining (R1); a right-joining letter takes its final form when the preceding character joins toward it (R2); a left-joining character correspondingly takes an initial form when the following character joins toward it (R3); dual-joining letters adopt medial, final, or initial forms according to which neighbors join (R4–R6); and any character to which none of these rules applies is isolated (R7).
Rendering relies on layout engines that analyze Unicode code points and apply these rules dynamically, without precomposed positional glyphs in the core encoding; instead, glyphs for base characters (U+0600–U+06FF) are substituted via OpenType font features in GSUB tables ('isol', 'init', 'medi', 'fina') and positioned with GPOS (e.g., 'curs' for cursive attachment). Libraries like HarfBuzz implement this process, taking text through stages of script analysis, glyph substitution, and positioning to produce the correct cursive flow.

For instance, in the Modern Standard Arabic word "kitab" (كتاب, logical order: U+0643 U+062A U+0627 U+0628), kaf takes its initial form (كـ, joining to the following teh), teh takes its medial form (ـتـ, joined to kaf before it and alef after it), alef takes its final form (ـا, joined to teh; being right-joining, it does not connect onward), and beh is therefore isolated (ب), since nothing joins to it on either side.

Extended languages build on these behaviors with additional joining groups. For Persian, the farsi yeh (U+06CC) belongs to the FARSI YEH group: its initial and medial forms carry two dots below (یـ, ـیـ), while its final and isolated forms are dotless (ـی, ی), unlike the standard yeh (U+064A), which keeps its dots in every position. In Urdu, similar extensions apply, with retroflex letters joining like the base letters they are built on, supported through the same OpenType features but with language system tags (e.g., 'URD ' for Urdu) to select appropriate alternates. This ensures compatibility across variants while maintaining Unicode's unified base encoding.
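The form-selection rules can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how HarfBuzz or any real shaper is implemented: Python's `unicodedata` does not expose Joining_Type, so the `JOINING_TYPE` table below is a hand-written subset covering only a few characters (a real implementation would load ArabicShaping.txt from the Unicode Character Database), and the function name `positional_forms` is hypothetical. Left_Joining characters are omitted because none occur in this subset.

```python
# Hand-written subset of Joining_Type values (assumption: illustration only).
# D = Dual_Joining, R = Right_Joining, U = Non_Joining,
# C = Join_Causing, T = Transparent
JOINING_TYPE = {
    "\u0628": "D", "\u062A": "D", "\u0643": "D", "\u0644": "D", "\u0645": "D",
    "\u0627": "R", "\u062F": "R", "\u0631": "R", "\u0648": "R",
    "\u0621": "U", "\u200C": "U",            # hamza, ZWNJ
    "\u0640": "C", "\u200D": "C",            # tatweel, ZWJ
    "\u064E": "T", "\u064F": "T", "\u0650": "T", "\u0651": "T", "\u0652": "T",
}

def joining_type(ch):
    """Look up the joining type; unknown characters are treated as non-joining."""
    return JOINING_TYPE.get(ch, "U")

def positional_forms(text):
    """Return 'isol', 'init', 'medi' or 'fina' per letter (None for non-letters)."""
    forms = []
    for i, ch in enumerate(text):
        jt = joining_type(ch)
        if jt not in ("D", "R"):
            forms.append(None)               # marks and controls take no form
            continue
        # Nearest non-transparent neighbours in logical order (rule R1).
        prev = next((joining_type(c) for c in reversed(text[:i])
                     if joining_type(c) != "T"), "U")
        nxt = next((joining_type(c) for c in text[i + 1:]
                    if joining_type(c) != "T"), "U")
        joins_prev = prev in ("D", "C")                      # neighbour joins onward
        joins_next = jt == "D" and nxt in ("D", "R", "C")    # only D joins onward
        if joins_prev and joins_next:
            forms.append("medi")
        elif joins_prev:
            forms.append("fina")
        elif joins_next:
            forms.append("init")
        else:
            forms.append("isol")
    return forms

kitab = "\u0643\u062A\u0627\u0628"           # KAF TEH ALEF BEH
print(positional_forms(kitab))               # ['init', 'medi', 'fina', 'isol']
```

The returned names match the OpenType feature tags a shaper would apply to each glyph; the marks in a vocalized word are skipped when looking for neighbours, so the letters of كِتَاب receive the same forms as those of كتاب.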

Diacritics and Combining Characters

The Arabic script employs a system of diacritics known as harakat or tashkil to indicate short vowels and other phonetic features, which are encoded in Unicode primarily as combining characters that attach to base letters. These marks are essential for precise pronunciation in educational texts, religious scriptures, and classical literature, where they overlay consonants without altering the letter's positional form. Basic diacritics include the fatha (U+064E َ, representing a short /a/ sound), damma (U+064F ُ, for /u/), kasra (U+0650 ِ, for /i/), shadda (U+0651 ّ, doubling the consonant), and sukun (U+0652 ْ, indicating no vowel). Tanween forms, which denote indefinite nouns with nunation, are similarly encoded as combining marks: fathatan (U+064B ً), dammatan (U+064C ٌ), and kasratan (U+064D ٍ).

Multiple diacritics can stack on a single base letter, and each has a canonical combining class (CCC) that fixes its position in normalized text. The tanween marks have CCC 27–29, the short vowels 30–32, shadda 33, and sukun 34, while other marks use the generic classes 220 (below the base) and 230 (above the base). Because canonical ordering sorts marks by ascending CCC, a normalized sequence stores a vowel before a shadda on the same letter even though the shadda should be rendered closest to the base. The Unicode Arabic Mark Transient Reordering Algorithm (AMTRA) therefore reorders marks at rendering time from the inside out, positioning shadda closest to the base, followed by modifier marks, then vowels, to handle stacks such as a beh with shadda and fatha (بَّ, typed as <U+0628, U+0651, U+064E>). In intricate cases, such as overriding default ordering for Qur'anic variants, the combining grapheme joiner (CGJ, U+034F) may be inserted to preserve sequence integrity without visual effect.

Extended diacritics support specialized orthographies and annotations, particularly in religious texts. Qur'anic annotations include the small high meem isolated form (U+06E2 ۢ, CCC=230, a recitation mark) and other signs like the small low seen (U+06E3 ۣ).
In Arabic Extended-A (U+08A0–U+08FF), additional marks cater to Warsh orthography variants prevalent in North and West African traditions, such as the Arabic sukun below (U+08D0 ࣐) and small low waw (U+08D3 ࣓), alongside annotations like the small high word sah (U+08CC ࣌, a pause sign). These extended marks integrate with basic tashkil to denote regional pronunciations or scriptural nuances, such as in the Warsh reading of the Qur'an. Tashkil usage exemplifies the system's flexibility; for instance, the word "kitab" (book) is encoded as <U+0643, U+0650, U+062A, U+064E, U+0627, U+0628> (كِتَاب), where kasra and fatha provide vowel guidance. While diacritics attach to base letters independently of their contextual shapes (initial, medial, final, or isolated), rendering engines apply them post-shaping for accurate placement. This separation ensures compatibility across fonts and layouts, though proper display requires support for Arabic-specific combining classes.
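The combining classes mentioned in this section, and their effect on normalization, can be inspected with Python's standard `unicodedata` module. The sketch below only demonstrates canonical ordering; it does not implement AMTRA, which is a rendering-time step that Python's standard library does not provide.

```python
import unicodedata

marks = {
    "FATHATAN": "\u064B",
    "FATHA": "\u064E",
    "KASRA": "\u0650",
    "SHADDA": "\u0651",
    "SUKUN": "\u0652",
}
for name, ch in marks.items():
    # combining() returns the Canonical_Combining_Class as an integer
    print(f"U+{ord(ch):04X} {name}: ccc={unicodedata.combining(ch)}")

# A beh typed as base + shadda + fatha ...
typed = "\u0628\u0651\u064E"
# ... is canonically reordered so the lower class (fatha, 30) precedes shadda (33).
normalized = unicodedata.normalize("NFC", typed)
print([f"U+{ord(c):04X}" for c in normalized])   # ['U+0628', 'U+064E', 'U+0651']
```

Both orders are canonically equivalent, which is why comparing or searching vocalized text is more reliable after normalizing both sides to the same form.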

Presentation and Compatibility Forms

The presentation and compatibility forms for the Arabic script in Unicode consist of precomposed characters that encode specific glyph shapes and ligatures, primarily to support interoperability with legacy encoding standards and font systems predating advanced text shaping capabilities. These forms address the cursive nature of Arabic by providing fixed representations of contextual variants, which were necessary in environments like early DOS code pages or 8-bit encodings that could not dynamically adjust glyphs based on adjacent characters.

The Arabic Presentation Forms-A block (U+FB50–U+FDFF) focuses on ligatures and contextual glyphs tailored for particular languages or religious contexts, such as those in Persian, Urdu, or Islamic honorifics. For instance, it includes U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM ﷲ, which combines multiple base letters into a single precomposed character for traditional rendering. These characters were encoded to round-trip data from standards requiring such fixed forms, but they are not intended for new content creation.

In contrast, the Arabic Presentation Forms-B block (U+FE70–U+FEFF) provides individual positional forms for core Arabic letters, capturing isolated, initial, medial, and final shapes, along with spacing versions of diacritics and certain lam-alef ligatures. An example is U+FE91 ARABIC LETTER BEH INITIAL FORM ﺑ, which represents the beh letter in its initial position within a word. This block supports compatibility with legacy systems by offering direct mappings to pre-shaped glyphs in fonts without built-in cursive joining logic.

Despite their utility for backward compatibility, these presentation forms are discouraged in contemporary usage because they duplicate functionality achievable through modern rendering techniques, such as OpenType's Glyph Substitution (GSUB) tables applied to base characters from the Arabic block (U+0600–U+06FF).
Relying on precomposed forms can lead to inefficiencies in storage, searchability, and extensibility, as they bypass dynamic shaping that adapts to font features, language variations, and complex sequences. Unicode explicitly recommends encoding text with base letters and marks, then using shaping engines for glyph selection to ensure robust, future-proof representation. For migrating legacy content that employs these forms, Unicode Normalization Form KC (NFKC) is advised, as it performs compatibility decomposition to break down presentation characters into their constituent base elements and combining marks, facilitating conversion to shaped text without loss of meaning. This process helps integrate older Arabic data into modern pipelines while preserving semantic integrity.
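The NFKC migration described above can be tried with Python's standard `unicodedata` module. This is a minimal sketch: the legacy string is a hand-built example of pre-shaped text, and a real migration would also need to review word ligatures in Presentation Forms-A, whose compatibility decompositions expand to whole phrases.

```python
import unicodedata

# "kitab" stored as pre-shaped Presentation Forms-B glyphs:
# KAF INITIAL, TEH MEDIAL, ALEF FINAL, BEH ISOLATED
legacy = "\uFEDB\uFE98\uFE8E\uFE8F"

# NFKC replaces each presentation form with its base letter.
modern = unicodedata.normalize("NFKC", legacy)
print([f"U+{ord(c):04X}" for c in modern])    # ['U+0643', 'U+062A', 'U+0627', 'U+0628']

# The lam-alef ligature decomposes into two base letters.
lam_alef = unicodedata.normalize("NFKC", "\uFEFB")
print([f"U+{ord(c):04X}" for c in lam_alef])  # ['U+0644', 'U+0627']

# The decomposition mapping records which positional form was encoded.
print(unicodedata.decomposition("\uFE91"))    # <initial> 0628
```

After conversion the text contains only base letters, so positional shaping is left to the rendering engine and the string matches text typed on a modern keyboard.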

Special Characters and Features

Punctuation and Ornaments

The Arabic script in Unicode includes several dedicated punctuation marks that support its right-to-left writing direction and orthographic traditions, distinct from Latin equivalents to ensure proper rendering and cultural accuracy. Core punctuation encompasses the Arabic comma (U+060C ،), used to separate clauses or items in lists, and also employed in modern texts with related scripts like Thaana and Syriac. The Arabic semicolon (U+061B ؛) functions similarly to its Latin counterpart for separating independent clauses, with shared usage in Thaana and Syriac orthography. The Arabic question mark (U+061F ؟) mirrors the Latin question mark in form but is encoded separately to align with right-to-left flow, appearing in Thaana and Syriac as well. Additionally, the Arabic full stop (U+06D4 ۔), often called a period or point, terminates sentences and is prevalent in Urdu, Persian, and Punjabi texts. Ornaments and decorative symbols enhance poetic and textual embellishment in Arabic script. The Arabic five-pointed star (U+066D ٭) serves as a bullet point, footnote marker, or emphasis symbol, with variable glyph appearances across fonts. Decoration marks include the Arabic poetic verse sign (U+060E ؎), which denotes the start of a verse in classical poetry, and the Arabic sign Misra (U+060F ؏), marking hemistich divisions in poetic lines. These elements are integral to literary and Quranic presentations, where they provide structural cues without altering the primary text flow. In contextual applications, particularly religious texts, the Arabic tatweel (U+0640 ـ), also known as kashida, is inserted between letters to justify line lengths or elongate words for aesthetic balance, and it can carry diacritics when no base letter is present. Currency signs like the Afghani sign (U+060B ؋) represent the Afghan currency in Pashto and Dari texts, derived from an abbreviation that has evolved into a logographic form. 
Bidirectional text processing requires careful handling of Arabic punctuation, because these marks do not all share one Bidi_Class. The Arabic comma (U+060C) is a common separator (Bidi_Class=CS), whereas the Arabic semicolon (U+061B), question mark (U+061F), and full stop (U+06D4) are strong right-to-left Arabic characters (Bidi_Class=AL), so they keep their right-to-left direction even when embedded next to left-to-right content such as English loanwords. Latin punctuation used in Arabic text, by contrast, is neutral (ON) and takes its direction from the surrounding characters. The same U+061F is used for Persian, Urdu, and other languages written in the Arabic script; its glyph resembles a mirrored Latin question mark (?), but it is a separately encoded character rather than a mirrored rendering of U+003F.
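The Bidi_Class of each mark can be checked directly with Python's standard `unicodedata` module. This short sketch only reads the property; it does not run the Unicode Bidirectional Algorithm itself, which Python's standard library does not implement.

```python
import unicodedata

samples = [
    "\u060C",  # ARABIC COMMA
    "\u061B",  # ARABIC SEMICOLON
    "\u061F",  # ARABIC QUESTION MARK
    "\u06D4",  # ARABIC FULL STOP
    "\u0628",  # ARABIC LETTER BEH, for comparison
    "?",       # Latin question mark, for comparison
    "A",       # Latin letter, for comparison
]

bidi_classes = {ch: unicodedata.bidirectional(ch) for ch in samples}
for ch, cls in bidi_classes.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cls}")
```

A character reported as `AL` or `L` has a strong direction of its own, while `ON` and `CS` are resolved from context by the bidirectional algorithm.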

Ligatures and Joined Forms

In the Arabic script, ligatures are essential for cursive joining, where certain letter combinations form unified glyphs to reflect natural handwriting flow. The most prevalent mandatory ligature is the lam-alef combination, formed from the base characters ARABIC LETTER LAM (U+0644) and ARABIC LETTER ALEF (U+0627), resulting in forms such as لا. This ligature is not precomposed in the core Arabic blocks but is generated dynamically by text shaping engines that analyze joining behavior, ensuring correct rendering in both its isolated and final positions. Compatibility variants of lam-alef ligatures, such as ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM (U+FEFB ﻻ), exist in the Arabic Presentation Forms-B block for legacy round-trip mapping, though modern usage favors shaping over these precomposed forms.

Religious ligatures hold profound cultural and spiritual significance in Islamic typography, particularly in sacred texts where they express reverence and prevent fragmentation of holy names. The ARABIC LIGATURE ALLAH ISOLATED FORM (U+FDF2 ﷲ), a precomposed character in the Arabic Presentation Forms-A block, represents the word "Allah" as a single glyph. Similarly, the ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM (U+FDFA ﷺ), encoding the phrase invoking blessings upon the Prophet Muhammad, appears in religious manuscripts and digital Qur'an editions. These compatibility characters, introduced in Unicode 1.1, allow precise rendering without relying on complex shaping, though their use is contextualized within broader Islamic calligraphic practices.

Discretionary ligatures extend beyond mandatory joins, enhancing aesthetic variety in specific calligraphic styles like Naskh and Nastaliq.
In Naskh, a style favored for print and digital text due to its clarity, optional ligatures such as those involving ya or waw may be applied for visual harmony, controlled via OpenType font features like the 'calt' (contextual alternates) table. Nastaliq, prominent in Persian and Urdu poetry, employs more intricate discretionary joins with sweeping connections, where font engines select variants based on stylistic sets to mimic manuscript fluidity. These are not enforced by Unicode's core joining rules but depend on font implementation for discretionary activation. Unicode encoding strategies for Arabic ligatures prioritize flexibility in plain text while accommodating advanced control in rich formats. Base letters are combined with the Zero Width Joiner (U+200D, ZWJ) to force joins or ligatures where standard shaping might separate them, such as in stylized or historical reproductions; conversely, the Zero Width Non-Joiner (U+200C, ZWNJ) inhibits unwanted connections. This approach avoids proliferation of precomposed characters in plain text, deferring specifics to rich text environments like HTML with CSS font features or PDF embedding, ensuring portability across systems. Historically, ligatures were integral to Ottoman manuscripts, where elaborate joins in styles like Diwani or Ruq'ah preserved artistic and scribal traditions in imperial documents and Qurans. Unicode's shaping model and presentation forms now enable digital preservation of these practices, allowing high-fidelity reproduction of Ottoman-era typography in archives and educational tools without loss of cursive integrity.
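The join controls mentioned above are ordinary code points that can be placed in a string like any other character. The sketch below is illustrative only: the Persian word is an assumed example of ZWNJ usage, and the helper `strip_join_controls` is a hypothetical convenience for loose matching, not something the Unicode Standard prescribes.

```python
import unicodedata

ZWNJ = "\u200C"   # ZERO WIDTH NON-JOINER: inhibits a join
ZWJ = "\u200D"    # ZERO WIDTH JOINER: requests a joining form

# Persian "mi-khaham": the prefix is kept visually separate with ZWNJ.
prefix = "\u0645\u06CC"                    # MEEM, FARSI YEH
stem = "\u062E\u0648\u0627\u0647\u0645"    # KHAH, WAW, ALEF, HEH, MEEM
separated = prefix + ZWNJ + stem
joined = prefix + stem

# Both controls are invisible format characters (general category Cf).
print(unicodedata.category(ZWNJ), unicodedata.category(ZWJ))

def strip_join_controls(text):
    """Drop ZWJ/ZWNJ, e.g. before a loose search comparison (hypothetical helper)."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")

print(separated == joined)                        # False: the control is real data
print(strip_join_controls(separated) == joined)   # True
```

Because ZWNJ changes how the word is displayed and can be orthographically significant, stripping it is appropriate only for matching, not for the stored text.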

Core Unicode Blocks

Arabic Block

The Arabic block in Unicode spans the code point range U+0600–U+06FF, encompassing 256 code points, all of which are now assigned. It serves as the foundational encoding for Modern Standard Arabic and also carries many letters used by other languages.

Key contents include the 28 basic letters of the Arabic alphabet, encoded within U+0621 (ARABIC LETTER HAMZA) to U+064A (ARABIC LETTER YEH), which form the skeleton of standard Arabic orthography; the same range also holds the hamza carriers, teh marbuta, alef maksura, and the tatweel (U+0640). Hamza forms are covered in the initial positions U+0621–U+0626, such as U+0621 (HAMZA) and U+0622 (ALEF WITH MADDA ABOVE), enabling the representation of glottal stops and elongated vowels. Core diacritics, essential for vowel indication and phonetic nuance, occupy U+064B–U+0652, including marks like U+064B (FATHATAN) for tanwin fatha (nunation with fatha) and U+0651 (SHADDA) for gemination. Arabic-Indic digits appear in U+0660–U+0669, offering the traditional numeral forms used in Arabic-speaking regions, from U+0660 (ZERO) to U+0669 (NINE).

Special characters in the block include punctuation and symbols for textual and religious contexts, such as U+0600 (ARABIC NUMBER SIGN), a format character that extends beneath a following sequence of digits, U+0601 (ARABIC SIGN SANAH) for year indications, and U+060E (ARABIC POETIC VERSE SIGN) to mark verses of poetry. The block's characters are grouped into categories for practical use: letters (primarily U+0621–U+06D5, covering basic and variant forms), combining marks (U+064B–U+065F for diacritics and vowel signs), punctuation (scattered positions like U+060C for comma and U+06D4 for full stop), and digits (U+0660–U+0669, plus the extended Arabic-Indic digits U+06F0–U+06F9). Notably, U+06DD (ARABIC END OF AYAH) is a symbol for Qur'anic verse endings that encloses the verse number following it, which requires specific rendering support.
Extensions for orthographic variants in non-Arabic languages are addressed in subsequent blocks like Arabic Supplement.
Category | Range | Examples | Notes
Basic Letters | U+0621–U+064A | U+0627 (ALEF), U+0633 (SEEN) | 28 core letters forming the Arabic alphabet
Hamza Forms | U+0621–U+0626 | U+0624 (WAW WITH HAMZA ABOVE) | Glottal stop and initial vowel carriers
Diacritics | U+064B–U+0652 | U+064E (FATHA), U+0650 (KASRA) | Vowel and consonant modifiers; combining
Digits | U+0660–U+0669 | U+0661 (ONE), U+0666 (SIX) | Arabic-Indic numerals
Special Punctuation/Symbols | Various (e.g., U+0600–U+060E) | U+060C (COMMA), U+060E (POETIC VERSE SIGN) | Includes number sign, sanah, and verse markers
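Because the Arabic-Indic digits U+0660–U+0669 carry decimal digit values in the Unicode Character Database, standard library routines can read them as numbers. The following short Python sketch shows two ways to do so; the sample string is an arbitrary illustration.

```python
import unicodedata

arabic_indic = "\u0661\u0669\u0668\u0667"   # the digits ONE, NINE, EIGHT, SEVEN

# int() accepts any characters with the Unicode decimal-digit property.
value = int(arabic_indic)
print(value)                                 # 1987

# Per-character digit values, e.g. for converting to ASCII digits.
ascii_digits = "".join(str(unicodedata.digit(ch)) for ch in arabic_indic)
print(ascii_digits)                          # 1987
```

Although Arabic text runs right to left, digit sequences are stored most significant digit first, so no reversal is needed before conversion.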

Arabic Supplement

The Arabic Supplement block occupies the Unicode range U+0750 to U+077F, encompassing 48 code points designed to encode variant forms of Arabic letters tailored for non-Arabic languages, with a primary emphasis on orthographies in Africa and Pakistan. Introduced in Unicode 4.1 and expanded in subsequent versions, this block addresses phonetic distinctions in regional scripts by providing precomposed letters rather than combining diacritics. All 48 assigned code points represent joining letters that participate in the cursive Arabic script shaping process, often with modifications to joining behaviors to suit specific linguistic needs. This block supports Arabic-script writing systems for African languages such as Hausa, Fulfulde, Songhay, and Wolof, spoken in regions including Nigeria, Chad, and other parts of Northern and Western Africa. It also includes letters for Berber languages, notably Amazigh in Morocco, as well as Pakistani languages like Khowar and Burushaski. For instance, U+0763 (ARABIC LETTER KEHEH WITH THREE DOTS ABOVE) is used in Berber orthographies to represent a distinct velar sound. Similarly, U+0750 (ARABIC LETTER BEH WITH THREE DOTS HORIZONTALLY BELOW) aids in encoding implosive or glottalized consonants in Hausa and related African languages. Another example is U+077A (ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE), which appears in Burushaski but reflects broader adaptations for tonal or emphatic features in African contexts. The characters emphasize dotted and stroked variants to differentiate sounds, such as additional i'jam (dot clusters) above or below base forms, or strokes like in U+075B (ARABIC LETTER REH WITH STROKE) for fricative realizations in African scripts. 
No combining diacritics are included; instead, the block prioritizes atomic letters that integrate into standard text rendering. Each letter inherits the joining behavior of the letter it is built on: variants of beh or keheh are dual-joining (with isolated, initial, medial, and final forms), while variants of reh or yeh barree are right-joining (with isolated and final forms only). This design ensures compatibility with standard Arabic rendering engines, though appropriate font support is still required to display the glyphs.
Code Point | Character Name | Language/Use Example
U+0750 | ARABIC LETTER BEH WITH THREE DOTS HORIZONTALLY BELOW | African languages (e.g., Hausa implosives)
U+075B | ARABIC LETTER REH WITH STROKE | Stroked variant for African fricatives
U+0763 | ARABIC LETTER KEHEH WITH THREE DOTS ABOVE | Berber (Amazigh) velars
U+077A | ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE | Adaptations in African and Pakistani scripts
Further specialized extensions for African Arabic-script languages appear in the Arabic Extended-B block.

Arabic Extended-A

The Arabic Extended-A block extends the Unicode encoding for the Arabic script by providing characters essential for orthographic variants, regional traditions, and annotation needs in non-standard Arabic usages. Spanning the range U+08A0 to U+08FF, it allocates 96 code points to accommodate these extensions. Introduced in Unicode version 6.1 in 2012, the block was designed to address gaps in representing diverse Arabic-script-based writing systems, particularly those requiring unique letter forms and diacritical marks. Its contents primarily support African orthographic traditions, such as the Warsh reading, as well as annotation systems for scholarly and religious texts.

The block consists mainly of letters and combining marks, which can be grouped into orthographic letters with joining behaviors, variant forms, and combining diacritics. Letters often feature modifications like dots or loops to distinguish phonemes in specific languages or dialects, while marks enable precise vowel and tone indications. For instance, the Warsh orthography, prevalent in North and West African Arabic traditions, uses characters such as U+08BB ARABIC LETTER AFRICAN FEH (ࢻ), whose dot placement follows the Maghrebi convention in Qur'anic and regional texts. Similarly, U+08AA ARABIC LETTER REH WITH LOOP (ࢪ) is a looped variant of the reh letter used in certain orthographies. These letters take the positional forms appropriate to their joining type, maintaining cursive connectivity in Arabic script rendering.

Diacritics in the block enhance annotation capabilities, particularly for religious and linguistic precision. Key examples include U+08E4 ARABIC CURLY FATHA (◌ࣤ), a stylized short vowel mark employed in Qur'anic annotations, with further Qur'anic specifics addressed in the Arabic Extended-B block.
Other marks, such as those for tone or emphasis in African language orthographies, combine above or below base letters to convey suprasegmental features. The block's design facilitates compatibility with core Arabic blocks, allowing seamless integration for languages like those in North Africa and Central Asia, including variants that align with Pashto orthographic extensions through shared script mechanics. Overall, these elements promote accurate digital representation of historical and contemporary Arabic-script materials without relying on compatibility forms.
Category | Description | Representative Examples
Letters | Modified base letters for regional phonetics, with joining support | U+08A0 ARABIC LETTER BEH WITH SMALL V BELOW (ࢠ); U+08BB ARABIC LETTER AFRICAN FEH (ࢻ) for Warsh
Joining Forms | Contextual variants for cursive flow, according to each letter's joining type | U+08AA ARABIC LETTER REH WITH LOOP (isolated: ࢪ), a right-joining letter
Diacritics | Combining marks for vowels, tones, and annotations | U+08E4 ARABIC CURLY FATHA (◌ࣤ) for Qur'anic readings; U+08CE ARABIC LARGE ROUND DOT ABOVE for emphasis

Arabic Extended-B

The Arabic Extended-B block encompasses the code point range U+0870 to U+089F, comprising 48 positions of which 42 are assigned characters. Introduced in Unicode 14.0 in September 2021, this block addresses needs in Qur'anic typography and orthographic extensions for non-Arabic languages. Its primary purpose is to facilitate precise rendering of Qur'anic annotations, including pause marks and elongation indicators that guide recitation and textual structure, while also providing letter variants for languages such as those spoken in Africa and Southeast Asia. For instance, symbols like ARABIC SMALL HIGH WORD AL-JUZ (U+0898) denote major sectional divisions in the Qur'an, and ARABIC DOUBLED MADDA (U+089E) represents extended vowel lengthening essential for rhythmic intonation. The block also includes ARABIC VERTICAL TAIL (U+088E), which serves as an abbreviation indicator in manuscripts.

In addition to Qur'anic features, the block includes letters tailored for non-Arabic languages, particularly in African contexts like Hausa and Manding, as well as Southeast Asian orthographies such as Pegon for Javanese. Examples include ARABIC LETTER TAH WITH DOT BELOW (U+088B) and ARABIC LETTER KEHEH WITH TWO DOTS VERTICALLY BELOW (U+088D), which distinguish sounds that the basic Arabic letters do not represent. The assigned characters form a mix of approximately 26 letters (primarily alef variants and modified consonants) and 16 marks or symbols, all with right-to-left directionality and, where applicable, joining behavior for proper cursive rendering.
Category | Approximate Count | Examples | Purpose
Letters | 26 | U+0870 ARABIC LETTER ALEF WITH ATTACHED FATHA; U+088B ARABIC LETTER TAH WITH DOT BELOW | Phonetic distinctions in non-Arabic languages, including African scripts
Marks & Symbols | 16 | U+088E ARABIC VERTICAL TAIL; U+0898 ARABIC SMALL HIGH WORD AL-JUZ | Qur'anic pauses, elongations, and annotations for recitation guidance
Further extensions for specialized Qur'anic marks appear in Arabic Extended-C.

Arabic Extended-C

The Arabic Extended-C block spans the code point range U+10EC0 to U+10EFF, encompassing 64 positions dedicated to specialized extensions of the Arabic script. Introduced in Unicode 15.0 (2022), it addresses encoding needs for regional variations in Qur'anic recitation and annotation, particularly in Turkey and Libya, while also supporting additional letters for the Pegon script used in Javanese Arabic writing in Indonesia. This block fills gaps in representing orthographic traditions that were previously unencoded or approximated, enhancing digital support for religious texts and minority scripts.

The block's development reflects ongoing efforts to incorporate diverse Arabic script usages. Unicode 15.0 assigned the initial three characters, all low-placed Qur'anic marks for Turkish recitation traditions. Unicode 16.0 (2024) added four more, comprising three variant letters for Pegon and one combining mark for Libyan Qur'ans, increasing the total to seven assigned code points. With Unicode 17.0 (2025), the count reached 21 assigned characters through further proposals, with additional characters provisionally allocated for Unicode 18.0, such as U+10EF9 ARABIC MARK CROWN for decorative marks.

Central to the block are Turkish Qur'anic marks, such as ARABIC SMALL LOW WORD SAKTA (U+10EFD), which signals a brief pause in reading; ARABIC SMALL LOW WORD QASR (U+10EFE), indicating shortened vowel pronunciation; and ARABIC SMALL LOW WORD MADDA (U+10EFF), denoting elongation. These low-placed diacritics differ from higher variants in earlier blocks, providing precise rendering for Turkish mushafs. For Libyan traditions, ARABIC COMBINING ALEF OVERLAY (U+10EFC) overlays an alef shape to mark specific recitational features.
Pegon letters, used in Indonesian Javanese contexts to adapt Arabic for local languages, include variants like Arabic letter dal with two dots vertically below (U+10EC2), Arabic letter tah with two dots vertically below (U+10EC3), and Arabic letter kaf with two dots vertically below (U+10EC4). Additional Indonesian-specific marks, such as Arabic small yeh barree with two dots below (U+10EC5) for unwritten yeh in Uthmanic rasm and Arabic small low noon (U+10EFB), further support Pegon orthography. The assigned characters emphasize diacritics for Qur'anic precision, variant letters for regional scripts, and ligatures for honorific phrases, exemplified by Arabic ligature alayhaa as-salaatu was-salaam (U+10ED1) and similar forms (U+10ED2–U+10ED8) used in religious texts. Other inclusions, like Arabic letter thin noon (U+10EC6) for medial forms in Warsh orthography and Arabic double vertical bar below (U+10EFA) as a tanween mark in Old Sindhi, highlight niche applications. With only 21 of 64 code points allocated in Unicode 17.0, the block's incomplete coverage underscores potential for future additions to encompass more unencoded variants from global Arabic traditions. It builds on prior extensions like Arabic Extended-B by targeting post-2022 specifics in religious and cultural annotations.
| Category | Examples | Usage Context | Code Points |
| --- | --- | --- | --- |
| Turkish Qur'anic Marks | Small low word sakta, qasr, madda | Recitation pauses and vowel adjustments in Turkish mushafs | U+10EFD–U+10EFF |
| Libyan Qur'anic Marks | Combining alef overlay | Overlay for alef in Libyan readings | U+10EFC |
| Pegon Letters (Indonesia) | Dal/tah/kaf with two dots vertically below | Variant letters in Javanese Arabic script | U+10EC2–U+10EC4 |
| Indonesian Marks | Small yeh barree with two dots below, small low noon | Uthmanic rasm and noon indications in Pegon | U+10EC5, U+10EFB |
| Honorific Ligatures | Alayhaa as-salaatu was-salaam variants | Religious phrases in texts | U+10ED1–U+10ED8 |
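The version-by-version counts for Arabic Extended-C can be checked with a few lines of arithmetic. The following is a minimal sketch, not an authoritative data file: the per-version lists are transcribed from the description in this section, and the two entries marked as assumptions (U+10EC7 and U+10ED0) are not named in the text and are included here only on the assumption that they are the remaining Unicode 17.0 additions.

```python
# Arabic Extended-C occupies U+10EC0..U+10EFF (64 positions).
BLOCK = range(0x10EC0, 0x10F00)

# Code points first assigned in each Unicode version, as described above.
ADDED = {
    "15.0": [0x10EFD, 0x10EFE, 0x10EFF],           # Turkish low-placed marks
    "16.0": [0x10EC2, 0x10EC3, 0x10EC4, 0x10EFC],  # Pegon letters, alef overlay
    "17.0": [
        0x10EC5, 0x10EC6,
        0x10EC7,                    # assumption: not named in the text
        0x10ED0,                    # assumption: not named in the text
        *range(0x10ED1, 0x10ED9),   # honorific ligatures U+10ED1..U+10ED8
        0x10EFA, 0x10EFB,
    ],
}


def cumulative_counts(added):
    """Return {version: total assigned so far}, in ascending version order."""
    total = 0
    counts = {}
    for version in sorted(added, key=lambda v: tuple(map(int, v.split(".")))):
        total += len(added[version])
        counts[version] = total
    return counts


counts = cumulative_counts(ADDED)
unassigned = len(BLOCK) - counts["17.0"]
print(counts, "unassigned:", unassigned)
```

Under these assumptions the running totals are 3, 7, and 21, leaving 43 of the 64 positions unassigned.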

Presentation and Compatibility Blocks

Arabic Presentation Forms-A

The Arabic Presentation Forms-A block provides compatibility characters for the Arabic script, encoding precomposed ligatures and contextual forms for legacy typesetting and display systems that lack built-in text shaping. These characters allow joined letter combinations and word forms to be rendered directly, without a complex rendering engine, which keeps older software and fonts for Arabic, Persian, Urdu, and related languages usable. The block focuses on multi-letter ligatures and on contextual forms of extended letters; the positional forms of the basic Arabic letters are in Arabic Presentation Forms-B.
Spanning the range U+FB50–U+FDFF, the block encompasses 688 code points, of which 631 are assigned to characters and 32 (U+FDD0–U+FDEF) are noncharacters. Among its key contents are language-specific letter variants, such as those for Persian and Urdu, exemplified by U+FB56 ARABIC LETTER PEH ISOLATED FORM, the isolated form of the letter peh (پ), which is not part of the standard Arabic alphabet. Another example is U+FB66 ARABIC LETTER TTEH ISOLATED FORM, for the retroflex "t" of Urdu. Each variant is encoded in the isolated, final, initial, and medial forms that the letter actually takes, to support basic display in non-shaping environments.
A significant portion of the block is dedicated to word ligatures, including 32 religious phrases and honorific expressions commonly used in Islamic texts and formulaic Arabic writing. These precomposed forms, such as U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM (﷽), allow the Basmala invocation to be rendered as a single character. Other notable examples are U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM (ﷲ), the word Allah, and U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM (ﷺ), the benediction that follows mentions of the Prophet Muhammad. Such ligatures preserve traditional calligraphic styles in digital text, particularly for Quranic and devotional content.
The characters in this block are grouped by function, including language-specific forms, two- or three-letter joins, and dedicated word ligatures for religious or idiomatic use. The following compact table summarizes representative categories with examples for quick reference:
| Category | Description | Examples (Code Point, Name, Glyph) |
| --- | --- | --- |
| Persian/Urdu/Sindhi Variants | Additional letters with contextual forms for non-Arabic languages | U+FB56, ARABIC LETTER PEH ISOLATED FORM, پ; U+FB57, ARABIC LETTER PEH FINAL FORM, ـپ; U+FB66, ARABIC LETTER TTEH ISOLATED FORM, ٹ |
| Common Letter Ligatures | Joined forms of two or more letters in contextual positions | U+FC05, ARABIC LIGATURE BEH WITH JEEM ISOLATED FORM, بج; U+FC0A, ARABIC LIGATURE BEH WITH YEH ISOLATED FORM, بي; U+FD92, ARABIC LIGATURE MEEM WITH JEEM WITH KHAH INITIAL FORM, مجخ |
| Religious Word Ligatures | Precomposed phrases for honorifics and invocations | U+FDF0, ARABIC LIGATURE SALLA USED AS KORANIC STOP SIGN ISOLATED FORM, ﷰ; U+FDF2, ARABIC LIGATURE ALLAH ISOLATED FORM, ﷲ; U+FDFD, ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM, ﷽ |
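Because these characters exist for compatibility, legacy text that contains them can usually be folded back to ordinary Arabic letters with compatibility normalization. The sketch below uses Python's standard `unicodedata` module; the helper name `fold_presentation_forms` is hypothetical, and NFKC is only one possible policy, since it also folds unrelated compatibility characters (full-width Latin, for example).

```python
import unicodedata


def fold_presentation_forms(text):
    """Replace compatibility characters with their nominal equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# U+FB56 ARABIC LETTER PEH ISOLATED FORM folds to the nominal letter U+067E.
peh = fold_presentation_forms("\uFB56")

# U+FC0A ARABIC LIGATURE BEH WITH YEH ISOLATED FORM folds to beh + yeh.
beh_yeh = fold_presentation_forms("\uFC0A")

# The raw mapping carries a tag naming the contextual form it stands for.
allah_mapping = unicodedata.decomposition("\uFDF2")

print([f"U+{ord(c):04X}" for c in peh])
print([f"U+{ord(c):04X}" for c in beh_yeh])
print(allah_mapping)
```

After folding, the shaping engine rather than the stored code point decides which contextual glyph to draw.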

Arabic Presentation Forms-B

The Arabic Presentation Forms-B block in Unicode spans the code point range U+FE70 to U+FEFF, encompassing 144 positions within the Basic Multilingual Plane. Of these, 141 characters are assigned, providing compatibility representations for Arabic script elements that predate advanced shaping engines. The block was introduced to support legacy digital Arabic typography, where fixed glyph forms were necessary because early systems had no dynamic contextual rendering.
The block primarily contains single-letter positional forms for Arabic characters, including isolated, initial, medial, and final variants, as well as spacing forms of the diacritic marks. For instance, positional forms of the letter beh (ب) include U+FE8F (ARABIC LETTER BEH ISOLATED FORM, ﺏ), U+FE90 (ARABIC LETTER BEH FINAL FORM, ﺐ), U+FE91 (ARABIC LETTER BEH INITIAL FORM, ﺑ), and U+FE92 (ARABIC LETTER BEH MEDIAL FORM, ﺒ). Diacritic forms include U+FE76 (ARABIC FATHA ISOLATED FORM, ﹶ), U+FE77 (ARABIC FATHA MEDIAL FORM, ﹷ), and U+FE70 (ARABIC FATHATAN ISOLATED FORM, ﹰ), which give marks such as fatha and shadda their own advance width. These forms ensure consistent appearance in environments lacking glyph substitution rules, such as early word processors or fixed-font displays.
The block is organized in two parts. The diacritic forms come first (U+FE70–U+FE7F), grouped by mark, with an isolated form and, for most marks, a medial form shown on a tatweel. The letter forms follow (U+FE80–U+FEFC), ordered by base letter and, within each letter, by positional state (isolated, final, initial, medial); letters that do not join to the following letter have only isolated and final forms. The block also includes the format character U+FEFF (ZERO WIDTH NO-BREAK SPACE), which serves as the byte order mark; its former use as a word joiner is deprecated in favor of U+2060 (WORD JOINER).
While these precomposed forms remain useful for round-trip compatibility in legacy data interchange, modern Unicode-conformant systems store text in the core Arabic block (U+0600–U+06FF) and generate the positional variants dynamically at rendering time.
| Category | Example Code Points | Representative Glyphs and Names |
| --- | --- | --- |
| Positional Letter Forms (e.g., Beh) | U+FE8F–U+FE92 | ﺏ (isolated), ﺐ (final), ﺑ (initial), ﺒ (medial) |
| Diacritic Variants (e.g., Fatha) | U+FE70, U+FE76–U+FE77 | ﹰ (fathatan isolated), ﹶ (fatha isolated), ﹷ (fatha medial) |
| Other Marks | U+FE71, U+FE7C | ﹱ (tatweel with fathatan), ﹼ (shadda isolated) |
This summary highlights the block's focus on single-letter compatibility forms; the only ligatures it contains are the lam-alef forms at U+FEF5–U+FEFC.
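The relationship between a positional form and its nominal letter is recorded in the Unicode Character Database as a tagged compatibility decomposition, which Python exposes through `unicodedata.decomposition`. The sketch below tabulates those mappings for a range of assigned code points; the helper name `positional_forms` is hypothetical, and the function assumes every code point in the range is assigned (an unassigned one has an empty decomposition and would raise `ValueError`).

```python
import unicodedata


def positional_forms(first, last):
    """Map each code point in [first, last] to (form tag, nominal letters)."""
    table = {}
    for cp in range(first, last + 1):
        # A mapping looks like "<initial> 0628": a tag, then hex code points.
        tag, *targets = unicodedata.decomposition(chr(cp)).split()
        letters = "".join(chr(int(t, 16)) for t in targets)
        table[cp] = (tag.strip("<>"), letters)
    return table


beh_forms = positional_forms(0xFE8F, 0xFE92)
for cp, (form, letters) in beh_forms.items():
    nominal = " ".join(f"U+{ord(c):04X}" for c in letters)
    print(f"U+{cp:04X}  {form:<8}  ->  {nominal}")

# The lam-alef ligatures map to two nominal letters rather than one.
lam_alef = positional_forms(0xFEFB, 0xFEFC)
```

All four beh forms map to U+0628, which is why modern text stores only the nominal letter and leaves the choice of form to the renderer.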

Numeral and Historical Symbol Blocks

Rumi Numeral Symbols

The Rumi Numeral Symbols block occupies the Unicode range U+10E60–U+10E7F, encompassing 32 code points of which 31 are assigned. This block encodes characters for the Rumi numeral system, a historical additive numbering method with limited positional features, employed primarily in North Africa (notably Fez, Morocco) and al-Andalus (Iberian Peninsula) from the 10th to 17th centuries CE. The system, also known as Fasi or zimam numerals, derives from Coptic or Greek-Coptic traditions and was used in Arabic-script manuscripts for foliation, chapter notations, accounting records, astronomical instruments, and mathematical calculations. Encoding these symbols supports scholarly analysis of historical Islamic science, mathematics, and commerce.
Introduced in Unicode version 5.2 (October 2009), the block facilitates the preservation and rendering of Rumi numerals in modern digital texts, particularly for academic and archival purposes. Unlike positional decimal systems, Rumi numerals operate additively: base symbols represent units, tens, or hundreds, with higher orders like thousands (up to 9000) formed by placing one or more horizontal bars beneath the base numeral. For instance, a bar under the symbol for 3 denotes 3000. Fractions are indicated by special symbols or by a slash separating numerator (positioned top-right) and denominator (bottom-left) relative to a base numeral.
The block's characters fall into three main groups: digits for units 1–9, numbers for tens and hundreds, and fractions. Digits run from U+10E60 𐹠 (Rumi Digit One) through U+10E68 𐹨 (Rumi Digit Nine). Tens run from U+10E69 𐹩 (Rumi Number Ten) through U+10E71 (Rumi Number Ninety), and hundreds from U+10E72 𐹲 (Rumi Number One Hundred) through U+10E7A 𐹺 (Rumi Number Nine Hundred).
Fractions comprise U+10E7B 𐹻 (Rumi Fraction One Half), U+10E7C 𐹼 (Rumi Fraction One Quarter), U+10E7D 𐹽 (Rumi Fraction One Third), and U+10E7E 𐹾 (Rumi Fraction Two Thirds).
| Category | Code Points | Examples | Description |
| --- | --- | --- | --- |
| Digits (1–9) | U+10E60–U+10E68 | 𐹠 (One), 𐹤 (Five), 𐹨 (Nine) | Basic unit symbols forming the foundation of additive combinations. |
| Numbers (10–900) | U+10E69–U+10E7A | 𐹩 (Ten), 𐹰 (Eighty), 𐹺 (Nine Hundred) | Symbols for tens, hundreds, and their multiples; used with digits for larger values. |
| Fractions | U+10E7B–U+10E7E | 𐹻 (One Half), 𐹽 (One Third) | Dedicated glyphs for common fractions in accounting and measurements; additional fractions via contextual notation. |
These symbols, drawn from manuscripts like those of 13th-century scholar Ibn al-Banna, exhibit glyph variations across regions but maintain consistent numerical values in Unicode encoding. The block's design ensures compatibility with Arabic-script rendering engines, though font support remains limited outside specialized collections.
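The additive principle can be illustrated by decomposing an integer into one hundreds symbol, one tens symbol, and one digit. This is a minimal sketch under stated assumptions: it takes the tens to occupy U+10E69–U+10E71 and the hundreds U+10E72–U+10E7A, emits symbols largest first in logical order (an illustrative choice, not a documented convention), and ignores the bar notation for thousands. The function names are hypothetical; the check relies on the numeric values that Python's `unicodedata` module reports for these characters.

```python
import unicodedata

RUMI_ONE = 0x10E60      # RUMI DIGIT ONE .. RUMI DIGIT NINE
RUMI_TEN = 0x10E69      # RUMI NUMBER TEN .. RUMI NUMBER NINETY
RUMI_HUNDRED = 0x10E72  # RUMI NUMBER ONE HUNDRED .. RUMI NUMBER NINE HUNDRED


def rumi_symbols(n):
    """Return Rumi characters whose values add up to n (1..999)."""
    if not 1 <= n <= 999:
        raise ValueError("this sketch covers 1..999 only")
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    out = []
    if hundreds:
        out.append(chr(RUMI_HUNDRED + hundreds - 1))
    if tens:
        out.append(chr(RUMI_TEN + tens - 1))
    if units:
        out.append(chr(RUMI_ONE + units - 1))
    return "".join(out)


def rumi_value(symbols):
    """Sum the Unicode numeric values of the given Rumi characters."""
    return int(sum(unicodedata.numeric(ch) for ch in symbols))


for ch in rumi_symbols(345):
    print(f"U+{ord(ch):05X}", unicodedata.name(ch))
```

Because each symbol carries its own value, the total does not depend on position, which is what distinguishes the system from place-value notation.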

Indic Siyaq Numbers

The Indic Siyaq Numbers block provides encoding for a historical numeral system derived from Arabic script influences, specifically tailored for accounting practices in the Indian subcontinent. This system emerged in South Asia by the 17th century under Mughal administration and persisted into the mid-20th century, appearing in printed works such as Gladwin's 1790 grammar of Persian and Urdu. It was particularly used in ledgers and financial records during the Mughal and British colonial eras, often in conjunction with Urdu and Persian scripts for decimal-based calculations in trade and administration.
The block spans the code point range U+1EC70 to U+1ECBF, allocating 80 positions, of which 68 are assigned to symbols. These characters were added in Unicode version 11.0, released in June 2018, to support digitization of historical manuscripts and archival documents. The notation is additive and non-positional, written from right to left with the largest denomination first, so that a value is the sum of its symbols rather than a function of their place.
Key contents include base units for numbers 1 through 9 (U+1EC71–U+1EC79, e.g., U+1EC71 𞱱 INDIC SIYAQ NUMBER ONE), as well as symbols for tens (U+1EC7A–U+1EC82, representing 10 to 90), hundreds (U+1EC83–U+1EC8B, 100 to 900), thousands (U+1EC8C–U+1EC94, 1,000 to 9,000), and ten thousands (U+1EC95–U+1EC9D, 10,000 to 90,000). Additional characters cover larger denominations such as the lakh (U+1EC9E 𞲞, 100,000) and the karor, or crore (U+1ECA1, 10,000,000), along with prefixed forms of the digits, a rupee mark (U+1ECB0 𞲰), alternate forms of one (U+1ECB1 𞲱), two (U+1ECB2 𞲲), and ten thousand (U+1ECB3 𞲳), an alternate lakh mark (U+1ECB4 𞲴), and fractions for one quarter (U+1ECAD 𞲭), one half (U+1ECAE 𞲮), and three quarters (U+1ECAF 𞲯). The system has no dedicated zero symbol.
This structure enables encoding of complex additive expressions, such as combining U+1EC71 (1) and U+1EC7A (10) to denote 11, facilitating the preservation of authentic historical notations without modern positional adaptations.
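The summation described here can be sketched as a decomposition of an integer into one symbol per non-zero decimal order, largest denomination first. The code below is illustrative only: the function names are hypothetical, it covers 1 to 99,999 using the five nine-character runs listed above, and it ignores the lakh and karor symbols, the prefixed and alternate forms, and any scribal conventions for ordering that real ledgers may follow.

```python
# (place value, code point of the symbol for 1 x that place value)
ORDERS = [
    (10000, 0x1EC95),  # ten thousands: U+1EC95..U+1EC9D
    (1000, 0x1EC8C),   # thousands:     U+1EC8C..U+1EC94
    (100, 0x1EC83),    # hundreds:      U+1EC83..U+1EC8B
    (10, 0x1EC7A),     # tens:          U+1EC7A..U+1EC82
    (1, 0x1EC71),      # units:         U+1EC71..U+1EC79
]


def indic_siyaq_code_points(n):
    """Decompose n (1..99999) into code points, largest denomination first."""
    if not 1 <= n <= 99999:
        raise ValueError("this sketch covers 1..99999 only")
    sequence = []
    for place, base in ORDERS:
        digit, n = divmod(n, place)
        if digit:  # a zero digit contributes no symbol at all
            sequence.append(base + digit - 1)
    return sequence


def indic_siyaq_value(sequence):
    """Add up the values of the code points produced above."""
    total = 0
    for cp in sequence:
        for place, base in ORDERS:
            if base <= cp < base + 9:
                total += (cp - base + 1) * place
                break
    return total


print([f"U+{cp:05X}" for cp in indic_siyaq_code_points(11)])
```

For 11 the sketch yields the ten symbol followed by the one symbol, and a value such as 90,909 needs only three symbols because the zero orders are simply omitted.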

Ottoman Siyaq Numbers

The Ottoman Siyaq Numbers block encodes a specialized historical numeral system based on stylized monograms derived from Arabic number names, used in Ottoman Turkish documents for decimal notation in financial accounting and astronomical calculations. This system, known as Siyakat, supplemented the standard Arabic script in bureaucratic records across the Ottoman Empire, where it was written in right-to-left additive sequences without a zero symbol. Primary numerals (1–9) were combined with order-specific forms (such as hooks for tens or specific ligatures for hundreds) to form higher values, often in stacked or horizontally adjacent configurations for compound numbers such as 555, which combines U+1ED17 𞴗 (five hundred), U+1ED05 𞴅 (five), and U+1ED0E 𞴎 (fifty).
Introduced in Unicode version 12.0 in March 2019, the block occupies U+1ED00–U+1ED4F in the Supplementary Multilingual Plane, 80 code points of which 61 are assigned (U+1ED01–U+1ED3D). The assigned characters comprise primary numbers (U+1ED01–U+1ED09, e.g., U+1ED01 𞴁 for one), tens (U+1ED0A–U+1ED12, e.g., U+1ED0A 𞴊 for ten), hundreds (U+1ED13–U+1ED1B, e.g., U+1ED13 𞴓 for one hundred), thousands (U+1ED1C–U+1ED24, e.g., U+1ED1C 𞴜 for one thousand), ten thousands (U+1ED25–U+1ED2D, e.g., U+1ED25 𞴥 for ten thousand), the multiplier U+1ED2E 𞴮 (marratan, used to form millions), alternate forms of several numbers (U+1ED2F–U+1ED3B), and two fractions, U+1ED3C 𞴼 (one half) and U+1ED3D 𞴽 (one sixth). These atomic characters represent larger values when sequenced, preserving the original system's ordering, in which the highest order precedes lower ones except that units precede tens in compounds.
In original manuscripts, Ottoman Siyaq symbols often appeared in vertically stacked arrangements for dense fiscal or astronomical tables, requiring specialized calligraphic rendering; Unicode encodes them as non-joining right-to-left characters (bidi class AL) without OpenType shaping, relying on font glyphs for visual fidelity to these stacked forms. The multiplier U+1ED2E extends the notation beyond ten thousands: 5,000,000, for example, is written with U+1ED20 𞴠 (five thousand), U+1ED2E 𞴮 (marratan), and U+1ED1C 𞴜 (one thousand), that is, five thousand times one thousand. Fractions supported precise notations in accounting ledgers and celestial computations. This contrasts with the simpler, non-stacked additive variants in the Indic Siyaq Numbers block. The encoding facilitates digital preservation and scholarly analysis of 16th- to 19th-century Ottoman texts, where Siyaq was prevalent in administrative and scientific contexts.
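The ordering rule stated above (highest order first, with units placed before tens) can be sketched as a small encoder. This is an illustration under assumptions rather than a description of any shipping implementation: the function name is hypothetical, the output is taken to be logical (storage) order, only 1 to 99,999 is covered, and the marratan multiplier, alternate forms, and fractions are left out.

```python
ONES = 0x1ED01           # U+1ED01..U+1ED09, one .. nine
TENS = 0x1ED0A           # U+1ED0A..U+1ED12, ten .. ninety
HUNDREDS = 0x1ED13       # U+1ED13..U+1ED1B
THOUSANDS = 0x1ED1C      # U+1ED1C..U+1ED24
TEN_THOUSANDS = 0x1ED25  # U+1ED25..U+1ED2D


def ottoman_siyaq_code_points(n):
    """Encode n (1..99999): highest order first, units before tens."""
    if not 1 <= n <= 99999:
        raise ValueError("this sketch covers 1..99999 only")
    # Split into decimal digits: ten thousands, thousands, hundreds, tens, units.
    tt, th, h, t, u = (int(d) for d in f"{n:05d}")
    # Emission order follows the rule: units are transposed ahead of tens.
    plan = (
        (tt, TEN_THOUSANDS),
        (th, THOUSANDS),
        (h, HUNDREDS),
        (u, ONES),
        (t, TENS),
    )
    return [base + digit - 1 for digit, base in plan if digit]


print([f"U+{cp:05X}" for cp in ottoman_siyaq_code_points(555)])
```

For 555 the sketch produces the symbols for five hundred, five, and fifty, in that order; a font then supplies the stacked or adjacent appearance.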

Mathematical and Specialized Symbols

Arabic Mathematical Alphabetic Symbols

The Arabic Mathematical Alphabetic Symbols block occupies the Unicode range U+1EE00–U+1EEFF in the Supplementary Multilingual Plane, encompassing 256 code points of which 143 are assigned to characters. Introduced in Unicode version 6.1 in January 2012, this block encodes stylized variants of Arabic letters tailored for mathematical expressions, drawing from Arabic mathematical traditions in which letters function as variables, unknowns, or coefficients with distinctive calligraphic modifications.
These symbols provide forms for letters of the Arabic alphabet in six primary styles, with coverage varying by style: isolated forms (U+1EE00–U+1EE1F), initial forms (U+1EE21–U+1EE3B), tailed forms (U+1EE42–U+1EE5F), stretched forms (U+1EE61–U+1EE7E), looped forms (U+1EE80–U+1EE9B), and double-struck forms (U+1EEA1–U+1EEBB). Two operator symbols at U+1EEF0 and U+1EEF1 complete the block. Each style adapts the letter shapes so that the same letter can stand for distinct mathematical objects. Key examples include the isolated alef at U+1EE00 (𞸀, ARABIC MATHEMATICAL ALEF) and the initial beh at U+1EE21 (𞸡, ARABIC MATHEMATICAL INITIAL BEH), with glyphs designed to integrate in right-to-left mathematical layouts.
The block supports precise typesetting of Arabic-script mathematics in digital formats, aligning with Unicode's mathematical typography framework as outlined in Unicode Technical Report #25, and enabling compatibility with MathML renderers and OpenType math fonts. In LaTeX environments, these characters are accessible via the unicode-math package when paired with fonts such as XITS Math, which includes glyphs for Arabic mathematical variants. This facilitates the reproduction of both historical manuscripts and modern Arabic mathematical notation without relying on ad hoc approximations.
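The style ranges lend themselves to a simple lookup, and the underlying letter can be recovered through compatibility normalization. The sketch below is illustrative: the function names are hypothetical, the isolated range is assumed to extend to U+1EE1F, and because each range contains unassigned gaps the lookup reports the style a code point would belong to without checking that it is assigned.

```python
import unicodedata

# (style, first code point, last code point)
STYLES = [
    ("isolated", 0x1EE00, 0x1EE1F),
    ("initial", 0x1EE21, 0x1EE3B),
    ("tailed", 0x1EE42, 0x1EE5F),
    ("stretched", 0x1EE61, 0x1EE7E),
    ("looped", 0x1EE80, 0x1EE9B),
    ("double-struck", 0x1EEA1, 0x1EEBB),
]


def math_style(ch):
    """Return the style whose range contains ch, or None if there is none."""
    cp = ord(ch)
    for name, first, last in STYLES:
        if first <= cp <= last:
            return name
    return None


def base_letter(ch):
    """Fold a mathematical letter to its nominal Arabic letter via NFKC."""
    return unicodedata.normalize("NFKC", ch)


initial_beh = "\U0001EE21"  # ARABIC MATHEMATICAL INITIAL BEH
print(math_style(initial_beh), f"U+{ord(base_letter(initial_beh)):04X}")
```

Folding discards the style, so it suits searching and indexing but not the storage of formulas in which the style carries meaning.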
