Arabic script in Unicode
Many scripts in Unicode, such as Arabic, have special orthographic rules that require certain combinations of letterforms to be combined into special ligature forms. In English, the common ampersand (&) developed from a ligature in which the handwritten Latin letters e and t (spelling et, Latin for and) were combined.[1] The rules governing ligature formation in Arabic can be quite complex, requiring special script-shaping technologies such as the Arabic Calligraphic Engine by Thomas Milo's DecoType.[2]
As of Unicode 17.0, the Arabic script is contained in the following blocks:[3]
- Arabic (0600–06FF, 256 characters)
- Arabic Supplement (0750–077F, 48 characters)
- Arabic Extended-B (0870–089F, 43 characters)
- Arabic Extended-A (08A0–08FF, 96 characters)
- Arabic Presentation Forms-A (FB50–FDFF, 656 characters)
- Arabic Presentation Forms-B (FE70–FEFF, 141 characters)
- Rumi Numeral Symbols (10E60–10E7F, 31 characters)
- Arabic Extended-C (10EC0–10EFF, 21 characters)
- Indic Siyaq Numbers (1EC70–1ECBF, 68 characters)
- Ottoman Siyaq Numbers (1ED00–1ED4F, 61 characters)
- Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)
The basic Arabic range encodes the standard letters, the most common diacritics and the Arabic-Indic digits, but does not encode contextual forms; the range U+0621–U+0652 is based directly on ISO 8859-6. The Arabic Supplement range encodes letter variants mostly used for writing African (non-Arabic) languages. The Arabic Extended-B and Arabic Extended-A ranges encode additional Qur'anic annotations and letter variants used for various non-Arabic languages. The Arabic Presentation Forms-A range encodes contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The Arabic Presentation Forms-B range encodes spacing forms of Arabic diacritics, and more contextual letter forms. The presentation forms are present only for compatibility with older standards, and are not currently needed for coding text.[4] The Arabic Mathematical Alphabetic Symbols block encodes characters used in Arabic mathematical expressions. The Indic Siyaq Numbers block contains a specialized subset of Arabic script that was in use for accounting in India under the Mughal Empire by the 17th century and remained in use through the middle of the 20th century.[5][6] The Ottoman Siyaq Numbers block contains a specialized subset of Arabic script, also known as Siyakat numbers, used for accounting in Ottoman Turkish documents.[6]
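The block ranges listed above lend themselves to a simple lookup. The following is a minimal sketch, not an official API: the table is copied from the list in this article, and the names `ARABIC_BLOCKS` and `arabic_block` are hypothetical helpers introduced only for illustration.

```python
from bisect import bisect_right

# (first, last, name) for each block listed above, sorted by first code point.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10E60, 0x10E7F, "Rumi Numeral Symbols"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1ED00, 0x1ED4F, "Ottoman Siyaq Numbers"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]
_STARTS = [first for first, _, _ in ARABIC_BLOCKS]


def arabic_block(ch):
    """Return the name of the Arabic-script block containing ch, or None."""
    cp = ord(ch)
    # Find the last block whose first code point is <= cp.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= ARABIC_BLOCKS[i][1]:
        return ARABIC_BLOCKS[i][2]
    return None
```

For example, `arabic_block("\u0628")` reports the basic Arabic block, while a presentation form such as `"\uFE90"` falls in Arabic Presentation Forms-B. A block lookup says nothing about whether a code point is actually assigned.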
Contextual forms
Below is a demonstration of the basic alphabet used in Modern Standard Arabic, illustrating how Arabic letters are expected to appear in different contexts. Code points listed as contextual forms "should not be used in general interchange".[4] Unicode has other methods of encoding the difference if necessary, such as the zero-width joiner.
| General Unicode | Isolated | Final (End) | Medial (Middle) | Initial (Beginning) | Name |
|---|---|---|---|---|---|
| 0627 ا | FE8D ﺍ | FE8E ﺎ | — | — | ʾalif |
| 0628 ب | FE8F ﺏ | FE90 ﺐ | FE92 ﺒ | FE91 ﺑ | bāʾ |
| 062A ت | FE95 ﺕ | FE96 ﺖ | FE98 ﺘ | FE97 ﺗ | tāʾ |
| 062B ث | FE99 ﺙ | FE9A ﺚ | FE9C ﺜ | FE9B ﺛ | ṯāʾ |
| 062C ج | FE9D ﺝ | FE9E ﺞ | FEA0 ﺠ | FE9F ﺟ | ǧīm |
| 062D ح | FEA1 ﺡ | FEA2 ﺢ | FEA4 ﺤ | FEA3 ﺣ | ḥāʾ |
| 062E خ | FEA5 ﺥ | FEA6 ﺦ | FEA8 ﺨ | FEA7 ﺧ | ḫāʾ |
| 062F د | FEA9 ﺩ | FEAA ﺪ | — | — | dāl |
| 0630 ذ | FEAB ﺫ | FEAC ﺬ | — | — | ḏāl |
| 0631 ر | FEAD ﺭ | FEAE ﺮ | — | — | rāʾ |
| 0632 ز | FEAF ﺯ | FEB0 ﺰ | — | — | zayn/zāy |
| 0633 س | FEB1 ﺱ | FEB2 ﺲ | FEB4 ﺴ | FEB3 ﺳ | sīn |
| 0634 ش | FEB5 ﺵ | FEB6 ﺶ | FEB8 ﺸ | FEB7 ﺷ | šīn |
| 0635 ص | FEB9 ﺹ | FEBA ﺺ | FEBC ﺼ | FEBB ﺻ | ṣād |
| 0636 ض | FEBD ﺽ | FEBE ﺾ | FEC0 ﻀ | FEBF ﺿ | ḍād |
| 0637 ط | FEC1 ﻁ | FEC2 ﻂ | FEC4 ﻄ | FEC3 ﻃ | ṭāʾ |
| 0638 ظ | FEC5 ﻅ | FEC6 ﻆ | FEC8 ﻈ | FEC7 ﻇ | ẓāʾ |
| 0639 ع | FEC9 ﻉ | FECA ﻊ | FECC ﻌ | FECB ﻋ | ʿayn |
| 063A غ | FECD ﻍ | FECE ﻎ | FED0 ﻐ | FECF ﻏ | ġayn |
| 0641 ف | FED1 ﻑ | FED2 ﻒ | FED4 ﻔ | FED3 ﻓ | fāʾ |
| 0642 ق | FED5 ﻕ | FED6 ﻖ | FED8 ﻘ | FED7 ﻗ | qāf |
| 0643 ك | FED9 ﻙ | FEDA ﻚ | FEDC ﻜ | FEDB ﻛ | kāf |
| 0644 ل | FEDD ﻝ | FEDE ﻞ | FEE0 ﻠ | FEDF ﻟ | lām |
| 0645 م | FEE1 ﻡ | FEE2 ﻢ | FEE4 ﻤ | FEE3 ﻣ | mīm |
| 0646 ن | FEE5 ﻥ | FEE6 ﻦ | FEE8 ﻨ | FEE7 ﻧ | nūn |
| 0647 ه | FEE9 ﻩ | FEEA ﻪ | FEEC ﻬ | FEEB ﻫ | hāʾ |
| 0648 و | FEED ﻭ | FEEE ﻮ | — | — | wāw |
| 064A ي | FEF1 ﻱ | FEF2 ﻲ | FEF4 ﻴ | FEF3 ﻳ | yāʾ |
| 0622 آ | FE81 ﺁ | FE82 ﺂ | — | — | ʾalif maddah |
| 0629 ة | FE93 ﺓ | FE94 ﺔ | — | — | Tāʾ marbūṭah |
| 0649 ى | FEEF ﻯ | FEF0 ﻰ | — | — | ʾalif maqṣūrah |
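The relationship between the contextual code points in the table and their base letters is recorded in the Unicode Character Database as tagged compatibility decompositions (for example `<final> 0628`). The sketch below reads those tags with Python's standard `unicodedata` module, and also shows how a base letter can be wrapped in U+200D ZERO WIDTH JOINER to ask a shaping engine for a particular form. The helper names `positional_form` and `request_form` are hypothetical, and whether the requested form is actually displayed depends on the font and renderer.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

POSITIONS = {"isolated", "final", "medial", "initial"}


def positional_form(ch):
    """Return (form, base) for a presentation-form character, else None."""
    decomposition = unicodedata.decomposition(ch)  # e.g. "<final> 0628"
    if not decomposition.startswith("<"):
        return None  # no decomposition, or a canonical one
    tag, *code_points = decomposition.split()
    form = tag.strip("<>")
    if form not in POSITIONS:
        return None
    base = "".join(chr(int(cp, 16)) for cp in code_points)
    return form, base


def request_form(letter, form):
    """Wrap a base letter in ZWJ so that a renderer may pick the given form."""
    before = ZWJ if form in {"final", "medial"} else ""
    after = ZWJ if form in {"initial", "medial"} else ""
    return before + letter + after
```

Storing the base letter (optionally with joiners) rather than the FE70–FEFF code point keeps the text searchable and leaves glyph selection to the rendering layer.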
Punctuation and ornaments
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing, and the Arabic comma is often replaced by the Latin script comma ⟨,⟩, which is also used as the decimal separator when the Eastern Arabic numerals are used (e.g. ⟨100.6⟩ compared to ⟨١٠٠,٦⟩).
- U+060C ، ARABIC COMMA
- U+060D ؍ ARABIC DATE SEPARATOR
- U+060E ؎ ARABIC POETIC VERSE SIGN
- U+060F ؏ ARABIC SIGN MISRA
- U+061B ؛ ARABIC SEMICOLON
- U+061E ؞ ARABIC TRIPLE DOT PUNCTUATION MARK
- U+061F ؟ ARABIC QUESTION MARK
- U+066D ٭ ARABIC FIVE POINTED STAR
- U+06D4 ۔ ARABIC FULL STOP
- U+06DD ARABIC END OF AYAH
- U+06DE ۞ ARABIC START OF RUB EL HIZB
- U+06E9 ۩ ARABIC PLACE OF SAJDAH
- U+06FD ۽ ARABIC SIGN SINDHI AMPERSAND
- U+FD3E ﴾ ORNATE LEFT PARENTHESIS
- U+FD3F ﴿ ORNATE RIGHT PARENTHESIS
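The decimal-separator convention mentioned above (a comma between Eastern Arabic digits) can be handled when converting such numerals to machine numbers. This is a minimal sketch under stated assumptions: the function name `parse_eastern_arabic_number` is hypothetical, only the Latin comma and U+066B ARABIC DECIMAL SEPARATOR are treated as decimal separators, and thousands separators are not supported.

```python
import unicodedata

# Characters treated as a decimal separator in this sketch (an assumption).
DECIMAL_SEPARATORS = {",", "\u066B"}


def parse_eastern_arabic_number(text):
    """Convert a numeral such as U+0661 U+0660 U+0660 ',' U+0666 to a float."""
    ascii_chars = []
    for ch in text:
        if ch in DECIMAL_SEPARATORS:
            ascii_chars.append(".")
        else:
            # unicodedata.decimal raises ValueError for non-digit characters,
            # so malformed input fails loudly instead of being guessed at.
            ascii_chars.append(str(unicodedata.decimal(ch)))
    return float("".join(ascii_chars))
```

Because `unicodedata.decimal` works for any decimal digit in the Unicode database, the same function also accepts the Extended Arabic-Indic digits (U+06F0–U+06F9) used for Persian and Urdu.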
Word ligatures
Arabic Presentation Forms-A has a few characters defined as "word ligatures" for terms frequently used in formulaic expressions in Arabic. They are rarely used outside professional liturgical typesetting; likewise, the rial sign is normally written out in full rather than with the ligature.
- U+FDF0 ﷰ ARABIC LIGATURE SALLA USED AS KORANIC STOP SIGN ISOLATED FORM (صلى, stylized as صلے)
- U+FDF1 ﷱ ARABIC LIGATURE QALA USED AS KORANIC STOP SIGN ISOLATED FORM (قلى, stylized as قلے)
- U+FDF2 ﷲ ARABIC LIGATURE ALLAH ISOLATED FORM (اللّٰه)
- U+FDF3 ﷳ ARABIC LIGATURE AKBAR ISOLATED FORM (اكبر), as in the phrase الله اكبر Allāhu akbar
- U+FDF4 ﷴ ARABIC LIGATURE MOHAMMAD ISOLATED FORM (محمد)
- U+FDF5 ﷵ ARABIC LIGATURE SALAM ISOLATED FORM (صلعم, the abbreviation for صلى الله عليه وسلم "peace be upon him")
- U+FDF6 ﷶ ARABIC LIGATURE RASOUL ISOLATED FORM (رسول)
- U+FDF7 ﷷ ARABIC LIGATURE ALAYHE ISOLATED FORM (عليه)
- U+FDF8 ﷸ ARABIC LIGATURE WASALLAM ISOLATED FORM (وسلم)
- U+FDF9 ﷹ ARABIC LIGATURE SALLA ISOLATED FORM (صلى)
- U+FDFA ﷺ ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM (صلى الله عليه وسلم "peace be upon him")
- U+FDFB ﷻ ARABIC LIGATURE JALLAJALALOUHOU (جل جلاله)
- U+FDFC ﷼ RIAL SIGN (ريال)
- U+FDFD ﷽ ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM (بسم الله الرحمن الرحيم bism-i llāh-i r-raḥmān-i r-raḥīm)
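Several of the word ligatures listed above carry compatibility decompositions in the Unicode Character Database, so compatibility normalization (NFKC) can rewrite them as ordinary letters. The sketch below uses Python's standard `unicodedata` module; `expand_word_ligature` and `describe` are hypothetical helper names, and the exact result depends on the Unicode version bundled with the Python interpreter. A character without a compatibility decomposition comes back unchanged.

```python
import unicodedata


def expand_word_ligature(ch):
    """Return the compatibility (NFKC) expansion of a single character."""
    return unicodedata.normalize("NFKC", ch)


def describe(ch):
    """Return (code point label, Unicode name, NFKC expansion) for ch."""
    label = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, "<unnamed>")
    return label, name, expand_word_ligature(ch)


if __name__ == "__main__":
    # Print every code point in U+FDF0..U+FDFD with its expansion length.
    for cp in range(0xFDF0, 0xFDFE):
        label, name, expanded = describe(chr(cp))
        print(label, name, len(expanded))
```

Applying NFKC before searching or indexing lets a query typed with ordinary letters match text that was stored with one of these ligature code points.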
Code blocks
Arabic
Character table
| Code | Result | Unicode name |
|---|---|---|
| U+0600 | | Arabic Number Sign |
| U+0601 | | Arabic Sign Sanah |
| U+0602 | | Arabic Footnote Marker |
| U+0603 | | Arabic Sign Safha |
| U+0604 | | Arabic Sign Samvat
used for writing Samvat era dates in Urdu |
| U+0605 | | Arabic Number Mark Above
may be used with Coptic Epact numbers |
| U+0606 | ؆ | Arabic-Indic Cube Root
→ U+221B ∛ Cube Root |
| U+0607 | ؇ | Arabic-Indic Fourth Root
→ U+221C ∜ Fourth Root |
| U+0608 | ؈ | Arabic Ray |
| U+0609 | ؉ | Arabic-Indic Per Mille Sign
→ U+2030 ‰ Per Mille Sign |
| U+060A | ؊ | Arabic-Indic Per Ten Thousand Sign
→ U+2031 ‱ Per Ten Thousand Sign |
| U+060B | ؋ | Afghani Sign |
| U+060C | ، | Arabic Comma
also used with Thaana and Syriac in modern text → U+002C , Comma → U+2E32 ⸲ Turned Comma → U+2E41 ⹁ Reversed Comma |
| U+060D | ؍ | Arabic Date Separator |
| U+060E | ؎ | Arabic Poetic Verse Sign |
| U+060F | ؏ | Arabic Sign Misra |
| U+0610 | ؐ | Arabic Sign Sallallahou Alayhe Wassallam
represents sallallahu alayhe wasallam "may God's peace and blessings be upon him" |
| U+0611 | ؑ | Arabic Sign Alayhe Assallam
represents alayhe assalam "upon him be peace" |
| U+0612 | ؒ | Arabic Sign Rahmatullah Alayhe
represents rahmatullah alayhe "may God have mercy upon him" |
| U+0613 | ؓ | Arabic Sign Radi Allahou Anhu
represents radi allahu 'anhu "may God be pleased with him" |
| U+0614 | ؔ | Arabic Sign Takhallus
sign placed over the name or nom-de-plume of a poet, or in some writings used to mark all proper names |
| U+0615 | ؕ | Arabic Small High Tah
marks a recommended pause position in some Qurans published in Iran and Pakistan; should not be confused with the small TAH sign used as a diacritic for some letters such as 0679 |
| U+0616 | ؖ | Arabic Small High Ligature Alef With Lam With Yeh
early Persian Arabic Small High Ligature Alef With Yeh Barree |
| U+0617 | ؗ | Arabic Small High Zain |
| U+0618 | ؘ | Arabic Small Fatha
should not be confused with 064E Fatha |
| U+0619 | ؙ | Arabic Small Damma
should not be confused with 064F Damma |
| U+061A | ؚ | Arabic Small Kasra
should not be confused with 0650 Kasra |
| U+061B | ؛ | Arabic Semicolon
also used with Thaana and Syriac in modern text → U+003B ; Semicolon → U+204F ⁏ Reversed Semicolon → U+2E35 ⸵ Turned Semicolon |
| U+061C | | Arabic Letter Mark (Alm) |
| U+061D | ؝ | Arabic End Of Text Mark |
| U+061E | ؞ | Arabic Triple Dot Punctuation Mark |
| U+061F | ؟ | Arabic Question Mark
also used with Thaana and Syriac in modern text → U+003F ? Question Mark → U+2E2E ⸮ Reversed Question Mark |
| U+0620 | ؠ | Arabic Letter Kashmiri Yeh |
| U+0621 | ء | Arabic Letter Hamza
→ U+02BE ʾ Modifier Letter Right Half Ring |
| U+0622 | آ | Arabic Letter Alef With Madda Above
≡ آ U+0627 U+0653 |
| U+0623 | أ | Arabic Letter Alef With Hamza Above
≡ أ U+0627 U+0654 |
| U+0624 | ؤ | Arabic Letter Waw With Hamza Above
≡ ؤ U+0648 U+0654 |
| U+0625 | إ | Arabic Letter Alef With Hamza Below
≡ إ U+0627 U+0655 |
| U+0626 | ئ | Arabic Letter Yeh With Hamza Above
in Kyrgyz the hamza is consistently positioned to the top right in isolate and final forms; ≡ ئ U+064A U+0654 |
| U+0627 | ا | Arabic Letter Alef |
| U+0628 | ب | Arabic Letter Beh |
| U+0629 | ة | Arabic Letter Teh Marbuta |
| U+062A | ت | Arabic Letter Teh |
| U+062B | ث | Arabic Letter Theh |
| U+062C | ج | Arabic Letter Jeem |
| U+062D | ح | Arabic Letter Hah |
| U+062E | خ | Arabic Letter Khah |
| U+062F | د | Arabic Letter Dal |
| U+0630 | ذ | Arabic Letter Thal |
| U+0631 | ر | Arabic Letter Reh |
| U+0632 | ز | Arabic Letter Zain |
| U+0633 | س | Arabic Letter Seen |
| U+0634 | ش | Arabic Letter Sheen |
| U+0635 | ص | Arabic Letter Sad |
| U+0636 | ض | Arabic Letter Dad |
| U+0637 | ط | Arabic Letter Tah |
| U+0638 | ظ | Arabic Letter Zah |
| U+0639 | ع | Arabic Letter Ain
→ U+01B9 ƹ Latin Small Letter Ezh Reversed → U+02BF ʿ Modifier Letter Left Half Ring |
| U+063A | غ | Arabic Letter Ghain |
| U+063B | ػ | Arabic Letter Keheh With Two Dots Above |
| U+063C | ؼ | Arabic Letter Keheh With Three Dots Below |
| U+063D | ؽ | Arabic Letter Farsi Yeh With Inverted V
Azerbaijani |
| U+063E | ؾ | Arabic Letter Farsi Yeh With Two Dots Above |
| U+063F | ؿ | Arabic Letter Farsi Yeh With Three Dots Above |
| U+0640 | ـ | Arabic Tatweel
inserted to stretch characters or to carry tashkil with no base letter; also used with Adlam, Hanifi Rohingya, Mandaic, Manichaean, Psalter Pahlavi, Sogdian, and Syriac; = kashida |
| U+0641 | ف | Arabic Letter Feh |
| U+0642 | ق | Arabic Letter Qaf |
| U+0643 | ك | Arabic Letter Kaf |
| U+0644 | ل | Arabic Letter Lam |
| U+0645 | م | Arabic Letter Meem
Sindhi uses a shape with a short tail |
| U+0646 | ن | Arabic Letter Noon |
| U+0647 | ه | Arabic Letter Heh |
| U+0648 | و | Arabic Letter Waw |
| U+0649 | ى | Arabic Letter Alef Maksura
represents YEH-shaped dual-joining letter with no dots in any positional form; not intended for use in combination with 0654 → U+0626 ئ Arabic Letter Yeh With Hamza Above |
| U+064A | ي | Arabic Letter Yeh
loses its dots when used in combination with 0654; retains its dots when used in combination with other combining marks → U+08A8 ࢨ Arabic Letter Yeh With Two Dots Below And Hamza Above |
| U+064B | ً | Arabic Fathatan |
| U+064C | ٌ | Arabic Dammatan
a common alternative form is written as two intertwined dammas, one of which is turned 180 degrees |
| U+064D | ٍ | Arabic Kasratan |
| U+064E | َ | Arabic Fatha |
| U+064F | ُ | Arabic Damma |
| U+0650 | ِ | Arabic Kasra |
| U+0651 | ّ | Arabic Shadda |
| U+0652 | ْ | Arabic Sukun
marks absence of a vowel after the base consonant; used in some Qurans to mark a long vowel as ignored; can have a variety of shapes, including a circular one and a shape that looks like '06E1' → U+06E1 ۡ Arabic Small High Dotless Head Of Khah |
| U+0653 | ٓ | Arabic Maddah Above
used for madd jaa'iz in South Asian and Indonesian orthographies → U+089C ࢜ Arabic Madda Waajib → U+089E ࢞ Arabic Doubled Madda → U+089F ࢟ Arabic Half Madda Over Madda |
| U+0654 | ٔ | Arabic Hamza Above
restricted to hamza and ezafe semantics is not used as a diacritic to form new letters |
| U+0655 | ٕ | Arabic Hamza Below |
| U+0656 | ٖ | Arabic Subscript Alef |
| U+0657 | ٗ | Arabic Inverted Damma
Kashmiri, Urdu, Swahili, Somali |
| U+0658 | ٘ | Arabic Mark Noon Ghunna
Baluchi; indicates nasalization in Urdu |
| U+0659 | ٙ | Arabic Zwarakay
Pashto |
| U+065A | ٚ | Arabic Vowel Sign Small V Above
African languages |
| U+065B | ٛ | Arabic Vowel Sign Inverted Small V Above
African languages |
| U+065C | ٜ | Arabic Vowel Sign Dot Below
African languages; also used in Quranic text in African and other orthographies |
| U+065D | ٝ | Arabic Reversed Damma
African languages |
| U+065E | ٞ | Arabic Fatha With Two Dots
Kalami |
| U+065F | ٟ | Arabic Wavy Hamza Below
Kashmiri |
| U+0660 | ٠ | Arabic-Indic Digit Zero |
| U+0661 | ١ | Arabic-Indic Digit One |
| U+0662 | ٢ | Arabic-Indic Digit Two |
| U+0663 | ٣ | Arabic-Indic Digit Three |
| U+0664 | ٤ | Arabic-Indic Digit Four |
| U+0665 | ٥ | Arabic-Indic Digit Five |
| U+0666 | ٦ | Arabic-Indic Digit Six |
| U+0667 | ٧ | Arabic-Indic Digit Seven |
| U+0668 | ٨ | Arabic-Indic Digit Eight |
| U+0669 | ٩ | Arabic-Indic Digit Nine |
| U+066A | ٪ | Arabic Percent Sign
→ U+0025 % Percent Sign |
| U+066B | ٫ | Arabic Decimal Separator
the ordinary comma is most commonly used instead → U+002C , Comma |
| U+066C | ٬ | Arabic Thousands Separator
the Arabic comma is most commonly used instead → U+060C ، Arabic Comma → U+0027 ' Apostrophe → U+2019 ’ Right Single Quotation Mark |
| U+066D | ٭ | Arabic Five Pointed Star
appearance rather variable → U+002A * Asterisk |
| U+066E | ٮ | Arabic Letter Dotless Beh |
| U+066F | ٯ | Arabic Letter Dotless Qaf |
| U+0670 | ٰ | Arabic Letter Superscript Alef |
| U+0671 | ٱ | Arabic Letter Alef Wasla
Quranic Arabic |
| U+0672 | ٲ | Arabic Letter Alef With Wavy Hamza Above
Baluchi, Kashmiri |
| U+0673 | ٳ | Arabic Letter Alef With Wavy Hamza Below (deprecated)[7]
Kashmiri; this character is deprecated and its use is strongly discouraged; the sequence 0627 065F is the preferred way of encoding this character. |
| U+0674 | ٴ | Arabic Letter High Hamza
Kazakh, Jawi; forms digraphs |
| U+0675 | ٵ | Arabic Letter High Hamza Alef
preferred spelling is ٴا U+0674 U+0627 |
| U+0676 | ٶ | Arabic Letter High Hamza Waw
preferred spelling is ٴو U+0674 U+0648 |
| U+0677 | ٷ | Arabic Letter U With Hamza Above
preferred spelling is ٴۇ U+0674 U+06C7 |
| U+0678 | ٸ | Arabic Letter High Hamza Yeh
preferred spelling is ٴی U+0674 U+06CC |
| U+0679 | ٹ | Arabic Letter Tteh
Urdu |
| U+067A | ٺ | Arabic Letter Tteheh
Sindhi |
| U+067B | ٻ | Arabic Letter Beeh
Sindhi |
| U+067C | ټ | Arabic Letter Teh With Ring
Pashto |
| U+067D | ٽ | Arabic Letter Teh With Three Dots Above Downwards
Sindhi |
| U+067E | پ | Arabic Letter Peh
Persian, Urdu, ... |
| U+067F | ٿ | Arabic Letter Teheh
Sindhi |
| U+0680 | ڀ | Arabic Letter Beheh
Sindhi |
| U+0681 | ځ | Arabic Letter Hah With Hamza Above
Pashto, Sarikoli; represents the phoneme /dz/ |
| U+0682 | ڂ | Arabic Letter Hah With Two Dots Vertical Above
not used in modern Pashto |
| U+0683 | ڃ | Arabic Letter Nyeh
Sindhi |
| U+0684 | ڄ | Arabic Letter Dyeh
Sindhi, historically Bosnian |
| U+0685 | څ | Arabic Letter Hah With Three Dots Above
Pashto, Khwarazmian, Sarikoli; represents the phoneme /ts/ in Pashto |
| U+0686 | چ | Arabic Letter Tcheh
Persian, Urdu, ... |
| U+0687 | ڇ | Arabic Letter Tcheheh
Sindhi |
| U+0688 | ڈ | Arabic Letter Ddal
Urdu |
| U+0689 | ډ | Arabic Letter Dal With Ring
Pashto |
| U+068A | ڊ | Arabic Letter Dal With Dot Below
Sindhi, early Persian, Pegon, Malagasy |
| U+068B | ڋ | Arabic Letter Dal With Dot Below And Small Tah
Lahnda |
| U+068C | ڌ | Arabic Letter Dahal
Sindhi |
| U+068D | ڍ | Arabic Letter Ddahal
Sindhi |
| U+068E | ڎ | Arabic Letter Dul
older shape for DUL, now obsolete in Sindhi; Burushaski |
| U+068F | ڏ | Arabic Letter Dal With Three Dots Above Downwards
Sindhi; current shape used for DUL |
| U+0690 | ڐ | Arabic Letter Dal With Four Dots Above
Old Urdu, not in current use |
| U+0691 | ڑ | Arabic Letter Rreh
Urdu |
| U+0692 | ڒ | Arabic Letter Reh With Small V
Kurdish |
| U+0693 | ړ | Arabic Letter Reh With Ring
Pashto |
| U+0694 | ڔ | Arabic Letter Reh With Dot Below
Kurdish, early Persian |
| U+0695 | ڕ | Arabic Letter Reh With Small V Below
Kurdish |
| U+0696 | ږ | Arabic Letter Reh With Dot Below And Dot Above
Pashto |
| U+0697 | ڗ | Arabic Letter Reh With Two Dots Above
Dargwa |
| U+0698 | ژ | Arabic Letter Jeh
Persian, Urdu, ... |
| U+0699 | ڙ | Arabic Letter Reh With Four Dots Above
Sindhi |
| U+069A | ښ | Arabic Letter Seen With Dot Below And Dot Above
Pashto |
| U+069B | ڛ | Arabic Letter Seen With Three Dots Below
early Persian |
| U+069C | ڜ | Arabic Letter Seen With Three Dots Below And Three Dots Above
Moroccan Arabic |
| U+069D | ڝ | Arabic Letter Sad With Two Dots Below
Turkic |
| U+069E | ڞ | Arabic Letter Sad With Three Dots Above
Berber, Burushaski |
| U+069F | ڟ | Arabic Letter Tah With Three Dots Above
Old Hausa |
| U+06A0 | ڠ | Arabic Letter Ain With Three Dots Above
Jawi |
| U+06A1 | ڡ | Arabic Letter Dotless Feh
Adighe |
| U+06A2 | ڢ | Arabic Letter Feh With Dot Moved Below
Maghrib Arabic |
| U+06A3 | ڣ | Arabic Letter Feh With Dot Below
Ingush |
| U+06A4 | ڤ | Arabic Letter Veh
Middle Eastern Arabic for foreign words; Kurdish, Khwarazmian, early Persian, Jawi |
| U+06A5 | ڥ | Arabic Letter Feh With Three Dots Below
North African Arabic for foreign words |
| U+06A6 | ڦ | Arabic Letter Peheh
Sindhi |
| U+06A7 | ڧ | Arabic Letter Qaf With Dot Above
Maghrib Arabic, Uyghur |
| U+06A8 | ڨ | Arabic Letter Qaf With Three Dots Above
Tunisian and Algerian Arabic |
| U+06A9 | ک | Arabic Letter Keheh
Persian, Urdu, Sindhi, ...; = kaf mashkula |
| U+06AA | ڪ | Arabic Letter Swash Kaf
represents a letter distinct from Arabic KAF (0643) in Sindhi |
| U+06AB | ګ | Arabic Letter Kaf With Ring
Pashto; may appear like an Arabic KAF (0643) with a ring below the base |
| U+06AC | ڬ | Arabic Letter Kaf With Dot Above
use for the Jawi gaf is not recommended, although it may be found in some existing text data; recommended character for Jawi gaf is 0762 → U+0762 ݢ Arabic Letter Keheh With Dot Above |
| U+06AD | ڭ | Arabic Letter Ng
Uyghur, Kazakh, Moroccan Arabic, early Jawi, early Persian, ... |
| U+06AE | ڮ | Arabic Letter Kaf With Three Dots Below
Berber, early Persian; Pegon alternative for 08B4 |
| U+06AF | گ | Arabic Letter Gaf
Persian, Urdu, ... |
| U+06B0 | ڰ | Arabic Letter Gaf With Ring
Lahnda |
| U+06B1 | ڱ | Arabic Letter Ngoeh
Sindhi |
| U+06B2 | ڲ | Arabic Letter Gaf With Two Dots Below
not used in Sindhi |
| U+06B3 | ڳ | Arabic Letter Gueh
Sindhi, Saraiki |
| U+06B4 | ڴ | Arabic Letter Gaf With Three Dots Above
not used in Sindhi, Karakalpak |
| U+06B5 | ڵ | Arabic Letter Lam With Small V
Kurdish, historically Bosnian |
| U+06B6 | ڶ | Arabic Letter Lam With Dot Above
Kurdish |
| U+06B7 | ڷ | Arabic Letter Lam With Three Dots Above
Kurdish |
| U+06B8 | ڸ | Arabic Letter Lam With Three Dots Below
Avar, Soqotri |
| U+06B9 | ڹ | Arabic Letter Noon With Dot Below |
| U+06BA | ں | Arabic Letter Noon Ghunna
Urdu, archaic Arabic; dotless in all four contextual forms |
| U+06BB | ڻ | Arabic Letter Rnoon
dotless in all four contextual forms; Sindhi |
| U+06BC | ڼ | Arabic Letter Noon With Ring
Pashto |
| U+06BD | ڽ | Arabic Letter Noon With Three Dots Above
Jawi |
| U+06BE | ھ | Arabic Letter Heh Doachashmee
forms aspirate digraphs in Urdu and other languages of South Asia; represents the glottal fricative /h/ in Uyghur |
| U+06BF | ڿ | Arabic Letter Tcheh With Dot Above |
| U+06C0 | ۀ | Arabic Letter Heh With Yeh Above
for ezafe, use 0654 over the language-appropriate base letter; actually a ligature, not an independent letter; Arabic letter hamzah on ha (1.0); ≡ ۀ U+06D5 U+0654 |
| U+06C1 | ہ | Arabic Letter Heh Goal
Urdu |
| U+06C2 | ۂ | Arabic Letter Heh Goal With Hamza Above
Urdu; actually a ligature, not an independent letter; ≡ ۂ U+06C1 U+0654 |
| U+06C3 | ۃ | Arabic Letter Teh Marbuta Goal
Urdu |
| U+06C4 | ۄ | Arabic Letter Waw With Ring
Kashmiri |
| U+06C5 | ۅ | Arabic Letter Kirghiz Oe
Kyrgyz; a glyph variant occurs which replaces the looped tail with a horizontal bar through the tail |
| U+06C6 | ۆ | Arabic Letter Oe
Uyghur, Kurdish, Kazakh, Azerbaijani, historically Bosnian |
| U+06C7 | ۇ | Arabic Letter U
Azerbaijani, Kazakh, Kyrgyz, Uyghur |
| U+06C8 | ۈ | Arabic Letter Yu
Uyghur |
| U+06C9 | ۉ | Arabic Letter Kirghiz Yu
Kazakh, Kyrgyz, historically Bosnian |
| U+06CA | ۊ | Arabic Letter Waw With Two Dots Above
Kurdish |
| U+06CB | ۋ | Arabic Letter Ve
Uyghur, Kazakh |
| U+06CC | ی | Arabic Letter Farsi Yeh
Arabic, Persian, Urdu, Kashmiri, ...; initial and medial forms of this letter have dots → U+0649 ى Arabic Letter Alef Maksura → U+064A ي Arabic Letter Yeh |
| U+06CD | ۍ | Arabic Letter Yeh With Tail
Pashto, Sindhi |
| U+06CE | ێ | Arabic Letter Yeh With Small V
Kurdish |
| U+06CF | ۏ | Arabic Letter Waw With Dot Above
Jawi |
| U+06D0 | ې | Arabic Letter E
Pashto, Uyghur; used as the letter bbeh in Sindhi |
| U+06D1 | ۑ | Arabic Letter Yeh With Three Dots Below
Mende languages, Hausa |
| U+06D2 | ے | Arabic Letter Yeh Barree
Urdu |
| U+06D3 | ۓ | Arabic Letter Yeh Barree With Hamza Above
Urdu |
| U+06D4 | ۔ | Arabic Full Stop
Urdu |
| U+06D5 | ە | Arabic Letter Ae
Uyghur, Kazakh, Kyrgyz |
| U+06D6 | ۖ | Arabic Small High Ligature Sad With Lam With Alef Maksura |
| U+06D7 | ۗ | Arabic Small High Ligature Qaf With Lam With Alef Maksura |
| U+06D8 | ۘ | Arabic Small High Meem Initial Form |
| U+06D9 | ۙ | Arabic Small High Lam Alef |
| U+06DA | ۚ | Arabic Small High Jeem |
| U+06DB | ۛ | Arabic Small High Three Dots |
| U+06DC | ۜ | Arabic Small High Seen |
| U+06DD | | Arabic End Of Ayah |
| U+06DE | ۞ | Arabic Start Of Rub El Hizb |
| U+06DF | ۟ | Arabic Small High Rounded Zero
smaller than the typical circular shape used for 0652 |
| U+06E0 | ۠ | Arabic Small High Upright Rectangular Zero
the term "rectangular zero" is a translation of the Arabic name of this sign |
| U+06E1 | ۡ | Arabic Small High Dotless Head Of Khah
presentation form of 0652, using font technology to select the variant is preferred; used in some Qurans to mark absence of a vowel; = Arabic jazm → U+0652 ْ Arabic Sukun |
| U+06E2 | ۢ | Arabic Small High Meem Isolated Form |
| U+06E3 | ۣ | Arabic Small Low Seen |
| U+06E4 | ۤ | Arabic Small High Madda
typically used with 06E5, 06E6, 06E7, and 08F3 |
| U+06E5 | ۥ | Arabic Small Waw
→ U+08D3 ࣓ Arabic Small Low Waw → U+08F3 ࣳ Arabic Small High Waw |
| U+06E6 | ۦ | Arabic Small Yeh |
| U+06E7 | ۧ | Arabic Small High Yeh |
| U+06E8 | ۨ | Arabic Small High Noon |
| U+06E9 | ۩ | Arabic Place Of Sajdah
there is a range of acceptable glyphs for this character |
| U+06EA | ۪ | Arabic Empty Centre Low Stop |
| U+06EB | ۫ | Arabic Empty Centre High Stop |
| U+06EC | ۬ | Arabic Rounded High Stop With Filled Centre
also used in Quranic text in African and other orthographies to represent wasla, ikhtilas, etc. |
| U+06ED | ۭ | Arabic Small Low Meem |
| U+06EE | ۮ | Arabic Letter Dal With Inverted V |
| U+06EF | ۯ | Arabic Letter Reh With Inverted V
also used in early Persian |
| U+06F0 | ۰ | Extended Arabic-Indic Digit Zero |
| U+06F1 | ۱ | Extended Arabic-Indic Digit One |
| U+06F2 | ۲ | Extended Arabic-Indic Digit Two |
| U+06F3 | ۳ | Extended Arabic-Indic Digit Three |
| U+06F4 | ۴ | Extended Arabic-Indic Digit Four
Persian has a different glyph than Sindhi and Urdu |
| U+06F5 | ۵ | Extended Arabic-Indic Digit Five
Persian, Sindhi, and Urdu share glyph different from Arabic |
| U+06F6 | ۶ | Extended Arabic-Indic Digit Six
Persian, Sindhi, and Urdu have glyphs different from Arabic |
| U+06F7 | ۷ | Extended Arabic-Indic Digit Seven
Urdu and Sindhi have glyphs different from Arabic |
| U+06F8 | ۸ | Extended Arabic-Indic Digit Eight |
| U+06F9 | ۹ | Extended Arabic-Indic Digit Nine |
| U+06FA | ۺ | Arabic Letter Sheen With Dot Below |
| U+06FB | ۻ | Arabic Letter Dad With Dot Below |
| U+06FC | ۼ | Arabic Letter Ghain With Dot Below |
| U+06FD | ۽ | Arabic Sign Sindhi Ampersand |
| U+06FE | ۾ | Arabic Sign Sindhi Postposition Men |
| U+06FF | ۿ | Arabic Letter Heh With Inverted V |
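Rows marked with ≡ in the table above are canonically equivalent to the listed base letter plus combining mark. The sketch below checks this with Python's standard `unicodedata` module: the pairs are copied from the table, and `same_text` is a hypothetical helper showing why comparisons are usually made after normalization.

```python
import unicodedata

# Precomposed letter -> combining sequence, as marked with ≡ in the table.
CANONICAL_PAIRS = {
    "\u0622": "\u0627\u0653",  # alef with madda above
    "\u0623": "\u0627\u0654",  # alef with hamza above
    "\u0624": "\u0648\u0654",  # waw with hamza above
    "\u0625": "\u0627\u0655",  # alef with hamza below
    "\u0626": "\u064A\u0654",  # yeh with hamza above
    "\u06C0": "\u06D5\u0654",  # heh with yeh above
    "\u06C2": "\u06C1\u0654",  # heh goal with hamza above
}


def same_text(a, b):
    """Compare two strings after canonical normalization (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    for composed, sequence in CANONICAL_PAIRS.items():
        # A raw comparison differs, a normalized comparison does not.
        print(f"U+{ord(composed):04X}", composed == sequence,
              same_text(composed, sequence))
```

A plain `==` treats the precomposed letter and the two-code-point sequence as different strings, which is why text is typically normalized to NFC or NFD before comparison or storage.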
Compact table
| Arabic[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+060x | | | | | | | ؆ | ؇ | ؈ | ؉ | ؊ | ؋ | ، | ؍ | ؎ | ؏ |
| U+061x | ؐ | ؑ | ؒ | ؓ | ؔ | ؕ | ؖ | ؗ | ؘ | ؙ | ؚ | ؛ | ALM | ؝ | ؞ | ؟ |
| U+062x | ؠ | ء | آ | أ | ؤ | إ | ئ | ا | ب | ة | ت | ث | ج | ح | خ | د |
| U+063x | ذ | ر | ز | س | ش | ص | ض | ط | ظ | ع | غ | ػ | ؼ | ؽ | ؾ | ؿ |
| U+064x | ـ | ف | ق | ك | ل | م | ن | ه | و | ى | ي | ً | ٌ | ٍ | َ | ُ |
| U+065x | ِ | ّ | ْ | ٓ | ٔ | ٕ | ٖ | ٗ | ٘ | ٙ | ٚ | ٛ | ٜ | ٝ | ٞ | ٟ |
| U+066x | ٠ | ١ | ٢ | ٣ | ٤ | ٥ | ٦ | ٧ | ٨ | ٩ | ٪ | ٫ | ٬ | ٭ | ٮ | ٯ |
| U+067x | ٰ | ٱ | ٲ | ٳ | ٴ | ٵ | ٶ | ٷ | ٸ | ٹ | ٺ | ٻ | ټ | ٽ | پ | ٿ |
| U+068x | ڀ | ځ | ڂ | ڃ | ڄ | څ | چ | ڇ | ڈ | ډ | ڊ | ڋ | ڌ | ڍ | ڎ | ڏ |
| U+069x | ڐ | ڑ | ڒ | ړ | ڔ | ڕ | ږ | ڗ | ژ | ڙ | ښ | ڛ | ڜ | ڝ | ڞ | ڟ |
| U+06Ax | ڠ | ڡ | ڢ | ڣ | ڤ | ڥ | ڦ | ڧ | ڨ | ک | ڪ | ګ | ڬ | ڭ | ڮ | گ |
| U+06Bx | ڰ | ڱ | ڲ | ڳ | ڴ | ڵ | ڶ | ڷ | ڸ | ڹ | ں | ڻ | ڼ | ڽ | ھ | ڿ |
| U+06Cx | ۀ | ہ | ۂ | ۃ | ۄ | ۅ | ۆ | ۇ | ۈ | ۉ | ۊ | ۋ | ی | ۍ | ێ | ۏ |
| U+06Dx | ې | ۑ | ے | ۓ | ۔ | ە | ۖ | ۗ | ۘ | ۙ | ۚ | ۛ | ۜ | | ۞ | ۟ |
| U+06Ex | ۠ | ۡ | ۢ | ۣ | ۤ | ۥ | ۦ | ۧ | ۨ | ۩ | ۪ | ۫ | ۬ | ۭ | ۮ | ۯ |
| U+06Fx | ۰ | ۱ | ۲ | ۳ | ۴ | ۵ | ۶ | ۷ | ۸ | ۹ | ۺ | ۻ | ۼ | ۽ | ۾ | ۿ |
| Notes | ||||||||||||||||
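A compact grid like the one above, with one row per 16 code points and one column per final hexadecimal digit, can be generated mechanically. The following is a sketch only: `compact_rows` and `to_markdown` are hypothetical helpers, and a cell is left empty whenever the Unicode database bundled with the running Python has no name for the code point, so the output can differ from the official chart for newer Unicode versions.

```python
import unicodedata


def compact_rows(first, last):
    """Yield (row label, 16 cells) covering the code points first..last."""
    for row_start in range(first & ~0xF, last + 1, 16):
        # 0x0600 -> "U+060x", 0x10E60 -> "U+10E6x"
        label = f"U+{row_start >> 4:03X}x"
        cells = []
        for cp in range(row_start, row_start + 16):
            ch = chr(cp)
            # Unnamed code points (unassigned here) become empty cells.
            cells.append(ch if unicodedata.name(ch, "") else "")
        yield label, cells


def to_markdown(first, last):
    """Render the grid as pipe-delimited rows with a hex-digit header."""
    lines = ["| | " + " | ".join("0123456789ABCDEF") + " |"]
    for label, cells in compact_rows(first, last):
        lines.append(f"| {label} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_markdown(0x0600, 0x06FF))
```

Format characters such as U+0600 have names and are therefore emitted, although they may render as nothing visible in the resulting cell.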
Arabic Supplement
| Arabic Supplement[1] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+075x | ݐ | ݑ | ݒ | ݓ | ݔ | ݕ | ݖ | ݗ | ݘ | ݙ | ݚ | ݛ | ݜ | ݝ | ݞ | ݟ |
| U+076x | ݠ | ݡ | ݢ | ݣ | ݤ | ݥ | ݦ | ݧ | ݨ | ݩ | ݪ | ݫ | ݬ | ݭ | ݮ | ݯ |
| U+077x | ݰ | ݱ | ݲ | ݳ | ݴ | ݵ | ݶ | ݷ | ݸ | ݹ | ݺ | ݻ | ݼ | ݽ | ݾ | ݿ |
Notes
| ||||||||||||||||
Arabic Extended-B
| Arabic Extended-B[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+087x | ࡰ | ࡱ | ࡲ | ࡳ | ࡴ | ࡵ | ࡶ | ࡷ | ࡸ | ࡹ | ࡺ | ࡻ | ࡼ | ࡽ | ࡾ | ࡿ |
| U+088x | ࢀ | ࢁ | ࢂ | ࢃ | ࢄ | ࢅ | ࢆ | ࢇ | ࢈ | ࢉ | ࢊ | ࢋ | ࢌ | ࢍ | ࢎ | |
| U+089x | | | | ࢘ | ࢙ | ࢚ | ࢛ | ࢜ | ࢝ | ࢞ | ࢟ | |||||
| Notes | ||||||||||||||||
Arabic Extended-A
| Arabic Extended-A[1] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+08Ax | ࢠ | ࢡ | ࢢ | ࢣ | ࢤ | ࢥ | ࢦ | ࢧ | ࢨ | ࢩ | ࢪ | ࢫ | ࢬ | ࢭ | ࢮ | ࢯ |
| U+08Bx | ࢰ | ࢱ | ࢲ | ࢳ | ࢴ | ࢵ | ࢶ | ࢷ | ࢸ | ࢹ | ࢺ | ࢻ | ࢼ | ࢽ | ࢾ | ࢿ |
| U+08Cx | ࣀ | ࣁ | ࣂ | ࣃ | ࣄ | ࣅ | ࣆ | ࣇ | ࣈ | ࣉ | ࣊ | ࣋ | ࣌ | ࣍ | ࣎ | ࣏ |
| U+08Dx | ࣐ | ࣑ | ࣒ | ࣓ | ࣔ | ࣕ | ࣖ | ࣗ | ࣘ | ࣙ | ࣚ | ࣛ | ࣜ | ࣝ | ࣞ | ࣟ |
| U+08Ex | ࣠ | ࣡ | | ࣣ | ࣤ | ࣥ | ࣦ | ࣧ | ࣨ | ࣩ | ࣪ | ࣫ | ࣬ | ࣭ | ࣮ | ࣯ |
| U+08Fx | ࣰ | ࣱ | ࣲ | ࣳ | ࣴ | ࣵ | ࣶ | ࣷ | ࣸ | ࣹ | ࣺ | ࣻ | ࣼ | ࣽ | ࣾ | ࣿ |
Notes
| ||||||||||||||||
Arabic Presentation Forms-A
Most of these characters are ligatures that can be composed from the characters in the preceding charts; the bracket-like graphemes ﴾ ﴿ are an exception. Some are ligatures of common liturgical phrases.
| Arabic Presentation Forms-A[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+FB5x | ﭐ | ﭑ | ﭒ | ﭓ | ﭔ | ﭕ | ﭖ | ﭗ | ﭘ | ﭙ | ﭚ | ﭛ | ﭜ | ﭝ | ﭞ | ﭟ |
| U+FB6x | ﭠ | ﭡ | ﭢ | ﭣ | ﭤ | ﭥ | ﭦ | ﭧ | ﭨ | ﭩ | ﭪ | ﭫ | ﭬ | ﭭ | ﭮ | ﭯ |
| U+FB7x | ﭰ | ﭱ | ﭲ | ﭳ | ﭴ | ﭵ | ﭶ | ﭷ | ﭸ | ﭹ | ﭺ | ﭻ | ﭼ | ﭽ | ﭾ | ﭿ |
| U+FB8x | ﮀ | ﮁ | ﮂ | ﮃ | ﮄ | ﮅ | ﮆ | ﮇ | ﮈ | ﮉ | ﮊ | ﮋ | ﮌ | ﮍ | ﮎ | ﮏ |
| U+FB9x | ﮐ | ﮑ | ﮒ | ﮓ | ﮔ | ﮕ | ﮖ | ﮗ | ﮘ | ﮙ | ﮚ | ﮛ | ﮜ | ﮝ | ﮞ | ﮟ |
| U+FBAx | ﮠ | ﮡ | ﮢ | ﮣ | ﮤ | ﮥ | ﮦ | ﮧ | ﮨ | ﮩ | ﮪ | ﮫ | ﮬ | ﮭ | ﮮ | ﮯ |
| U+FBBx | ﮰ | ﮱ | ﮲ | ﮳ | ﮴ | ﮵ | ﮶ | ﮷ | ﮸ | ﮹ | ﮺ | ﮻ | ﮼ | ﮽ | ﮾ | ﮿ |
| U+FBCx | ﯀ | ﯁ | ﯂ | | | | | | | | | | | | | |
| U+FBDx | | | | ﯓ | ﯔ | ﯕ | ﯖ | ﯗ | ﯘ | ﯙ | ﯚ | ﯛ | ﯜ | ﯝ | ﯞ | ﯟ |
| U+FBEx | ﯠ | ﯡ | ﯢ | ﯣ | ﯤ | ﯥ | ﯦ | ﯧ | ﯨ | ﯩ | ﯪ | ﯫ | ﯬ | ﯭ | ﯮ | ﯯ |
| U+FBFx | ﯰ | ﯱ | ﯲ | ﯳ | ﯴ | ﯵ | ﯶ | ﯷ | ﯸ | ﯹ | ﯺ | ﯻ | ﯼ | ﯽ | ﯾ | ﯿ |
| U+FC0x | ﰀ | ﰁ | ﰂ | ﰃ | ﰄ | ﰅ | ﰆ | ﰇ | ﰈ | ﰉ | ﰊ | ﰋ | ﰌ | ﰍ | ﰎ | ﰏ |
| U+FC1x | ﰐ | ﰑ | ﰒ | ﰓ | ﰔ | ﰕ | ﰖ | ﰗ | ﰘ | ﰙ | ﰚ | ﰛ | ﰜ | ﰝ | ﰞ | ﰟ |
| U+FC2x | ﰠ | ﰡ | ﰢ | ﰣ | ﰤ | ﰥ | ﰦ | ﰧ | ﰨ | ﰩ | ﰪ | ﰫ | ﰬ | ﰭ | ﰮ | ﰯ |
| U+FC3x | ﰰ | ﰱ | ﰲ | ﰳ | ﰴ | ﰵ | ﰶ | ﰷ | ﰸ | ﰹ | ﰺ | ﰻ | ﰼ | ﰽ | ﰾ | ﰿ |
| U+FC4x | ﱀ | ﱁ | ﱂ | ﱃ | ﱄ | ﱅ | ﱆ | ﱇ | ﱈ | ﱉ | ﱊ | ﱋ | ﱌ | ﱍ | ﱎ | ﱏ |
| U+FC5x | ﱐ | ﱑ | ﱒ | ﱓ | ﱔ | ﱕ | ﱖ | ﱗ | ﱘ | ﱙ | ﱚ | ﱛ | ﱜ | ﱝ | ﱞ | ﱟ |
| U+FC6x | ﱠ | ﱡ | ﱢ | ﱣ | ﱤ | ﱥ | ﱦ | ﱧ | ﱨ | ﱩ | ﱪ | ﱫ | ﱬ | ﱭ | ﱮ | ﱯ |
| U+FC7x | ﱰ | ﱱ | ﱲ | ﱳ | ﱴ | ﱵ | ﱶ | ﱷ | ﱸ | ﱹ | ﱺ | ﱻ | ﱼ | ﱽ | ﱾ | ﱿ |
| U+FC8x | ﲀ | ﲁ | ﲂ | ﲃ | ﲄ | ﲅ | ﲆ | ﲇ | ﲈ | ﲉ | ﲊ | ﲋ | ﲌ | ﲍ | ﲎ | ﲏ |
| U+FC9x | ﲐ | ﲑ | ﲒ | ﲓ | ﲔ | ﲕ | ﲖ | ﲗ | ﲘ | ﲙ | ﲚ | ﲛ | ﲜ | ﲝ | ﲞ | ﲟ |
| U+FCAx | ﲠ | ﲡ | ﲢ | ﲣ | ﲤ | ﲥ | ﲦ | ﲧ | ﲨ | ﲩ | ﲪ | ﲫ | ﲬ | ﲭ | ﲮ | ﲯ |
| U+FCBx | ﲰ | ﲱ | ﲲ | ﲳ | ﲴ | ﲵ | ﲶ | ﲷ | ﲸ | ﲹ | ﲺ | ﲻ | ﲼ | ﲽ | ﲾ | ﲿ |
| U+FCCx | ﳀ | ﳁ | ﳂ | ﳃ | ﳄ | ﳅ | ﳆ | ﳇ | ﳈ | ﳉ | ﳊ | ﳋ | ﳌ | ﳍ | ﳎ | ﳏ |
| U+FCDx | ﳐ | ﳑ | ﳒ | ﳓ | ﳔ | ﳕ | ﳖ | ﳗ | ﳘ | ﳙ | ﳚ | ﳛ | ﳜ | ﳝ | ﳞ | ﳟ |
| U+FCEx | ﳠ | ﳡ | ﳢ | ﳣ | ﳤ | ﳥ | ﳦ | ﳧ | ﳨ | ﳩ | ﳪ | ﳫ | ﳬ | ﳭ | ﳮ | ﳯ |
| U+FCFx | ﳰ | ﳱ | ﳲ | ﳳ | ﳴ | ﳵ | ﳶ | ﳷ | ﳸ | ﳹ | ﳺ | ﳻ | ﳼ | ﳽ | ﳾ | ﳿ |
| U+FD0x | ﴀ | ﴁ | ﴂ | ﴃ | ﴄ | ﴅ | ﴆ | ﴇ | ﴈ | ﴉ | ﴊ | ﴋ | ﴌ | ﴍ | ﴎ | ﴏ |
| U+FD1x | ﴐ | ﴑ | ﴒ | ﴓ | ﴔ | ﴕ | ﴖ | ﴗ | ﴘ | ﴙ | ﴚ | ﴛ | ﴜ | ﴝ | ﴞ | ﴟ |
| U+FD2x | ﴠ | ﴡ | ﴢ | ﴣ | ﴤ | ﴥ | ﴦ | ﴧ | ﴨ | ﴩ | ﴪ | ﴫ | ﴬ | ﴭ | ﴮ | ﴯ |
| U+FD3x | ﴰ | ﴱ | ﴲ | ﴳ | ﴴ | ﴵ | ﴶ | ﴷ | ﴸ | ﴹ | ﴺ | ﴻ | ﴼ | ﴽ | ﴾ | ﴿ |
| U+FD4x | ﵀ | ﵁ | ﵂ | ﵃ | ﵄ | ﵅ | ﵆ | ﵇ | ﵈ | ﵉ | ﵊ | ﵋ | ﵌ | ﵍ | ﵎ | ﵏ |
| U+FD5x | ﵐ | ﵑ | ﵒ | ﵓ | ﵔ | ﵕ | ﵖ | ﵗ | ﵘ | ﵙ | ﵚ | ﵛ | ﵜ | ﵝ | ﵞ | ﵟ |
| U+FD6x | ﵠ | ﵡ | ﵢ | ﵣ | ﵤ | ﵥ | ﵦ | ﵧ | ﵨ | ﵩ | ﵪ | ﵫ | ﵬ | ﵭ | ﵮ | ﵯ |
| U+FD7x | ﵰ | ﵱ | ﵲ | ﵳ | ﵴ | ﵵ | ﵶ | ﵷ | ﵸ | ﵹ | ﵺ | ﵻ | ﵼ | ﵽ | ﵾ | ﵿ |
| U+FD8x | ﶀ | ﶁ | ﶂ | ﶃ | ﶄ | ﶅ | ﶆ | ﶇ | ﶈ | ﶉ | ﶊ | ﶋ | ﶌ | ﶍ | ﶎ | ﶏ |
| U+FD9x | | | ﶒ | ﶓ | ﶔ | ﶕ | ﶖ | ﶗ | ﶘ | ﶙ | ﶚ | ﶛ | ﶜ | ﶝ | ﶞ | ﶟ |
| U+FDAx | ﶠ | ﶡ | ﶢ | ﶣ | ﶤ | ﶥ | ﶦ | ﶧ | ﶨ | ﶩ | ﶪ | ﶫ | ﶬ | ﶭ | ﶮ | ﶯ |
| U+FDBx | ﶰ | ﶱ | ﶲ | ﶳ | ﶴ | ﶵ | ﶶ | ﶷ | ﶸ | ﶹ | ﶺ | ﶻ | ﶼ | ﶽ | ﶾ | ﶿ |
| U+FDCx | ﷀ | ﷁ | ﷂ | ﷃ | ﷄ | ﷅ | ﷆ | ﷇ | | | | | | | | ﷏ |
| U+FDDx | ||||||||||||||||
| U+FDEx | ||||||||||||||||
| U+FDFx | ﷰ | ﷱ | ﷲ | ﷳ | ﷴ | ﷵ | ﷶ | ﷷ | ﷸ | ﷹ | ﷺ | ﷻ | ﷼ | ﷽ | ﷾ | ﷿ |
Notes
| ||||||||||||||||
Arabic Presentation Forms-B
These can all be created from the basic chart's characters.
| Arabic Presentation Forms-B[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+FE7x | ﹰ | ﹱ | ﹲ | ﹳ | ﹴ | ﹶ | ﹷ | ﹸ | ﹹ | ﹺ | ﹻ | ﹼ | ﹽ | ﹾ | ﹿ | |
| U+FE8x | ﺀ | ﺁ | ﺂ | ﺃ | ﺄ | ﺅ | ﺆ | ﺇ | ﺈ | ﺉ | ﺊ | ﺋ | ﺌ | ﺍ | ﺎ | ﺏ |
| U+FE9x | ﺐ | ﺑ | ﺒ | ﺓ | ﺔ | ﺕ | ﺖ | ﺗ | ﺘ | ﺙ | ﺚ | ﺛ | ﺜ | ﺝ | ﺞ | ﺟ |
| U+FEAx | ﺠ | ﺡ | ﺢ | ﺣ | ﺤ | ﺥ | ﺦ | ﺧ | ﺨ | ﺩ | ﺪ | ﺫ | ﺬ | ﺭ | ﺮ | ﺯ |
| U+FEBx | ﺰ | ﺱ | ﺲ | ﺳ | ﺴ | ﺵ | ﺶ | ﺷ | ﺸ | ﺹ | ﺺ | ﺻ | ﺼ | ﺽ | ﺾ | ﺿ |
| U+FECx | ﻀ | ﻁ | ﻂ | ﻃ | ﻄ | ﻅ | ﻆ | ﻇ | ﻈ | ﻉ | ﻊ | ﻋ | ﻌ | ﻍ | ﻎ | ﻏ |
| U+FEDx | ﻐ | ﻑ | ﻒ | ﻓ | ﻔ | ﻕ | ﻖ | ﻗ | ﻘ | ﻙ | ﻚ | ﻛ | ﻜ | ﻝ | ﻞ | ﻟ |
| U+FEEx | ﻠ | ﻡ | ﻢ | ﻣ | ﻤ | ﻥ | ﻦ | ﻧ | ﻨ | ﻩ | ﻪ | ﻫ | ﻬ | ﻭ | ﻮ | ﻯ |
| U+FEFx | ﻰ | ﻱ | ﻲ | ﻳ | ﻴ | ﻵ | ﻶ | ﻷ | ﻸ | ﻹ | ﻺ | ﻻ | ﻼ | ZW NBSP | ||
| Notes | ||||||||||||||||
Rumi Numeral Symbols
| Rumi Numeral Symbols[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+10E6x | 𐹠 | 𐹡 | 𐹢 | 𐹣 | 𐹤 | 𐹥 | 𐹦 | 𐹧 | 𐹨 | 𐹩 | 𐹪 | 𐹫 | 𐹬 | 𐹭 | 𐹮 | 𐹯 |
| U+10E7x | 𐹰 | 𐹱 | 𐹲 | 𐹳 | 𐹴 | 𐹵 | 𐹶 | 𐹷 | 𐹸 | 𐹹 | 𐹺 | 𐹻 | 𐹼 | 𐹽 | 𐹾 | |
| Notes | ||||||||||||||||
Arabic Extended-C
| Arabic Extended-C[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+10ECx | | | | | | | ||||||||||
| U+10EDx | | | | | | | | | | |||||||
| U+10EEx | ||||||||||||||||
| U+10EFx | | | | 𐻽 | 𐻾 | 𐻿 | ||||||||||
| Notes | ||||||||||||||||
Indic Siyaq Numbers
| Indic Siyaq Numbers[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+1EC7x | 𞱱 | 𞱲 | 𞱳 | 𞱴 | 𞱵 | 𞱶 | 𞱷 | 𞱸 | 𞱹 | 𞱺 | 𞱻 | 𞱼 | 𞱽 | 𞱾 | 𞱿 | |
| U+1EC8x | 𞲀 | 𞲁 | 𞲂 | 𞲃 | 𞲄 | 𞲅 | 𞲆 | 𞲇 | 𞲈 | 𞲉 | 𞲊 | 𞲋 | 𞲌 | 𞲍 | 𞲎 | 𞲏 |
| U+1EC9x | 𞲐 | 𞲑 | 𞲒 | 𞲓 | 𞲔 | 𞲕 | 𞲖 | 𞲗 | 𞲘 | 𞲙 | 𞲚 | 𞲛 | 𞲜 | 𞲝 | 𞲞 | 𞲟 |
| U+1ECAx | 𞲠 | 𞲡 | 𞲢 | 𞲣 | 𞲤 | 𞲥 | 𞲦 | 𞲧 | 𞲨 | 𞲩 | 𞲪 | 𞲫 | 𞲬 | 𞲭 | 𞲮 | 𞲯 |
| U+1ECBx | 𞲰 | 𞲱 | 𞲲 | 𞲳 | 𞲴 | |||||||||||
| Notes | ||||||||||||||||
Ottoman Siyaq Numbers
| Ottoman Siyaq Numbers[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+1ED0x | 𞴁 | 𞴂 | 𞴃 | 𞴄 | 𞴅 | 𞴆 | 𞴇 | 𞴈 | 𞴉 | 𞴊 | 𞴋 | 𞴌 | 𞴍 | 𞴎 | 𞴏 | |
| U+1ED1x | 𞴐 | 𞴑 | 𞴒 | 𞴓 | 𞴔 | 𞴕 | 𞴖 | 𞴗 | 𞴘 | 𞴙 | 𞴚 | 𞴛 | 𞴜 | 𞴝 | 𞴞 | 𞴟 |
| U+1ED2x | 𞴠 | 𞴡 | 𞴢 | 𞴣 | 𞴤 | 𞴥 | 𞴦 | 𞴧 | 𞴨 | 𞴩 | 𞴪 | 𞴫 | 𞴬 | 𞴭 | 𞴮 | 𞴯 |
| U+1ED3x | 𞴰 | 𞴱 | 𞴲 | 𞴳 | 𞴴 | 𞴵 | 𞴶 | 𞴷 | 𞴸 | 𞴹 | 𞴺 | 𞴻 | 𞴼 | 𞴽 | ||
| U+1ED4x | ||||||||||||||||
| Notes | ||||||||||||||||
Arabic Mathematical Alphabetic Symbols
| Arabic Mathematical Alphabetic Symbols[1][2] Official Unicode Consortium code chart (PDF) | ||||||||||||||||
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
| U+1EE0x | 𞸀 | 𞸁 | 𞸂 | 𞸃 | 𞸅 | 𞸆 | 𞸇 | 𞸈 | 𞸉 | 𞸊 | 𞸋 | 𞸌 | 𞸍 | 𞸎 | 𞸏 | |
| U+1EE1x | 𞸐 | 𞸑 | 𞸒 | 𞸓 | 𞸔 | 𞸕 | 𞸖 | 𞸗 | 𞸘 | 𞸙 | 𞸚 | 𞸛 | 𞸜 | 𞸝 | 𞸞 | 𞸟 |
| U+1EE2x | 𞸡 | 𞸢 | 𞸤 | 𞸧 | 𞸩 | 𞸪 | 𞸫 | 𞸬 | 𞸭 | 𞸮 | 𞸯 | |||||
| U+1EE3x | 𞸰 | 𞸱 | 𞸲 | 𞸴 | 𞸵 | 𞸶 | 𞸷 | 𞸹 | 𞸻 | |||||||
| U+1EE4x | 𞹂 | 𞹇 | 𞹉 | 𞹋 | 𞹍 | 𞹎 | 𞹏 | |||||||||
| U+1EE5x | 𞹑 | 𞹒 | 𞹔 | 𞹗 | 𞹙 | 𞹛 | 𞹝 | 𞹟 | ||||||||
| U+1EE6x | 𞹡 | 𞹢 | 𞹤 | 𞹧 | 𞹨 | 𞹩 | 𞹪 | 𞹬 | 𞹭 | 𞹮 | 𞹯 | |||||
| U+1EE7x | 𞹰 | 𞹱 | 𞹲 | 𞹴 | 𞹵 | 𞹶 | 𞹷 | 𞹹 | 𞹺 | 𞹻 | 𞹼 | 𞹾 | ||||
| U+1EE8x | 𞺀 | 𞺁 | 𞺂 | 𞺃 | 𞺄 | 𞺅 | 𞺆 | 𞺇 | 𞺈 | 𞺉 | 𞺋 | 𞺌 | 𞺍 | 𞺎 | 𞺏 | |
| U+1EE9x | 𞺐 | 𞺑 | 𞺒 | 𞺓 | 𞺔 | 𞺕 | 𞺖 | 𞺗 | 𞺘 | 𞺙 | 𞺚 | 𞺛 | ||||
| U+1EEAx | 𞺡 | 𞺢 | 𞺣 | 𞺥 | 𞺦 | 𞺧 | 𞺨 | 𞺩 | 𞺫 | 𞺬 | 𞺭 | 𞺮 | 𞺯 | |||
| U+1EEBx | 𞺰 | 𞺱 | 𞺲 | 𞺳 | 𞺴 | 𞺵 | 𞺶 | 𞺷 | 𞺸 | 𞺹 | 𞺺 | 𞺻 | ||||
| U+1EECx | ||||||||||||||||
| U+1EEDx | ||||||||||||||||
| U+1EEEx | ||||||||||||||||
| U+1EEFx | 𞻰 | 𞻱 | ||||||||||||||
| Notes | ||||||||||||||||
References
- ^ "What is the origin of the ampersand (&)?"
- ^ unicode.org Biography: Thomas Milo - DecoType
- ^ "UAX #24: Script data file". Unicode Character Database. The Unicode Consortium.
- ^ a b "Arabic, Arabic Presentation Forms-B". The Unicode Standard. The Unicode Consortium. September 2025.
- ^ Pandey, Anshuman (2015-11-05). "L2/15-121R2: Proposal to Encode Indic Siyaq Numbers" (PDF).
- ^ a b "Chapter 22: Symbols". Unicode, Inc. September 2024.
- ^ Deprecated as of Unicode version 6.0 (UCD Change History): "The particular combination of an alef with this vowel mark should be written with the sequence <U+0627 ARABIC LETTER ALEF, U+065F ARABIC WAVY HAMZA BELOW>, rather than with the character U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW, which has been deprecated and which is not canonically equivalent." "Section 9.2: Arabic, Additional Vowel Marks". The Unicode Standard. The Unicode Consortium. September 2025.
External links
- Oibane. "Unicode problems". Arabic on Linux. Archived from the original on 2008-02-03.
- Arabunic. "Arabunic : unicode <-> glyphs, 2 way converter". Java applet that converts glyphs to Unicode (and Unicode to glyphs). It accounts for ligatures, lam-alif, diacritics, etc.
- Scheherazade or Scheherazade New, an extended Arabic script font designed by SIL International, distributed under the SIL Open Font License (OFL)
- Harmattan, an extended Arabic script font designed by SIL International for West Africa, distributed under the SIL Open Font License (OFL)
Arabic script in Unicode
Introduction
Overview
The Arabic script is a right-to-left abjad characterized by 28 basic letters that primarily represent consonants, with vowels often implied or indicated by optional diacritics. It accommodates variants for numerous languages beyond Arabic, including Persian (with four additional letters such as پ for /p/), Urdu (adding letters such as ڑ for retroflex sounds), and African languages such as Hausa, Fulfulde, and Wolof. The script's cursive nature requires letters to connect in specific ways depending on their position in a word, a feature that Unicode supports through rendering mechanisms rather than precomposed forms for all combinations.

Unicode adopts an atomic encoding model for the Arabic script, assigning unique code points to base letters while treating diacritics (such as the short vowel marks fatha, kasra, and damma) as independent combining characters that attach to the base. Contextual glyph shaping, which selects among a letter's four potential forms (initial, medial, final, or isolated), is performed by font rendering engines using the OpenType specification or similar technologies, ensuring flexibility across diverse writing systems.

In Unicode 17.0, the Arabic script encompasses over 1,400 characters distributed across 11 blocks, enabling comprehensive coverage for Modern Standard Arabic orthography, intricate Qur'anic annotations (including specialized tatweel and ornamentation), historical numeral systems like Eastern Arabic and Rumi variants, and mathematical symbols tailored for Arabic-based expressions. Bidirectional text processing is integral to this encoding, with the Unicode Bidirectional Algorithm (UBA) resolving the visual order of mixed right-to-left Arabic and left-to-right Latin scripts by analyzing embedding levels and character directions.

Notably, certain legacy characters have been deprecated to promote consistent usage; for instance, U+0673 (ARABIC LETTER ALEF WITH WAVY HAMZA BELOW) was deprecated in Unicode 6.0 in favor of the sequence <U+0627 ARABIC LETTER ALEF, U+065F ARABIC WAVY HAMZA BELOW>. This approach underscores Unicode's emphasis on stability and interoperability for Arabic digital representation.
Historical Development
The encoding of the Arabic script in Unicode began with the adoption of the ISO/IEC 8859-6 standard, published in 1987, which defined a single-byte character set for the basic Arabic letters, numerals, and diacritics.[7] This framework was incorporated directly into Unicode Version 1.0, released in October 1991, establishing the core Arabic block (U+0600–U+06FF) with 144 characters to support essential script elements while aligning with emerging international standards for multilingual text processing.[2] The inclusion addressed immediate needs for digital representation of Arabic in computing environments, drawing on the ISO standard's logical ordering and bidirectional properties.

Early expansions focused on rendering complexities inherent to the cursive Arabic script. Unicode Version 1.1 (June 1993) introduced the Arabic Presentation Forms-A (U+FB50–U+FDFF) and Arabic Presentation Forms-B (U+FE70–U+FEFF) blocks, encoding precomposed ligatures and contextual glyph variants to facilitate compatibility with legacy systems and initial text layout requirements. These additions mitigated challenges in displaying joined forms without advanced shaping engines, a key hurdle for the script's right-to-left, context-dependent behavior. Further growth came in Unicode 4.1 (March 2005), which added the Arabic Supplement block (U+0750–U+077F) for variant letters in non-Arabic languages, such as those in Central Asia and Africa, expanding the repertoire beyond classical Arabic orthography.

Subsequent versions emphasized support for regional and historical usages. Unicode 6.1 (January 2012) established the Arabic Extended-A block (U+08A0–U+08FF), incorporating characters for Qur'anic annotations, African language variants, and extensions like Berber script modifications, proposed in part by linguistic experts to preserve orthographic traditions.[3] Historical numeral systems were also integrated: the Rumi Numeral Symbols block (U+10E60–U+10E7F) arrived in Unicode 5.2 (October 2009), and Unicode 11.0 (June 2018) added Indic Siyaq Numbers (U+1EC70–U+1ECBF) for Mughal accounting practices in South Asia, addressing the need to encode non-positional, cursive numeral forms used in archival documents.

Later milestones tackled specialized domains and further extensions. Unicode 12.0 (March 2019) introduced the Ottoman Siyaq Numbers block (U+1ED00–U+1ED4F), extending Siyaq support to Turkish variants; the encoding model for complex joining in these historical numerals was analyzed in proposals evaluating Siyaq systems for plain-text representation.[8][9][10] The Arabic Mathematical Alphabetic Symbols block (U+1EE00–U+1EEFF), part of the standard since Unicode 6.1, provides styled letterforms (such as tailed, stretched, looped, and double-struck variants) for mathematical typography while avoiding shaping dependencies. Unicode 14.0 (September 2021) added Arabic Extended-B (U+0870–U+089F), a 48-code-point block for additional orthographies. Unicode 15.0 (September 2022) added Arabic Extended-C (U+10EC0–U+10EFF), including three Qur'anic annotation characters to refine pronunciation marking in religious texts.[11]

Key contributions came from industry and academic sources. Microsoft played a foundational role in early Unicode development, including Arabic glyph integration during Version 1.0 reviews, and later advanced shaping through OpenType specifications for bidirectional rendering.[12][13] SIL International submitted influential proposals, such as document L2/10-288R in 2010, advocating 35 characters for African (e.g., Fulah, Hausa) and Asian (e.g., Rohingya) languages, which informed additions in Extended-A and subsequent blocks such as Extended-B.[14]

Unicode 16.0 (September 2024) added further Arabic-script characters for early Persian and Azerbaijani orthographies, such as variant keheh forms. Unicode 17.0 (September 2025) added characters to Arabic Extended-C, including three Pegon letters (e.g., U+10EC2 ARABIC LETTER DAL WITH TWO DOTS VERTICALLY BELOW) for Javanese and Southeast Asian Islamic texts, as well as additional Qur'anic marks like U+10EFC ARABIC COMBINING ALEF OVERLAY (used in Libya) and U+10EFF ARABIC SMALL LOW WORD MADDA (used in Turkey).[15] These updates enhanced support for regional variants, including tone marks and overlays for non-Arabic phonologies. As the latest version as of November 2025, Unicode 17.0 encompasses over 1,400 Arabic-script code points across multiple blocks, with ongoing proposals targeting additional African and Southeast Asian variants to complete coverage of living orthographies.[16][17]
Encoding Fundamentals
Contextual Forms and Glyph Shaping
The Arabic script is cursive, meaning its letters change shape depending on their position within a word: isolated (standalone or non-joining contexts), initial (at the beginning of a word or after a letter that does not join forward), medial (in the middle, connecting to both preceding and following letters), and final (at the end of a word or before a non-joining letter).[1] For example, the letter beh (U+0628) appears as isolated ب, final ـب, initial بـ, or medial ـبـ, with the appropriate glyph selected dynamically during rendering rather than encoded as separate Unicode characters.[2] These four positional forms ensure the script's connected appearance while accommodating letters like alef (ا, U+0627), which joins only to the preceding letter and therefore takes only an isolated or final form.[1]

Unicode defines shaping through the Joining_Type and Joining_Group properties, with 22 joining groups for standard Arabic letters (such as ALEF, BEH, LAM, and MEEM) that determine compatibility for connections between adjacent characters.[18] Joining types include Dual_Joining (connects on both sides, e.g., beh), Right_Joining (connects only to the preceding letter, which sits on its right, e.g., alef and dal), Left_Joining (connects only to the following letter; no standard Arabic letter has this type), and Non_Joining (no connections, e.g., hamza); mandatory joining occurs between compatible adjacent letters, while optional joining may apply in stylistic contexts like certain ligatures.[1] These properties feed seven conformance rules (R1–R7) for determining forms: transparent characters (like diacritics) are ignored (R1), right-joining letters take final forms after a join-causing character (R2), and dual-joining letters adopt medial, final, or initial forms based on adjacent joiners (R4–R6), defaulting to isolated otherwise (R7).[1]

Rendering relies on layout engines that analyze Unicode code points and apply these rules dynamically, without precomposed positional glyphs in the core encoding; instead, base characters (U+0600–U+06FF) are substituted via OpenType font features in GSUB tables (e.g., 'init' for initial, 'medi' for medial) and positioned with GPOS (e.g., 'curs' for cursive attachments).[13] Libraries like HarfBuzz implement this process, processing text through stages of script analysis, glyph substitution, and positioning to produce the correct cursive flow. For instance, the Modern Standard Arabic word "kitab" (كتاب, logical order: U+0643 U+062A U+0627 U+0628) results in kaf taking an initial form (كـ, joining to the following teh), teh a medial form (ـتـ, joining to both kaf and alef), alef a final form (ـا, joining to teh but not to the letter after it), and beh an isolated form (ب, because alef does not join forward).[1][19]

Extended languages build on these behaviors with additional joining groups; for Persian, the farsi yeh (U+06CC) belongs to the FARSI_YEH group, enabling distinct medial and final forms (ـیـ and ـی) that differ from standard yeh (ـيـ and ـي).[18] In Urdu, similar extensions apply, with retroflex letters following the joining behavior of their base letters, supported through the same OpenType features but with language-specific tags (e.g., 'URD ' for Urdu) to select appropriate alternates.[13] This ensures compatibility across variants while maintaining Unicode's unified base encoding.[19]
Diacritics and Combining Characters
The Arabic script employs a system of diacritics known as harakat or tashkil to indicate short vowels and other phonetic features, which are encoded in Unicode primarily as combining characters that attach to base letters.[2] These marks are essential for precise pronunciation in educational texts, religious scriptures, and classical literature, where they overlay consonants without altering the letter's positional form. Basic diacritics include the fatha (U+064E َ, representing a short /a/ sound), damma (U+064F ُ, for /u/), kasra (U+0650 ِ, for /i/), shadda (U+0651 ّ, doubling the consonant), and sukun (U+0652 ْ, indicating no vowel).[2] Tanween forms, which denote indefinite nouns with nunation, are similarly encoded as combining marks: fathatan (U+064B ً), dammatan (U+064C ٌ), and kasratan (U+064D ٍ).[2]

Multiple diacritics can stack on a single base letter, following specific canonical combining classes (CCC) that determine their canonical order. Arabic diacritics are assigned CCC values such as 27 for tanween above (e.g., fathatan), 220 for marks below the base, and 230 for certain high annotations, with vowels typically in classes 30–32 and shadda in 33.[5] The Unicode Arabic Mark Transient Reordering Algorithm (AMTRA) reorders these marks from the inside out for rendering, positioning shadda closest to the base, followed by modifier marks and then vowels, to handle stacks such as a letter with shadda and fatha (e.g., <U+0628, U+0651, U+064E> for بَّ).[5] In intricate cases, such as overriding default ordering for Qur'anic variants, the combining grapheme joiner (CGJ, U+034F) may be inserted to preserve sequence integrity without visual effect.[5]

Extended diacritics support specialized orthographies and annotations, particularly in religious texts. Qur'anic annotations include the small high meem isolated form (U+06E2 ۢ, CCC=230, used for emphasis in recitation) and other signs like the small low seen (U+06E3 ۣ).[2] In Arabic Extended-A (U+08A0–U+08FF), additional marks cater to Warsh orthography variants prevalent in North and West African traditions, such as the Arabic sukun below (U+08D0 ࣐) and small low waw (U+08D3 ࣓), alongside annotations like the small high word sah (U+08CC ࣌, a pause sign).[3] These extended marks integrate with basic tashkil to denote regional pronunciations or scriptural nuances, such as in the Warsh reading of the Qur'an.[3]

Tashkil usage exemplifies the system's flexibility; for instance, the word "kitab" (book) is encoded as <U+0643, U+0650, U+062A, U+064E, U+0627, U+0628> (كِتَاب), where kasra and fatha provide vowel guidance.[5] Diacritics attach to base letters independently of their contextual shapes (initial, medial, final, or isolated), and rendering engines position them after shaping for accurate placement.[5] This separation ensures compatibility across fonts and layouts, though proper display requires support for Arabic-specific combining classes.[5]
Presentation and Compatibility Forms
The presentation and compatibility forms for the Arabic script in Unicode consist of precomposed characters that encode specific glyph shapes and ligatures, primarily to support interoperability with legacy encoding standards and font systems predating advanced text shaping capabilities. These forms address the cursive nature of Arabic by providing fixed representations of contextual variants, which were necessary in environments like early DOS codepages or 8-bit encodings that could not dynamically adjust glyphs based on adjacent characters.[20][21] The Arabic Presentation Forms-A block (U+FB50–U+FDFF) focuses on ligatures and contextual glyphs tailored for particular languages, scripts, or religious contexts, such as those in Persian, Urdu, or Islamic honorifics. For instance, it includes U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM ﷲ, which combines multiple base letters into a single precomposed character for traditional rendering. These characters were encoded to round-trip data from standards requiring such fixed forms, but they are not intended for new content creation.[6][22]
In contrast, the Arabic Presentation Forms-B block (U+FE70–U+FEFF) provides individual positional forms for core Arabic letters, capturing isolated, initial, medial, and final shapes, along with spacing versions of diacritics and certain lam-alef ligatures. An example is U+FE91 ARABIC LETTER BEH INITIAL FORM ﺑ, which represents the beh letter in its initial position within a word. This block supports compatibility with legacy systems by offering direct mappings to pre-shaped glyphs in fonts without built-in cursive joining logic.[20][23]
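The positional forms in this block can be listed from the compatibility decompositions recorded in the Unicode Character Database. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `positional_forms` is illustrative and not part of any library.

```python
import unicodedata

def positional_forms(base: str) -> dict[str, str]:
    """Return {form tag: Forms-B character} for a single base letter."""
    target = f"{ord(base):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        # Decompositions look like "<initial> 0628"; unassigned code points
        # and characters without a decomposition yield an empty string.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

beh_forms = positional_forms("\u0628")  # ARABIC LETTER BEH
for tag, ch in beh_forms.items():
    print(f"{tag:9} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Dual-joining letters such as beh come back with four entries, while a right-joining letter such as alef (U+0627) has only isolated and final forms.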
Despite their utility for backward compatibility, these presentation forms are discouraged in contemporary usage because they duplicate functionality achievable through modern rendering techniques, such as OpenType's Glyph Substitution (GSUB) tables applied to base characters from the Arabic block (U+0600–U+06FF). Relying on precomposed forms can lead to inefficiencies in storage, searchability, and extensibility, as they bypass dynamic shaping that adapts to font features, language variations, and complex sequences. Unicode explicitly recommends encoding text with base letters and marks, then using shaping engines for glyph selection to ensure robust, future-proof representation.[24][5]
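To make concrete what a shaping engine does with base letters, the sketch below picks a positional form for each letter from its joining type. It is a deliberately simplified illustration: the `JOINING_TYPE` table is a hand-written, hypothetical subset (real engines read Joining_Type from the Unicode Character Database), and transparent marks, ZWJ/ZWNJ, and ligatures are ignored.

```python
# Hypothetical, hand-written subset of the Joining_Type property:
# "D" = Dual_Joining, "R" = Right_Joining, "U" = Non_Joining.
JOINING_TYPE = {
    "\u0643": "D",  # KAF
    "\u062A": "D",  # TEH
    "\u0628": "D",  # BEH
    "\u0627": "R",  # ALEF
    "\u062F": "R",  # DAL
    "\u0621": "U",  # HAMZA
}

def joining_type(ch: str) -> str:
    # Anything outside the table is treated as non-joining in this sketch.
    return JOINING_TYPE.get(ch, "U")

def positional_forms_for(text: str) -> list[str]:
    """Pick isolated/initial/medial/final for each letter, in logical order."""
    forms = []
    for i, ch in enumerate(text):
        prev_type = joining_type(text[i - 1]) if i > 0 else "U"
        next_type = joining_type(text[i + 1]) if i + 1 < len(text) else "U"
        this_type = joining_type(ch)
        # A letter connects backward if it is D or R and the previous
        # letter is able to connect forward (only D letters can).
        joins_prev = this_type in ("D", "R") and prev_type == "D"
        # A letter connects forward only if it is D and the next letter
        # is able to connect backward (D or R).
        joins_next = this_type == "D" and next_type in ("D", "R")
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

kitab = "\u0643\u062A\u0627\u0628"  # kaf, teh, alef, beh
print(positional_forms_for(kitab))
```

The stored text contains only the four base letters; the forms are derived at render time, which is why precomposed presentation characters are unnecessary in new content.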
For migrating legacy content that employs these forms, Unicode Normalization Form KC (NFKC) is advised, as it performs compatibility decomposition to break down presentation characters into their constituent base elements and combining marks, facilitating conversion to shaped text without loss of meaning. This process helps integrate older Arabic data into modern pipelines while preserving semantic integrity.[25]
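As a small illustration of this migration step, the sketch below applies NFKC with Python's standard `unicodedata.normalize`. The function name `to_base_letters` is an assumption made for the example. Note that NFKC also folds every other compatibility character in the input (for example fullwidth Latin letters), so it may change more than the Arabic presentation forms.

```python
import unicodedata

def to_base_letters(text: str) -> str:
    """Replace presentation forms with base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)

# LAM WITH ALEF ISOLATED FORM followed by BEH INITIAL FORM
legacy = "\uFEFB\uFE91"
modern = to_base_letters(legacy)
print([f"U+{ord(c):04X}" for c in modern])

# The Allah ligature expands to its four constituent letters.
allah = to_base_letters("\uFDF2")
print([f"U+{ord(c):04X}" for c in allah])
```

After conversion the text holds only characters from the core Arabic block, and a shaping engine selects the joined glyphs again at display time.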
Special Characters and Features
Punctuation and Ornaments
The Arabic script in Unicode includes several dedicated punctuation marks that support its right-to-left writing direction and orthographic traditions, distinct from Latin equivalents to ensure proper rendering and cultural accuracy.[1] Core punctuation encompasses the Arabic comma (U+060C ،), used to separate clauses or items in lists, and also employed in modern texts with related scripts like Thaana and Syriac.[2] The Arabic semicolon (U+061B ؛) functions similarly to its Latin counterpart for separating independent clauses, with shared usage in Thaana and Syriac orthography.[2] The Arabic question mark (U+061F ؟) mirrors the Latin question mark in form but is encoded separately to align with right-to-left flow, appearing in Thaana and Syriac as well.[2] Additionally, the Arabic full stop (U+06D4 ۔), often called a period or point, terminates sentences and is prevalent in Urdu, Persian, and Punjabi texts.[2]

Ornaments and decorative symbols enhance poetic and textual embellishment in Arabic script. The Arabic five-pointed star (U+066D ٭) serves as a bullet point, footnote marker, or emphasis symbol, with variable glyph appearances across fonts.[2] Decoration marks include the Arabic poetic verse sign (U+060E ؎), which denotes the start of a verse in classical poetry, and the Arabic sign misra (U+060F ؏), marking hemistich divisions in poetic lines.[2] These elements are integral to literary and Quranic presentations, where they provide structural cues without altering the primary text flow.[26]

In contextual applications, particularly religious texts, the Arabic tatweel (U+0640 ـ), also known as kashida, is inserted between letters to justify line lengths or elongate words for aesthetic balance, and it can carry diacritics when no base letter is present.[2] Currency signs like the Afghani sign (U+060B ؋) represent the Afghan currency in Pashto and Dari texts, derived from an abbreviation that has evolved into a logographic form.[26]

Bidirectional text processing requires careful handling of Arabic punctuation, because these marks do not all share one bidirectional class: the Arabic semicolon and question mark are strong right-to-left characters (Bidi_Class=AL), whereas the Arabic comma is a common number separator (Bidi_Class=CS) whose direction is resolved from its context. This keeps the marks paired with right-to-left text even when it is embedded in left-to-right contexts like English loanwords.[27] The same mirrored question mark (U+061F ؟) is used in Persian, in contrast to the non-mirrored Latin question mark (?) that appears in mixed-script environments.[2]
Ligatures and Joined Forms
In the Arabic script, ligatures are essential for cursive joining, where certain letter combinations form unified glyphs to reflect natural handwriting flow. The most prevalent mandatory ligature is the lam-alef combination, formed from the base characters Arabic Letter Lam (U+0644) and Arabic Letter Alef (U+0627), resulting in forms such as لا. This ligature is not precomposed in the core Arabic blocks but is generated dynamically through text shaping engines that analyze joining behavior, ensuring seamless rendering in both of its positions (isolated and final).[26] Compatibility variants of lam-alef ligatures, such as Arabic Ligature Lam with Alef Isolated Form (U+FEFB ﻻ), exist in the Arabic Presentation Forms-B block for legacy round-trip mapping, though modern usage favors shaping over these precomposed forms.[20]

Religious ligatures hold profound cultural and spiritual significance in Islamic typography, particularly in sacred texts where they symbolize reverence and prevent fragmentation of holy names. The Arabic Ligature Allah Isolated Form (U+FDF2 ﷲ), a precomposed character in the Arabic Presentation Forms-A block, represents the word "Allah" as a single glyph, traditionally used to honor the divine name in Sunni scriptural traditions. Similarly, the Arabic Ligature Sallallahou Alayhe Wasallam (U+FDFA ﷺ), encoding the phrase invoking blessings upon the Prophet Muhammad, serves an analogous role, appearing in religious manuscripts and digital Quran editions to maintain orthographic sanctity. These compatibility characters, introduced in Unicode 1.1, allow precise rendering without relying on complex shaping, though their use is contextualized within broader Islamic calligraphic practices.[6][28]

Discretionary ligatures extend beyond mandatory joins, enhancing aesthetic variety in specific calligraphic styles like Naskh and Nastaliq. In Naskh, a style favored for print and digital text due to its clarity, optional ligatures such as those involving ya or waw may be applied for visual harmony, controlled via OpenType font features like 'calt' (contextual alternates). Nastaliq, prominent in Persian and Urdu poetry, employs more intricate discretionary joins with sweeping connections, where font engines select variants based on stylistic sets to mimic manuscript fluidity. These are not enforced by Unicode's core joining rules but depend on the font implementation for activation.[13][19]

Unicode encoding strategies for Arabic ligatures prioritize flexibility in plain text while accommodating advanced control in rich formats. Base letters are combined with the Zero Width Joiner (U+200D, ZWJ) to force joins or ligatures where standard shaping might separate them, such as in stylized or historical reproductions; conversely, the Zero Width Non-Joiner (U+200C, ZWNJ) inhibits unwanted connections. This approach avoids a proliferation of precomposed characters in plain text, deferring specifics to rich text environments like HTML with CSS font features or PDF embedding, ensuring portability across systems.[29]

Historically, ligatures were integral to Ottoman manuscripts, where elaborate joins in styles like Diwani or Ruq'ah preserved artistic and scribal traditions in imperial documents and Qurans. Unicode's shaping model and presentation forms now enable digital preservation of these practices, allowing high-fidelity reproduction of Ottoman-era typography in archives and educational tools without loss of cursive integrity.[26][6]
Core Unicode Blocks
Arabic Block
The Arabic block in Unicode spans the code point range U+0600–U+06FF, encompassing 256 code points, all of which are assigned as of Unicode 17.0, dedicated to the core elements of the Arabic script.[2] This block serves as the foundational encoding for Modern Standard Arabic, providing the essential characters for text representation in that language.[2]

Key contents include the 28 basic letters of the Arabic alphabet, encoded within the range U+0621 (ARABIC LETTER HAMZA) to U+064A (ARABIC LETTER YEH), which form the skeleton of standard Arabic orthography.[2] Hamza forms are specifically covered in the initial positions U+0621–U+0626, such as U+0621 (ARABIC LETTER HAMZA) and U+0622 (ARABIC LETTER ALEF WITH MADDA ABOVE), enabling the representation of glottal stops and elongated vowels.[2] Core diacritics, essential for vowel indication and phonetic nuance, occupy U+064B–U+0652, including marks like U+064B (ARABIC FATHATAN) for tanwin fatha (nunation with fatha) and U+0651 (ARABIC SHADDA) for gemination.[2] Arabic-Indic digits appear in U+0660–U+0669, offering the traditional numeral forms used in Arabic-speaking regions, such as U+0660 (ARABIC-INDIC DIGIT ZERO) and U+0669 (ARABIC-INDIC DIGIT NINE).[2]

Special characters in the block include punctuation and symbols for textual and religious contexts, such as U+0600 (ARABIC NUMBER SIGN) for introducing numbers, U+0601 (ARABIC SIGN SANAH) for year indications in historical texts, and U+060E (ARABIC POETIC VERSE SIGN) for marking poetic verses.[2] The block's characters are grouped into categories for practical use: letters (primarily U+0621–U+06D5, covering basic and variant forms), combining marks (U+064B–U+065F for diacritics and vowel signs), punctuation (scattered positions like U+060C for the comma and U+06D4 for the full stop), and digits (U+0660–U+0669, plus the Extended Arabic-Indic digits U+06F0–U+06F9).[2] Notably, U+06DD (ARABIC END OF AYAH) marks Quranic verse endings and has special rendering behavior, enclosing the verse number that follows it; it is not deprecated.[2] Extensions for orthographic variants in non-Arabic languages are addressed in subsequent blocks like Arabic Supplement.[2]

| Category | Range | Examples | Notes |
|---|---|---|---|
| Basic Letters | U+0621–U+064A | U+0627 (ALEF), U+0633 (SEEN) | 28 core letters forming the Arabic alphabet |
| Hamza Forms | U+0621–U+0626 | U+0624 (WAW WITH HAMZA ABOVE) | Glottal stop and initial vowel carriers |
| Diacritics | U+064B–U+0652 | U+064E (FATHA), U+0650 (KASRA) | Vowel and consonant modifiers; combining |
| Digits | U+0660–U+0669 | U+0661 (ONE), U+0666 (SIX) | Arabic-Indic numerals |
| Special Punctuation/Symbols | Various (e.g., U+0600–U+060E) | U+060C (COMMA), U+060E (POETIC VERSE SIGN) | Includes number sign, sanah, and verse markers |
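The categories in the table can be cross-checked against the General_Category values shipped with a programming language's Unicode database. The sketch below assumes Python's standard `unicodedata` module; `category_counts` is an illustrative helper name, and the totals it prints depend on the Unicode version bundled with the interpreter, so none are quoted here.

```python
import unicodedata
from collections import Counter

def category_counts(first: int, last: int) -> Counter:
    """Count assigned code points per General_Category in [first, last]."""
    counts = Counter()
    for cp in range(first, last + 1):
        cat = unicodedata.category(chr(cp))
        if cat != "Cn":  # "Cn" means unassigned in this Unicode version
            counts[cat] += 1
    return counts

arabic_block = category_counts(0x0600, 0x06FF)
for cat, n in sorted(arabic_block.items()):
    print(cat, n)  # e.g. Lo = letters, Mn = combining marks, Nd = digits

# Arabic-Indic digits are decimal digits, so int() parses them directly.
year = int("\u0661\u0669\u0668\u0667")
print(year)
```

Letters report as `Lo`, the harakat as `Mn`, punctuation such as the Arabic comma as `Po`, and both digit rows (U+0660–U+0669 and U+06F0–U+06F9) as `Nd`.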
Arabic Supplement
The Arabic Supplement block occupies the Unicode range U+0750 to U+077F, encompassing 48 code points designed to encode variant forms of Arabic letters tailored for non-Arabic languages, with a primary emphasis on orthographies in Africa and Pakistan.[30] Introduced in Unicode 4.1 and expanded in subsequent versions, this block addresses phonetic distinctions in regional scripts by providing precomposed letters rather than combining diacritics.[1] All 48 assigned code points represent joining letters that participate in the cursive Arabic script shaping process, often with modifications to joining behaviors to suit specific linguistic needs.[31]

This block supports Arabic-script writing systems for African languages such as Hausa, Fulfulde, Songhay, and Wolof, spoken in regions including Nigeria, Chad, and other parts of Northern and Western Africa.[1] It also includes letters for Berber languages, notably Amazigh in Morocco, as well as Pakistani languages like Khowar and Burushaski.[30] For instance, U+0763 (ARABIC LETTER KEHEH WITH THREE DOTS ABOVE) is used in Berber orthographies to represent a distinct velar sound.[30] Similarly, U+0750 (ARABIC LETTER BEH WITH THREE DOTS HORIZONTALLY BELOW) aids in encoding implosive or glottalized consonants in Hausa and related African languages.[31] Another example is U+077A (ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE), which appears in Burushaski but reflects broader adaptations for tonal or emphatic features in African contexts.[30]

The characters emphasize dotted and stroked variants to differentiate sounds, such as additional i'jam (dot clusters) above or below base forms, or strokes as in U+075B (ARABIC LETTER REH WITH STROKE) for fricative realizations in African scripts.[30] No combining diacritics are included; instead, the block prioritizes atomic letters that integrate seamlessly into text rendering while preserving modified joining, typically dual-joining (initial, medial, final, and isolated forms), to accommodate word formation in these languages.[1] This design ensures compatibility with standard Arabic rendering engines, though custom font support may be required for optimal glyph selection in non-Arabic contexts.[31]

| Code Point | Character Name | Language/Use Example |
|---|---|---|
| U+0750 | ARABIC LETTER BEH WITH THREE DOTS HORIZONTALLY BELOW | African languages (e.g., Hausa implosives) |
| U+075B | ARABIC LETTER REH WITH STROKE | Stroked variant for African fricatives |
| U+0763 | ARABIC LETTER KEHEH WITH THREE DOTS ABOVE | Berber (Amazigh) velars |
| U+077A | ARABIC LETTER YEH BARREE WITH EXTENDED ARABIC-INDIC DIGIT TWO ABOVE | Burushaski (Pakistan) |
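The code points in the table can be inspected with Python's standard `unicodedata` module. The following is a minimal sketch, not a reference implementation: the helper names `in_arabic_supplement` and `describe` are hypothetical, and the names and categories printed depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Block range as given in the text: U+0750..U+077F (48 code points).
ARABIC_SUPPLEMENT = range(0x0750, 0x0780)

def in_arabic_supplement(ch):
    """Return True if the single character ch lies in the Arabic Supplement block."""
    return ord(ch) in ARABIC_SUPPLEMENT

def describe(cp):
    """Format a code point as 'U+XXXX NAME [General_Category]'."""
    ch = chr(cp)
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{cp:04X} {name} [{unicodedata.category(ch)}]"

# The four example letters from the table above.
for cp in (0x0750, 0x075B, 0x0763, 0x077A):
    print(describe(cp))
```

Each example reports the general category `Lo` (other letter), which matches the statement that the block holds atomic letters rather than combining marks.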
Arabic Extended-A
The Arabic Extended-A block extends the Unicode encoding for the Arabic script by providing characters essential for orthographic variants, regional traditions, and annotation needs in non-standard Arabic usages. Spanning the range U+08A0 to U+08FF, it allocates 96 code points to accommodate these extensions.[3] Introduced in Unicode version 6.1 in 2012, the block was designed to address gaps in representing diverse Arabic-script-based writing systems, particularly those requiring unique letter forms and diacritical marks.[32] Its contents primarily support North African orthographic traditions, such as the Warsh reading, as well as annotation systems for scholarly and religious texts.[3]
As of Unicode 17.0 all 96 code points are assigned; most are letters with joining behavior or combining diacritics.[33] Letters often feature modifications like dots or loops to distinguish phonemes in specific languages or dialects, while marks enable precise vowel and tone indications. For instance, the Warsh orthography, prevalent in North and West African Arabic traditions, utilizes characters such as U+08BB ARABIC LETTER AFRICAN FEH (ࢻ) to represent the African form of feh in Quranic and regional texts.[3] Similarly, U+08AA ARABIC LETTER REH WITH LOOP (ࢪ) is a reh variant with an added loop, encoded for the Rohingya orthography.[3] These letters take the contextual joining forms of their base letters: dual-joining letters have initial, medial, and final forms, while right-joining letters such as reh have only isolated and final forms.[3]
Diacritics in the block enhance annotation capabilities, particularly for religious and linguistic precision. Key examples include U+08E4 ARABIC CURLY FATHA (ࣄ), a stylized short vowel mark employed in Quranic annotations to indicate specific intonations, with further Qur'anic specifics addressed in the Arabic Extended-B block.[3] Other marks, such as those for tone or emphasis in African language orthographies, combine above or below base letters to convey suprasegmental features. The block's design facilitates compatibility with core Arabic blocks, allowing seamless integration for languages like those in North Africa and Central Asia, including variants that align with Pashto orthographic extensions through shared script mechanics.[34] Overall, these elements promote accurate digital representation of historical and contemporary Arabic-script materials without relying on compatibility forms.[3]

| Category | Description | Representative Examples |
|---|---|---|
| Letters | Modified base letters for regional phonetics, with joining support | U+08A0 ARABIC LETTER BEH WITH SMALL V BELOW (ࢠ); U+08BB ARABIC LETTER AFRICAN FEH (ࢻ) for Warsh[3] |
| Joining Forms | Contextual variants (isolated/initial/medial/final, depending on the letter's joining type) for cursive flow | U+08AA ARABIC LETTER REH WITH LOOP (isolated: ࢪ), a right-joining letter with isolated and final forms only[3] |
| Diacritics | Combining marks for vowels, tones, and annotations | U+08E4 ARABIC CURLY FATHA (ࣄ) for Quranic readings; U+08CE ARABIC LARGE ROUND DOT ABOVE for emphasis[3] |
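The split between letters and combining diacritics described above can be checked programmatically. This is a minimal sketch using Python's `unicodedata` module; the `classify` helper is a hypothetical name introduced for illustration, and the three sample code points are the ones that have been in the block since Unicode 6.1.

```python
import unicodedata

def classify(cp):
    """Classify a code point as 'letter', 'combining mark', or 'other (<category>)'."""
    cat = unicodedata.category(chr(cp))
    if cat.startswith("L"):
        return "letter"
    if cat.startswith("M"):
        return "combining mark"
    return f"other ({cat})"

# U+08A0 and U+08AA are letters; U+08E4 is a combining vowel mark.
for cp in (0x08A0, 0x08AA, 0x08E4):
    print(f"U+{cp:04X}", classify(cp))

# A combining mark follows its base letter in logical order:
# beh (U+0628) followed by curly fatha (U+08E4) is two code points, one cluster.
marked_beh = "\u0628\u08E4"
print(len(marked_beh), unicodedata.combining(marked_beh[1]) != 0)
```

Classifying by general category rather than by hard-coded sub-ranges keeps the sketch valid as later Unicode versions fill in more of the block.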
Arabic Extended-B
The Arabic Extended-B block encompasses the code point range U+0870 to U+089F, comprising 48 positions of which 43 are assigned as of Unicode 17.0.[8] Introduced in Unicode 14.0 in September 2021, this block addresses needs in Qur'anic typography and orthographic extensions for non-Arabic languages.[35][36] Its primary purpose is to facilitate precise rendering of Qur'anic annotations, including pause marks and elongation indicators that guide recitation and textual structure, while also providing letter variants for languages such as those spoken in Africa and Southeast Asia.[8] For instance, symbols like the Arabic small high word al-juz (U+0898) denote major sectional pauses in the Quran, and the Arabic doubled madda (U+089E) represents extended vowel lengthening essential for rhythmic intonation.[8] These annotations enhance digital displays of sacred texts by supporting traditional heavy marks, such as the Arabic vertical tail (U+088E), which serves as an abbreviation indicator in religious manuscripts.[8]
In addition to Qur'anic features, the block includes letters tailored for non-Arabic scripts, particularly in African contexts like Hausa and Manding, as well as Southeast Asian orthographies such as Pegon for Javanese. Examples encompass the Arabic letter tah with dot below (U+088B), used to represent specific phonemes in African Arabic-based writings, and the Arabic letter keheh with two dots vertically below (U+088D), which distinguishes sounds in West African languages.[8] The assigned characters form a mix of approximately 26 letters (primarily alef variants and modified consonants) and approximately 16 marks or symbols, with all applicable forms supporting right-to-left directionality and joining behavior for proper cursive rendering.[8]

| Category | Approximate Count | Examples | Purpose |
|---|---|---|---|
| Letters | 26 | U+0870 ARABIC LETTER ALEF WITH ATTACHED FATHA; U+088B ARABIC LETTER TAH WITH DOT BELOW | Phonetic distinctions in non-Arabic languages, including African scripts |
| Marks & Symbols | 16 | U+088E ARABIC VERTICAL TAIL; U+0898 ARABIC SMALL HIGH WORD AL-JUZ | Qur'anic pauses, elongations, and annotations for recitation guidance |
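Because Arabic Extended-B sits directly below Arabic Extended-A in the code space, it is easy to confuse the two ranges. The sketch below shows one way to look up the block of a code point with a binary search over the ranges listed at the top of this article. The table `ARABIC_BLOCKS` and the function `arabic_block` are hypothetical names, and the code deliberately avoids `unicodedata` so that it does not depend on the interpreter's Unicode version.

```python
from bisect import bisect_right

# (start, end, name) for the Arabic-script blocks listed in this article, sorted by start.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x10E60, 0x10E7F, "Rumi Numeral Symbols"),
    (0x10EC0, 0x10EFF, "Arabic Extended-C"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1ED00, 0x1ED4F, "Ottoman Siyaq Numbers"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]
_STARTS = [start for start, _, _ in ARABIC_BLOCKS]

def arabic_block(cp):
    """Return the name of the Arabic-script block containing cp, or None."""
    i = bisect_right(_STARTS, cp) - 1  # last block whose start is <= cp
    if i >= 0 and cp <= ARABIC_BLOCKS[i][1]:
        return ARABIC_BLOCKS[i][2]
    return None

print(arabic_block(0x088B))  # tah with dot below
print(arabic_block(0x08A0))  # first code point of the neighbouring block
print(arabic_block(0x086F))  # just below Extended-B: not an Arabic-script block
```

A block lookup only says which range a code point falls in; it does not say whether that code point is assigned in a given Unicode version.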
Arabic Extended-C
The Arabic Extended-C block spans the code point range U+10EC0 to U+10EFF, encompassing 64 positions dedicated to specialized extensions of the Arabic script.[15] Introduced in Unicode 15.0 (2022), it addresses encoding needs for regional variations in Qur'anic recitation and annotation, particularly in Turkey and Libya, while also supporting additional letters for the Pegon script used in Javanese Arabic writing in Indonesia.[37] This block fills critical gaps in representing orthographic traditions that were previously unencoded or approximated, enhancing digital support for religious texts and minority scripts.[38]
The block's development reflects ongoing efforts to incorporate diverse Arabic script usages. Unicode 15.0 assigned the initial three characters, all as low-placed Qur'anic marks for Turkish recitation traditions.[39] Unicode 16.0 (2024) added four more, including three variant letters for Pegon and one combining mark for Libyan Qur'ans, increasing the total to seven assigned code points.[40] By Unicode 17.0 (2025), the count reached 21 assigned characters through further proposals, with additional characters provisionally allocated for Unicode 18.0, such as U+10EF9 ARABIC MARK CROWN for decorative marks.[15][17]
Central to the block are Turkish Qur'anic marks, such as Arabic small low word sakta (U+10EFD), which signals a brief pause in reading; Arabic small low word qasr (U+10EFE), indicating shortened vowel pronunciation; and Arabic small low word madda (U+10EFF), denoting elongation.[39] These low-placed diacritics differ from higher variants in earlier blocks, providing precise rendering for Turkish mushafs. For Libyan traditions, the Arabic combining alef overlay (U+10EFC) overlays an alef shape to mark specific recitational features.[40] Pegon letters, used in Indonesian Javanese contexts to adapt Arabic for local languages, include variants like Arabic letter dal with two dots vertically below (U+10EC2), Arabic letter tah with two dots vertically below (U+10EC3), and Arabic letter kaf with two dots vertically below (U+10EC4).[37] Additional Indonesian-specific marks, such as Arabic small yeh barree with two dots below (U+10EC5) for unwritten yeh in Uthmanic rasm and Arabic small low noon (U+10EFB), further support Pegon orthography.[15]
The assigned characters emphasize diacritics for Qur'anic precision, variant letters for regional scripts, and ligatures for honorific phrases, exemplified by Arabic ligature alayhaa as-salaatu was-salaam (U+10ED1) and similar forms (U+10ED2–U+10ED8) used in religious texts.[15] Other inclusions, like Arabic letter thin noon (U+10EC6) for medial forms in Warsh orthography and Arabic double vertical bar below (U+10EFA) as a tanween mark in Old Sindhi, highlight niche applications.[37] With only 21 of 64 code points allocated in Unicode 17.0, the block's incomplete coverage underscores potential for future additions to encompass more unencoded variants from global Arabic traditions.[15] It builds on prior extensions like Arabic Extended-B by targeting post-2022 specifics in religious and cultural annotations.[39]

| Category | Examples | Usage Context | Code Points |
|---|---|---|---|
| Turkish Qur'anic Marks | Small low word sakta, qasr, madda | Recitation pauses and vowel adjustments in Turkish mushafs | U+10EFD–U+10EFF |
| Libyan Qur'anic Marks | Combining alef overlay | Overlay for alef in Libyan readings | U+10EFC |
| Pegon Letters (Indonesia) | Dal/tah/kaf with two dots vertically below | Variant letters in Javanese Arabic script | U+10EC2–U+10EC4 |
| Indonesian Marks | Small yeh barree with two dots below, small low noon | Uthmanic rasm and noon indications in Pegon | U+10EC5, U+10EFB |
| Honorific Ligatures | Alayhaa as-salaatu was-salaam variants | Religious phrases in texts | U+10ED1–U+10ED8 |
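The assignment history described for this block is simple running arithmetic, sketched below. The per-version figures for 15.0 and 16.0 come from the text; the figure of 14 for Unicode 17.0 is an inference (21 minus the earlier 7) rather than a number stated in the cited sources, and the variable names are illustrative only.

```python
from itertools import accumulate

# Size of the block: U+10EC0..U+10EFF inclusive.
BLOCK_SIZE = 0x10EFF - 0x10EC0 + 1

# Characters added per Unicode version. The 17.0 value is inferred as 21 - 7.
additions = {"15.0": 3, "16.0": 4, "17.0": 14}

# Running total of assigned characters after each version.
running = dict(zip(additions, accumulate(additions.values())))

for version, total in running.items():
    print(f"Unicode {version}: {total} assigned, {BLOCK_SIZE - total} unassigned")
```

Under these figures, 43 of the 64 positions remain unassigned after Unicode 17.0.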
Presentation and Compatibility Blocks
Arabic Presentation Forms-A
The Arabic Presentation Forms-A block provides compatibility characters for the Arabic script, encoding precomposed ligatures and contextual forms essential for legacy typesetting and display systems that lack built-in text shaping capabilities. These characters allow direct rendering of joined letter combinations and word forms without requiring complex rendering engines, facilitating compatibility with older software and fonts used for Arabic, Persian, Urdu, and related languages.[6] The block primarily focuses on multi-letter ligatures and variant letter forms, distinguishing it from single-letter positional variants found elsewhere.[41]
Spanning the range U+FB50–U+FDFF, the block encompasses 688 code points, with 656 assigned as of Unicode 17.0.[42] Among its key contents are language-specific letter variants, such as those for Persian and Urdu, exemplified by U+FB56 ARABIC LETTER PEH ISOLATED FORM (پ), which represents the additional letter "peh" not present in standard Arabic. Another example is U+FB66 ARABIC LETTER TTEH ISOLATED FORM for the retroflex "t" sound in languages like Urdu. These variants are encoded as separate isolated, final, initial, and medial forms, where the letter's joining behavior allows, to support basic display in non-shaping environments.
A significant portion of the block is dedicated to word ligatures, including 32 religious phrases and honorific expressions commonly used in Islamic texts and formulaic Arabic writing.[6] These precomposed forms, such as U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM (﷽), enable straightforward rendering of the Basmala invocation without decomposition.[43] Other notable examples include U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM (ﷲ) and U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM (ﷺ), which represent common benedictions following mentions of the Prophet Muhammad.[44] Such ligatures preserve traditional calligraphic styles in digital text, particularly for Quranic and devotional content.
The characters in this block are grouped by function, including language-specific forms, two- or three-letter joins, and dedicated word ligatures for religious or idiomatic use. The following compact table summarizes representative categories with examples for quick reference:

| Category | Description | Examples (Code Point, Name, Glyph) |
|---|---|---|
| Persian/Urdu/Sindhi Variants | Additional letters with positional forms for non-Arabic languages | U+FB56, ARABIC LETTER PEH ISOLATED FORM, پ; U+FB57, ARABIC LETTER PEH FINAL FORM, ـپ; U+FB66, ARABIC LETTER TTEH ISOLATED FORM, ٹ |
| Common Letter Ligatures | Joined forms of two or more letters in contextual positions | U+FC3F, ARABIC LIGATURE LAM WITH JEEM ISOLATED FORM, لج; U+FC0A, ARABIC LIGATURE BEH WITH YEH ISOLATED FORM, بي; U+FD92, ARABIC LIGATURE MEEM WITH JEEM WITH KHAH INITIAL FORM, مجخ |
| Religious Word Ligatures | Precomposed phrases for honorifics and invocations | U+FDF0, ARABIC LIGATURE SALLA USED AS KORANIC STOP SIGN ISOLATED FORM, ﷰ; U+FDF2, ARABIC LIGATURE ALLAH ISOLATED FORM, ﷲ; U+FDFD, ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM, ﷽ |
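Since these characters exist for compatibility, text that contains them can be mapped back to ordinary Arabic letters through compatibility normalization. The sketch below uses NFKC from Python's `unicodedata` module. The helper names `to_base_text` and `code_points` are hypothetical, and whether normalization is appropriate for a given workflow (for example, where round-tripping legacy data matters) is left as an assumption.

```python
import unicodedata

def to_base_text(s):
    """Replace presentation forms with their compatibility-equivalent base letters."""
    return unicodedata.normalize("NFKC", s)

def code_points(s):
    """List the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

peh_isolated = "\uFB56"    # ARABIC LETTER PEH ISOLATED FORM
allah_ligature = "\uFDF2"  # ARABIC LIGATURE ALLAH ISOLATED FORM

print(code_points(to_base_text(peh_isolated)))    # ['U+067E']
print(code_points(to_base_text(allah_ligature)))  # alef, lam, lam, heh
```

The single presentation form for peh maps to the base letter U+067E, and the word ligature expands to a sequence of four base letters that a shaping engine can then render contextually.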
Arabic Presentation Forms-B
The Arabic Presentation Forms-B block in Unicode spans the code point range U+FE70 to U+FEFF, encompassing 144 positions within the Basic Multilingual Plane.[20] Of these, 141 characters are assigned, providing compatibility representations for Arabic script elements that predate advanced shaping engines.[20] This block was introduced to support legacy digital Arabic typography, where fixed glyph forms were necessary due to the absence of dynamic contextual rendering in early systems.[20]
The block primarily contains single-letter positional forms for Arabic characters, including isolated, initial, medial, and final variants, as well as spacing and modified diacritic marks.[20] For instance, positional forms of the letter beh (ب) include U+FE8F (ARABIC LETTER BEH ISOLATED FORM, ﺏ), U+FE90 (ARABIC LETTER BEH FINAL FORM, ﺐ), U+FE91 (ARABIC LETTER BEH INITIAL FORM, ﺑ), and U+FE92 (ARABIC LETTER BEH MEDIAL FORM, ﺒ).[45] Modified marks feature variants such as U+FE76 (ARABIC FATHA ISOLATED FORM, ﹶ), U+FE77 (ARABIC FATHA MEDIAL FORM, ﹷ), and U+FE70 (ARABIC FATHATAN ISOLATED FORM, ﹰ), which provide spacing (isolated) and tatweel-based (medial) variants of diacritics such as fatha and fathatan.[45] These forms ensure consistent appearance in environments lacking glyph substitution rules, such as early word processors or fixed-font displays.[20]
The block opens with the spacing and tatweel-based diacritic forms (U+FE70–U+FE7F), followed by the letters, each progressing through its positional states in the order isolated, final, initial, medial; right-joining letters have only the isolated and final forms.[20] Diacritic forms are grouped by type, with isolated and medial variants distinguished to mimic script joining behavior without actual ligation.[20] The block also includes the format character U+FEFF (ZERO WIDTH NO-BREAK SPACE), which serves as the byte order mark (BOM); its older use as a word joiner is deprecated in favor of U+2060 (WORD JOINER).[20] While these precomposed forms remain useful for round-trip compatibility in legacy data interchange, modern Unicode-conformant systems prefer generating such variants dynamically from the core Arabic block (U+0600–U+06FF).[20]

| Category | Example Code Points | Representative Glyphs and Names |
|---|---|---|
| Positional Letter Forms (e.g., Beh) | U+FE8F–U+FE92 | ﺏ (isolated), ﺐ (final), ﺑ (initial), ﺒ (medial) |
| Diacritic Variants (e.g., Fatha) | U+FE70, U+FE76–U+FE77 | ﹰ (fathatan isolated), ﹶ (fatha isolated), ﹷ (fatha medial) |
| Other Marks | U+FE71, U+FE7C | ﹱ (tatweel with fathatan), ﹼ (shadda isolated) |
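The relationship between a positional form and its base letter is recorded in the Unicode Character Database as a tagged compatibility decomposition, which Python exposes through `unicodedata.decomposition`. The following minimal sketch reads that tag for the four beh forms; `positional_info` is a hypothetical helper name introduced here.

```python
import unicodedata

def positional_info(ch):
    """Split a decomposition such as '<initial> 0628' into (form, base string).

    Returns None for characters without a tagged compatibility decomposition.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp.startswith("<"):
        return None
    tag, *cps = decomp.split()
    base = "".join(chr(int(cp, 16)) for cp in cps)
    return tag.strip("<>"), base

# The four positional forms of beh, U+FE8F..U+FE92.
for cp in range(0xFE8F, 0xFE93):
    form, base = positional_info(chr(cp))
    print(f"U+{cp:04X} {form:<8} -> U+{ord(base):04X}")
```

All four forms resolve to the same base letter U+0628, which is why modern systems can store only the base letter and leave the choice of form to the shaping engine.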
Numeral and Historical Symbol Blocks
Rumi Numeral Symbols
The Rumi Numeral Symbols block occupies the Unicode range U+10E60–U+10E7F, encompassing 32 code points of which 31 are assigned.[46] This block encodes characters for the Rumi numeral system, a historical additive numbering method with limited positional features, employed primarily in North Africa (notably Fez, Morocco) and al-Andalus (Iberian Peninsula) from the 10th to 17th centuries CE.[47] The system, also known as Fasi or zimam numerals, derives from Coptic or Greek-Coptic traditions and was used in Arabic-script manuscripts for foliation, chapter notations, accounting records, astronomical instruments, and mathematical calculations.[47] It supports scholarly analysis of historical Islamic science, mathematics, and commerce by providing digital representations of these non-positional symbols.[47]
Introduced in Unicode version 5.2 (October 2009), the block facilitates the preservation and rendering of Rumi numerals in modern digital texts, particularly for academic and archival purposes.[48] Unlike positional decimal systems, Rumi numerals operate additively: base symbols represent units, tens, or hundreds, with higher orders like thousands (up to 9000) formed by placing one or more horizontal bars beneath the base numeral.[47] For instance, a bar under the symbol for 3 denotes 3000. Fractions are indicated by special symbols or by a slash separating numerator (positioned top-right) and denominator (bottom-left) relative to a base numeral.[47]
The block's characters are categorized into three main groups: digits for units 1–9, higher-order numbers for tens through hundreds, and fractions. Digits include U+10E60 𐹠 (Rumi Digit One) through U+10E68 𐹨 (Rumi Digit Nine). Higher units range from U+10E69 𐹩 (Rumi Number Ten) to U+10E71 𐹱 (Rumi Number Ninety), and from U+10E72 𐹲 (Rumi Number One Hundred) to U+10E7A 𐹺 (Rumi Number Nine Hundred). Fractions comprise U+10E7B 𐹻 (Rumi Fraction One Half), U+10E7C 𐹼 (Rumi Fraction One Quarter), U+10E7D 𐹽 (Rumi Fraction One Third), and U+10E7E 𐹾 (Rumi Fraction Two Thirds).[46]

| Category | Code Points | Examples | Description |
|---|---|---|---|
| Digits (1–9) | U+10E60–U+10E68 | 𐹠 (One), 𐹤 (Five), 𐹨 (Nine) | Basic unit symbols forming the foundation of additive combinations.[46] |
| Numbers (10–900) | U+10E69–U+10E7A | 𐹩 (Ten), 𐹭 (Fifty), 𐹺 (Nine Hundred) | Symbols for tens, hundreds, and their multiples; used with digits for larger values.[46] |
| Fractions | U+10E7B–U+10E7E | 𐹻 (One Half), 𐹽 (One Third) | Dedicated glyphs for common fractions in accounting and measurements; additional fractions via contextual notation.[46][47] |
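The additive principle can be illustrated with a short sketch that breaks an integer from 1 to 999 into one hundreds symbol, one tens symbol, and one units symbol. Several points are assumptions made for illustration: the function names `to_rumi` and `rumi_value` are hypothetical, the starting code points are looked up by character name from the interpreter's Unicode database, each group is assumed to occupy nine consecutive code points, and the symbols are returned largest first in logical order, which may not match the arrangement found in a particular manuscript. Thousands (written with bars beneath the numeral) and fractions are not handled.

```python
import unicodedata

# First symbol of each group, looked up by name rather than hard-coded.
DIGIT_ONE = ord(unicodedata.lookup("RUMI DIGIT ONE"))
NUMBER_TEN = ord(unicodedata.lookup("RUMI NUMBER TEN"))
NUMBER_HUNDRED = ord(unicodedata.lookup("RUMI NUMBER ONE HUNDRED"))

def to_rumi(n):
    """Decompose 1..999 into Rumi symbols: hundreds, then tens, then units."""
    if not 1 <= n <= 999:
        raise ValueError("this sketch only covers 1..999")
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    symbols = []
    if hundreds:
        symbols.append(chr(NUMBER_HUNDRED + hundreds - 1))
    if tens:
        symbols.append(chr(NUMBER_TEN + tens - 1))
    if units:
        symbols.append(chr(DIGIT_ONE + units - 1))
    return symbols

def rumi_value(symbols):
    """Sum the numeric values of the symbols (the additive reading)."""
    return int(sum(unicodedata.numeric(s) for s in symbols))

for s in to_rumi(345):
    print(f"U+{ord(s):04X} {unicodedata.name(s)}")
print(rumi_value(to_rumi(345)))
```

Because the system is additive, decoding does not depend on symbol order: `rumi_value` simply sums the numeric values that the Unicode Character Database assigns to each symbol.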
