Package Tashaphyne :: Module normalize

Module normalize

Utility functions used to prepare Arabic text for searching and indexing.

Functions
    Individual Functions
unicode.
strip_tashkeel(text)
Strip vowel marks (tashkeel) from a text and return the resulting text.
unicode.
strip_tatweel(text)
Strip tatweel (the elongation character) from a text and return the resulting text.
unicode.
normalize_hamza(text)
Normalize Hamza forms into a single form and return the resulting text.
unicode.
normalize_lamalef(text)
Normalize Lam Alef ligatures into two letters (LAM and ALEF) and return the resulting text.
unicode.
normalize_spellerrors(text)
Normalize some spelling errors, such as TEH_MARBUTA into HEH and ALEF_MAKSURA into YEH, and return the resulting text.
    Normalize One Function
unicode.
normalize_searchtext(text)
Normalize input text and return the resulting text.
Variables
  HARAKAT_pat = re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u06...
  HAMZAT_pat = re.compile(r'[\u0624\u0626]')
  ALEFAT_pat = re.compile(r'[\u0622\u0623\u0625\u0654\u0655]')
  LAMALEFAT_pat = re.compile(r'[\ufefb\ufef7\ufef9\ufef5]')
  AIN = u'ع'
  ALEF = u'ا'
  ALEF_HAMZA_ABOVE = u'أ'
  ALEF_HAMZA_BELOW = u'إ'
  ALEF_MADDA = u'آ'
  ALEF_MAKSURA = u'ى'
  ALEF_WASLA = u'ٱ'
  BEH = u'ب'
  BYTE_ORDER_MARK = u'\ufeff'
  COMMA = u'،'
  DAD = u'ض'
  DAL = u'د'
  DAMMA = u'ُ'
  DAMMATAN = u'ٌ'
  DECIMAL = u'٫'
  EIGHT = u'٨'
  FATHA = u'َ'
  FATHATAN = u'ً'
  FEH = u'ف'
  FIVE = u'٥'
  FOUR = u'٤'
  FULL_STOP = u'۔'
  GHAIN = u'غ'
  HAH = u'ح'
  HAMZA = u'ء'
  HAMZA_ABOVE = u'ٔ'
  HAMZA_BELOW = u'ٕ'
  HEH = u'ه'
  JEEM = u'ج'
  KAF = u'ك'
  KASRA = u'ِ'
  KASRATAN = u'ٍ'
  KHAH = u'خ'
  LAM = u'ل'
  LAM_ALEF = u'\ufefb'
  LAM_ALEF_HAMZA_ABOVE = u'\ufef7'
  LAM_ALEF_HAMZA_BELOW = u'\ufef9'
  LAM_ALEF_MADDA_ABOVE = u'\ufef5'
  MADDA_ABOVE = u'ٓ'
  MEEM = u'م'
  MINI_ALEF = u'ٰ'
  NINE = u'٩'
  NOON = u'ن'
  ONE = u'١'
  PERCENT = u'٪'
  QAF = u'ق'
  QUESTION = u'؟'
  REH = u'ر'
  SAD = u'ص'
  SEEN = u'س'
  SEMICOLON = u'؛'
  SEVEN = u'٧'
  SHADDA = u'ّ'
  SHEEN = u'ش'
  SIX = u'٦'
  STAR = u'٭'
  SUKUN = u'ْ'
  TAH = u'ط'
  TATWEEL = u'ـ'
  TEH = u'ت'
  TEH_MARBUTA = u'ة'
  THAL = u'ذ'
  THEH = u'ث'
  THOUSANDS = u'٬'
  THREE = u'٣'
  TWO = u'٢'
  WAW = u'و'
  WAW_HAMZA = u'ؤ'
  YEH = u'ي'
  YEH_HAMZA = u'ئ'
  ZAH = u'ظ'
  ZAIN = u'ز'
  ZERO = u'٠'
  __package__ = 'Tashaphyne'
  simple_LAM_ALEF = u'لا'
  simple_LAM_ALEF_HAMZA_ABOVE = u'لأ'
  simple_LAM_ALEF_HAMZA_BELOW = u'لإ'
  simple_LAM_ALEF_MADDA_ABOVE = u'لآ'
Function Details

strip_tashkeel(text)

Strip vowel marks (tashkeel) from a text and return the resulting text. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • SHADDA
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text=u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
العربية
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode.
the stripped text.
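
As a rough illustration of the behaviour described above, the following sketch removes the same marks with a single regular-expression substitution. It assumes the character class shown for HARAKAT_pat under Variables Details; the name strip_tashkeel_sketch is hypothetical and is not part of the module.

```python
import re

# Assumption: the same code points as the documented HARAKAT_pat value
# (FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA).
HARAKAT_PAT = re.compile('[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')


def strip_tashkeel_sketch(text):
    """Hypothetical stand-in: delete every mark matched by HARAKAT_PAT."""
    return HARAKAT_PAT.sub('', text)
```

Because the marks are combining characters, deleting them leaves the base letters and their order untouched.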

strip_tatweel(text)

Strip tatweel (the elongation character) from a text and return the resulting text.

Example:

>>> text=u"العـــــربية"
>>> strip_tatweel(text)
العربية
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode.
the stripped text.
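
A minimal sketch of this step, assuming it amounts to deleting every TATWEEL character (U+0640, listed under Variables). The name strip_tatweel_sketch is hypothetical and is not part of the module.

```python
TATWEEL = '\u0640'  # ARABIC TATWEEL, the elongation character


def strip_tatweel_sketch(text):
    """Hypothetical stand-in: drop all tatweel characters from text."""
    return text.replace(TATWEEL, '')
```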

normalize_hamza(text)

Normalize Hamza forms into a single form and return the resulting text. The conversions are:

  • Converted into HAMZA: WAW_HAMZA, YEH_HAMZA
  • Converted into ALEF: ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, HAMZA_ABOVE, HAMZA_BELOW

Example:

>>> text=u"أهؤلاء من أولئكُ"
>>> normalize_hamza(text)
اهءلاء من اولءكُ
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode.
the converted text.
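
The two conversions can be sketched as a pair of substitutions. This assumes the character classes listed for HAMZAT_pat and ALEFAT_pat under Variables; the function name normalize_hamza_sketch is hypothetical, and the order of the two substitutions is an arbitrary choice since the classes do not overlap.

```python
import re

HAMZA = '\u0621'
ALEF = '\u0627'
# Assumption: same classes as the documented HAMZAT_pat and ALEFAT_pat.
HAMZAT_PAT = re.compile('[\u0624\u0626]')                    # WAW_HAMZA, YEH_HAMZA
ALEFAT_PAT = re.compile('[\u0622\u0623\u0625\u0654\u0655]')  # Alef forms, HAMZA_ABOVE/BELOW


def normalize_hamza_sketch(text):
    """Hypothetical stand-in: fold Hamza carriers into HAMZA or ALEF."""
    text = ALEFAT_PAT.sub(ALEF, text)
    return HAMZAT_PAT.sub(HAMZA, text)
```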

normalize_lamalef(text)

Normalize Lam Alef ligatures into two letters (LAM and ALEF) and return the resulting text. Some systems represent the Lam Alef ligature as a single character; this function converts it into two letters. The ligatures converted into LAM and ALEF are:

  • LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE

Example:

>>> text=u"لانها لالء الاسلام"
>>> normalize_lamalef(text)
لانها لالء الاسلام
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode.
the converted text.
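
A sketch of the ligature expansion, assuming the character class listed for LAMALEFAT_pat under Variables and, as the description states, that every ligature becomes a plain LAM followed by a plain ALEF. The name normalize_lamalef_sketch is hypothetical.

```python
import re

LAM = '\u0644'
ALEF = '\u0627'
# Assumption: same class as the documented LAMALEFAT_pat (isolated ligature forms).
LAMALEFAT_PAT = re.compile('[\ufefb\ufef7\ufef9\ufef5]')


def normalize_lamalef_sketch(text):
    """Hypothetical stand-in: expand each ligature into LAM + ALEF."""
    return LAMALEFAT_PAT.sub(LAM + ALEF, text)
```

Note that under this assumption the Hamza or Madda carried by a ligature is dropped, which matches the result of running the Hamza normalization afterwards.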

normalize_spellerrors(text)

Normalize some spelling errors, such as TEH_MARBUTA into HEH and ALEF_MAKSURA into YEH, and return the resulting text. In some contexts, users do not distinguish between TEH_MARBUTA and HEH, or between ALEF_MAKSURA and YEH. The conversions are:

  • TEH_MARBUTA into HEH
  • ALEF_MAKSURA into YEH

Example:

>>> text=u"اشترت سلمى دمية وحلوى"
>>> normalize_spellerrors(text)
اشترت سلمي دميه وحلوي
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode.
the converted text.
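
Since both conversions are one-to-one character replacements, they can be sketched with a translation table. This is only an illustration of the mapping described above; the names SPELL_MAP and normalize_spellerrors_sketch are hypothetical and are not part of the module.

```python
# Hypothetical translation table for the two documented conversions.
SPELL_MAP = str.maketrans({
    '\u0629': '\u0647',  # TEH_MARBUTA -> HEH
    '\u0649': '\u064a',  # ALEF_MAKSURA -> YEH
})


def normalize_spellerrors_sketch(text):
    """Hypothetical stand-in: apply both replacements in one pass."""
    return text.translate(SPELL_MAP)
```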

normalize_searchtext(text)

Normalize input text and return the resulting text. Normalization consists of:

  • strip tashkeel
  • strip tatweel
  • normalize Hamza
  • normalize Lam Alef
  • normalize Teh Marbuta and Alef Maksura

Example:

>>> text=u'أستشتري دمـــى آلية لأبنائك قبل الإغلاق'
>>> normalize_searchtext(text)
استشتري دمي اليه لابناءك قبل الاغلاق
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode.
the normalized text.
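
The five steps listed above can be sketched as one pipeline. This is a self-contained approximation that assumes the character classes listed under Variables and applies the steps in the order they are documented; the name normalize_searchtext_sketch and the helper constants are hypothetical, and the module's actual implementation may differ.

```python
import re

# Assumptions: classes copied from the documented *_pat variables.
HARAKAT_PAT = re.compile('[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')
HAMZAT_PAT = re.compile('[\u0624\u0626]')
ALEFAT_PAT = re.compile('[\u0622\u0623\u0625\u0654\u0655]')
LAMALEFAT_PAT = re.compile('[\ufefb\ufef7\ufef9\ufef5]')
SPELL_MAP = str.maketrans({'\u0629': '\u0647', '\u0649': '\u064a'})


def normalize_searchtext_sketch(text):
    """Hypothetical stand-in chaining the five documented steps."""
    text = HARAKAT_PAT.sub('', text)                # strip tashkeel
    text = text.replace('\u0640', '')               # strip tatweel
    text = ALEFAT_PAT.sub('\u0627', text)           # Alef forms -> ALEF
    text = HAMZAT_PAT.sub('\u0621', text)           # Waw/Yeh Hamza -> HAMZA
    text = LAMALEFAT_PAT.sub('\u0644\u0627', text)  # ligatures -> LAM + ALEF
    return text.translate(SPELL_MAP)                # Teh Marbuta, Alef Maksura
```

Applying the same function to both the indexed text and the query is what makes the two comparable, whatever spelling variant the user typed.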

Variables Details

HARAKAT_pat

Value:
re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')