Addition of Shahmukhi rules, fixes #108 #200

Closed
wants to merge 19 commits into from

Conversation

bgo-eiu

@bgo-eiu bgo-eiu commented Aug 8, 2022

Summary

This PR addresses the three concerns listed in #108:

  • Position-dependent replacement for ਉ
  • Position-dependent placement of ے
  • Use of ء

I have also added additional rules for the correct word-position use of nun gunna and alif maddah, and support for the arlam and arnun characters corresponding to their respective Gurmukhi letters ਲ਼ and ਣ.

Test

Duration

@bhajneet
Member

bhajneet commented Aug 8, 2022

Hi, thank you for your first-time contribution! Our team has a basic understanding of this script. Can you please provide some qualifications for this work?

Also I noted that the Gurmukhi letter ਲ਼ is showing up as undefined on macOS. Can you please explain this? (Please see image below).

image

lib/toShahmukhi.js Outdated Show resolved Hide resolved
@bgo-eiu
Author

bgo-eiu commented Aug 9, 2022

The character that does not appear is a relatively new addition to Unicode, from 2020 - you can see a picture here: https://en.wikipedia.org/wiki/Lam_with_tah_above (This is actually the first Unicode character added to the extended character set specifically for writing Punjabi in Shahmukhi.) There are a handful of fonts which support it, but the best open source one is Noto Nastaliq Urdu. I think macOS actually includes this font, but you may still need the latest version from https://github.com/notofonts/nastaliq/releases

I realize it would definitely be helpful to explain the rest of the rules I've included here - I will share a full write-up later tonight when I have time.

moved contents to where it's used
@lgtm-com

lgtm-com bot commented Aug 9, 2022

This pull request introduces 1 alert when merging afb3cd1 into 24a2e06 - view on LGTM.com

new alerts:

  • 1 for Syntax error

@lgtm-com

lgtm-com bot commented Aug 9, 2022

This pull request introduces 1 alert when merging e72e68f into 24a2e06 - view on LGTM.com

new alerts:

  • 1 for Syntax error

@bgo-eiu
Author

bgo-eiu commented Aug 9, 2022

Most of these notes can be observed in these dictionaries:

Note that both are old enough that they do not include the newer Unicode characters used in Shahmukhi. Punjabi University's dictionary is mostly very good, but one small issue is that it does not use the character ۓ where it is supposed to be used; it uses ئی instead. There are instances where ۓ is correct, however (otherwise there would be no reason for the character to exist), and you can find plenty of examples of words that use it by searching elsewhere.

Going in order from the original issue, then getting into some of the rest:

  • Just as ਉ is equivalent to اُ at word beginnings but و at word endings, the same applies to ਊ with an extra diacritical marker on the character before و (optional in Shahmukhi, but since this function already uses short vowel diacritics, I've included it).

  • There is one odd but very common exception involving ਉ which does not follow any of the general rules: ਉਹ is represented as اوہ. I have added this explicitly, as the pronouns ਉਹ and ਉਹਨਾਂ are very common. A few studies have been published on Gurmukhi-Shahmukhi transliteration (see https://aclanthology.org/C08-3009.pdf), and one observation they make is that while the vast majority of Punjabi words have a simple one-to-one transliteration between the scripts based on the typical rules, several of the most common Punjabi words don't follow the "typical" transliteration rules. ਉਹ is probably the most common of these, which is why I have included it in this PR.

  • As pointed out in the original issue, ے is only used at word endings, specifically if a word ends in ਯ (which is really quite rare) or ਏ or ਐ.

  • It is correct that a hamza needs to be added for multiple vowel connections, but it is only for specific vowel combinations which require a glottal stop in between to pronounce them. Gurmukhi mostly has no need to indicate glottal stops explicitly, but Arabic-based scripts require it. See the image here:

IMG_20220809_002743_939

This is from Gulshan-i-Urdu, published in Malerkotla (a predominantly Muslim part of Indian Punjab), a short book intended to teach Urdu writing conventions to people familiar with reading Gurmukhi, by way of Shahmukhi representations of Punjabi words and instructions based on Gurmukhi character combinations. It is quite a good reference table for the vowel combinations which require a hamza. Note, however, that there are independent Unicode characters for each of the letters hamza can attach to; these are used instead of a combining character. The rules for hamza I included are the same as those listed above, including ۓ, but only where appropriate at the end of a word.

  • There are a few very rare words derived from Persian and Arabic which have a glottal stop which occurs with a "default" vowel. By that I mean the implied "schwa" vowel that comes after most consonants in Punjabi. The rare words which end in a glottal stop + the default vowel can be represented with ਅ as an independent character (rather than ਆ) in Gurmukhi, and the hamza ء as an independent character. An example which appears in the Punjabi-English dictionary is ਪਰਗਟਾਅ / پرگٹاء.

  • Alif maddah آ can be seen at the beginning of words and very occasionally in the middle of words, but (almost) never at the end of words. This is just another position-based rule of Arabic-based scripts that you can observe in Shahmukhi writing. (The exceptions are rare and would have to be accounted for individually.)

  • Just like ے only appears at the end of words, nun gunna ں is also a position dependent character which only occurs at the end of words. Otherwise just ن is used. This is a common rule of written Urdu and the same follows for Shahmukhi. (99% of the time this is true anyway - there are a few very strange words that do not follow this rule. For example ਅਲੂੰਆਂ / الوں‌آں. Alif maddah would not normally appear like that either but a zero-width non-joiner unicode character is used to break the nun from connecting to it so that it looks more like it does at the beginning of a word. I did not include anything for exceptions like this because for now I am just trying to fix things that can be addressed with general rules first.)

  • Arlam ࣇ was mentioned; the other Shahmukhi character which is not used in Urdu is arnun ݨ for representing the same retroflex nasal sound as ਣ. The use of the letter seems to have caught on among some writers in Pakistani Punjab more than others; that it is not always used is partly because of it being a more recent addition. I think it is a good idea to include though because this is a very prominent and important sound in the Punjabi language and it is not as useful to have it look the same as ن. This was actually first added to Unicode for Saraiki, which represents a kind of dialect continuum in southwest Punjab between "standard" Punjabi and Sindhi. Sindhi uses ڻ with no dot for this, but Punjabi could not use this character before because it looks exactly the same as ٹ in the middle of a word. Sindhi does not use ٹ but Punjabi does so they could avoid this problem.

  • ਕ਼ is kind of a "fake" Gurmukhi letter because it represents a sound most Punjabi people can't pronounce. The pronunciation is not important for this though, this is the same letter as ق and q which appears in common Muslim names used in Pakistan. People do not pronounce Iqbal with the actual Arabic sound ق is supposed to represent, they pronounce it as Ikbal, but it is still written like this regardless. Following this pattern, you can find examples of Iqbal spelled as ਇਕ਼ਬਾਲ in Gurmukhi (see the use in here https://jagbani.punjabkesari.in/punjab/news/iqbal-singh-of-hoshiarpur-wins-bronze-medal-in-asia-rowing-championship-1152513). This is not very common, but since ਕ਼ does always mean ق, I thought I might as well include it.
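As a rough illustration of the position-dependent rules above (not the code in this PR), a clean-up pass over Shahmukhi output could demote the final-only letters whenever another letter follows them. The name `demoteNonFinalLetters`, the character ranges, and the choice of ی as the non-final counterpart of ے are assumptions made for this sketch; the rare exceptions mentioned above would still need individual handling.

```javascript
// Sketch only: names and ranges are illustrative assumptions, not the PR's code.
// Optional short-vowel marks that may sit between two letters.
const MARKS = '[\\u064B-\\u0652\\u0670]*';
// Arabic-script letters (basic block, Supplement, Extended-A up to U+08C7).
const LETTER = '[\\u0621-\\u064A\\u0671-\\u06D3\\u0750-\\u077F\\u08A0-\\u08C7]';
// True when the word continues with another letter.
const BEFORE_LETTER = `(?=${MARKS}${LETTER})`;

const NUN_GHUNNA_MID = new RegExp(`\\u06BA${BEFORE_LETTER}`, 'g'); // ں inside a word
const BARI_YE_MID = new RegExp(`\\u06D2${BEFORE_LETTER}`, 'g'); // ے inside a word

function demoteNonFinalLetters(shahmukhi) {
  return shahmukhi
    .replace(NUN_GHUNNA_MID, '\u0646') // ں -> ن when not word-final
    .replace(BARI_YE_MID, '\u06CC'); // ے -> ی when not word-final (assumed counterpart)
}
```

Because a zero-width non-joiner (U+200C) is not in the letter range, a ں followed by ZWNJ is left alone, which happens to match the ਅਲੂੰਆਂ example above.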

--

Further, I do have plans to account for some of the more complex cases beyond these general rules. A minority of words cannot be accounted for by any general rule and just have to be hard-coded explicitly. This is a longer-term project I am working on: creating a table of these words that could be used to fill in the remaining gaps in coverage for transliterating from Gurmukhi to Shahmukhi. The most common of these are words which end in ہ in Shahmukhi, which could have multiple different endings in Gurmukhi. ਇਹ in Shahmukhi is ایہہ, with the letter ہ written twice at the ending. ਸਬਜ਼ਾ is سبزہ, ਕਹਿ is کہہ, and so on.
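One possible shape for such a table, sketched under assumptions: whole Gurmukhi words are looked up first, and only words missing from the table fall through to the general rules. `WORD_EXCEPTIONS`, `transliterateWithExceptions`, and the `applyGeneralRules` callback are hypothetical names; the three entries are examples taken from this thread.

```javascript
// Sketch only: a whole-word exception table consulted before the general rules.
const WORD_EXCEPTIONS = new Map([
  ['\u0A09\u0A39', '\u0627\u0648\u06C1'], // ਉਹ -> اوہ
  ['\u0A07\u0A39', '\u0627\u06CC\u06C1\u06C1'], // ਇਹ -> ایہہ
  ['\u0A15\u0A39\u0A3F', '\u06A9\u06C1\u06C1'], // ਕਹਿ -> کہہ
]);

// A "word" here is any run of characters from the Gurmukhi block.
const GURMUKHI_WORD = /[\u0A00-\u0A7F]+/g;

// applyGeneralRules is a hypothetical callback standing in for the rule-based pass.
function transliterateWithExceptions(text, applyGeneralRules) {
  return text.replace(GURMUKHI_WORD, (word) => (
    WORD_EXCEPTIONS.has(word) ? WORD_EXCEPTIONS.get(word) : applyGeneralRules(word)
  ));
}
```

Keeping the exceptions as data rather than as extra regular expressions means the table can grow without touching the rule order.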

There are also several letters (ح خ ث غ ع ط ظ ص ض ق) which do not represent any particular sound used in Punjabi or similar Indic languages but are still used in spelling of various words in Punjabi and Urdu. These are mostly words derived from Persian or Arabic which are not necessarily pronounced like the words they are derived from. Gurmukhi spells these closer to phonetically, but the Shahmukhi spellings preserve the older characters. You can see some examples of these words in Waris Shah's poetry as a reference point. Then there are some very strange words which you can find here and there which just break the rules above, including even an example of ے in the middle of a word.

Shahmukhi also has a tendency to break the connection of a character before ਗਾ / گا and its inflections where they appear at the end of verb forms. I am not sure how to implement this yet but I intend to investigate this as well. If you look at the very end of the page of the book I posted a photo of above, you can actually see an example which illustrates multiple elements of the exceptional cases discussed here. ۓ is permitted in the middle of that last word even though it is normally an end-of-word character, because it is followed by گا (ਗਾ). I think it is possible to come up with a way to handle ਗਾ word endings specifically, but I want to think about it more because I want to make sure it only gets applied to the right words.

It would also be interesting to try working on a "toSaraiki" version of this for the Saraiki extended Shahmukhi. There are actually a number of words which appear in the Guru Granth Sahib which are no longer used in "standard" Punjabi but which have been preserved in Saraiki. The Saraiki version of Shahmukhi also has additional characters for the rarer consonants like ਞ which are still used in some common Saraiki words.

@bhajneet
Member

bhajneet commented Aug 9, 2022

Thank you for this incredible write up! It's very comprehensive and helpful.

My notes from above:

  • As I understand, the ਲ਼ is used for foreign loanwords (a sound not heard in Indian languages historically). Can you give some examples of words using ਲ਼ and their language of origin? Bonus points for the actual word. Using the dictionary you've linked above, so far I'm only finding words that are typically spelled the same way but with a normal ਲ. I'm guessing this is not a major issue, as we bundle Google's Noto in our products.

  • Is the goal of this function to help readers familiar with the nastaliq script pronounce gurmukhi or for shahmukhi speakers to understand the words? (Meaning, is it a speech transcription or a dictionary translation).

    • If it is a transcription, is it possible to create a literal 1-1 transliteration (meaning use toShahmukhi and then use toGurmukhi to get the same word?)
    • Based on the above we can help direct the various functions (including the toSaraiki, which I think is a nice idea).
  • A quick note: if/when using ZWJ/ZWNJ, please denote them explicitly in strings with \uXXXX.

  • Lastly, a minor thing: there are some issues with linting / testing that need to be resolved

I see you have a toShahmukhi repo based on this repo. It used to be in MIT and had to be switched to GPLv3. Do you have a preference to work on MIT projects? As an org we started shifting our repos over to MIT (such as our python gurmukhiutils), but have yet to do that with this project. If you are willing to work on the nastaliq functions from scratch (meaning no copying of tests, no copying of this current implementation, and written by hand yourself), then we can start an MIT version of this repo too. I would be happy to help you out with the set up, so just let me know!

In completely other news, I'm very happy you're contributing as it's rare for us to have anyone help out. Would you mind sharing a little bit about how you got to this project and why you're willing to work on it?

@bgo-eiu
Author

bgo-eiu commented Aug 9, 2022

So ਲ਼ is actually different from the other letters with bindi in that it is supposed to represent a native Punjabi sound - it's a very slight distinction between ਲ and ਲ਼ but it is one that has been present for a long time. Some common words that have the sound are ਚੌਲ਼ (rice) and ਉਂਗਲ਼ (finger), but I think whether or not someone pronounces the sound differently is dialect-dependent (and what they write may not necessarily correspond to how they pronounce the word).

It is definitely optional to use ਲ਼ as many writers have not used it, but this entry in Punjabi University's dictionary is a good example of why they choose to use it in their headwords:
https://dic.learnpunjabi.org/default.aspx?look=%E0%A8%B8%E0%A9%81%E0%A8%86%E0%A8%B2%E0%A8%A3%E0%A8%BE

Verbs are generally an interesting place to look at Punjabi-specific phonetic tendencies since very few Punjabi verbs are loaned from elsewhere; they typically have a Sanskrit origin and are inflected based on rules internal to the language. The dictionary is trying to represent two different dialectal pronunciations of this word, which you can hear in the audio sample: ਸੁਆਲਣਾ vs. ਸੁਆਲ਼ਨਾ. This is telling about what type of sound ਲ਼ is, because its presence means that ਣਾ gets replaced with ਨਾ. ਲ਼ involves a "retroflex" sound that requires more tongue movement than ਲ, which makes it harder to pronounce the nasalized ਣਾ afterwards. This occurs in a variety of verbs; for example, ਕਰਨਾ is not ਕਰਣਾ because ਰ represents another sound which makes it hard to pronounce ਣਾ right afterwards. It follows that if you see a word written ending in -ਲਨਾ rather than -ਲਣਾ, that is a hint that at least some speakers are pronouncing ਲ਼ in the word even if they are not writing it.

Overall it is a small detail, but I figure if a writer is going out of their way to use ਲ਼ instead of ਲ, for clarifying situations like the one above, it would make sense for that to get converted to the Shahmukhi character for the same purpose. (That it is a Punjabi-specific sound is why it had to be added to the Unicode set, whereas there are already characters corresponding to the other bindi letters in Arabic-based scripts because those are related to loan words.) What I am thinking of doing to leave an option for people who may have concerns about script/font support as this is a newly added character is adding a "toUrdu" function which just wraps around "toShahmukhi" using ن and ل instead of the newer characters (limiting it to just the "original" Urdu alphabet).
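The "toUrdu" wrapper idea could look roughly like the sketch below. It is only an illustration: `makeToUrdu` and `URDU_FALLBACKS` are hypothetical names, and the real `toShahmukhi` is passed in as a parameter so the snippet stays self-contained.

```javascript
// Sketch only: swap the newer Shahmukhi letters for their plain Urdu counterparts.
const URDU_FALLBACKS = new Map([
  ['\u08C7', '\u0644'], // ࣇ (arlam) -> ل
  ['\u0768', '\u0646'], // ݨ (arnun) -> ن
]);

// toShahmukhi is supplied by the caller; this wrapper only post-processes its output.
function makeToUrdu(toShahmukhi) {
  return (text) => toShahmukhi(text)
    .replace(/[\u08C7\u0768]/g, (letter) => URDU_FALLBACKS.get(letter));
}
```

A caller would build the function once, for example `const toUrdu = makeToUrdu(toShahmukhi);`, and use it wherever font support for the newer characters is a concern.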

The goal, I would say, is just for Punjabi speakers who are not used to reading Gurmukhi to be able to read resources originally written in it (this may make it easier for them to learn Gurmukhi if they are interested, too). Shahmukhi and Arabic-based scripts generally are much less dependent on phonetics, and most of what Shahmukhi essentially is doing is presenting words in a way that is easy to read for people who are used to reading Urdu. Most native Punjabi speakers in Pakistan never write in Punjabi using any script, even if that is the language they speak 90% of the time, as Urdu is the language required in school / professional settings that involve writing, so it is more a matter of making words people can already pronounce look recognizable to them. If you look at اوہ, you could take apart the letters and say that it is supposed to be pronounced "avo" or "awh" or a number of other things, but that is just what Punjabi speakers in Pakistan would recognize as ਉਹ. They are just looking at what the whole word looks like and the context within the sentence to know how to pronounce it; there's not enough information in the script for Shahmukhi to be readable to someone who doesn't already know Punjabi fluently.

I do intend to eventually get a reverse Shahmukhi-to-Gurmukhi conversion function working, but I have that on hold for now, as fine-tuning the Gurmukhi-to-Shahmukhi process can be done on a much shorter time scale. The reverse conversion is much more challenging and I don't think there's a tool that can quite do it properly - the biggest issue is that single letters are used to represent several sounds. و is used as both a consonant and a vowel, and to figure out whether it should become ਵ or ਔ or ਓ or ਉ or ਊ in Gurmukhi you would have to consider the word in the context of the sentence and/or have a probability table of letter combinations to compare against to determine the most likely character replacement. I think this is doable, but it is a level of complexity that will take some time to work up to. When I do get to it, the code and tests will have to be written from scratch because of the different considerations - I really appreciate you offering to incorporate something like it into the project and can let you know once I have a plan for how to implement it. I do slightly prefer the MIT license where possible, just because it's more flexible in allowing other projects to use it without conflicting with whatever other existing terms they might have.

The initial reason I got to this project and got interested in this problem is that I am learning Punjabi as a second language in order to communicate with my family better. My parents and most of my relatives are native Punjabi speakers of Pakistani heritage. I found it helpful to learn Gurmukhi to start learning the language from books, especially since it includes distinctions about the phonetics of Punjabi, but at the moment I can't share most of what I've been reading with any of the people I'm trying to speak Punjabi with, since they've never been exposed to it. Looking into it more, this seems like a problem that is solvable with an open source tool but hasn't been taken quite all the way yet - I learned that the Serbian Wikipedia has a transliteration tool to switch between the two scripts Serbians use, Cyrillic and Latin. That way people contributing only have to write it once in one script, and anybody can read it regardless of which Serbian script they write with. Punjabi doesn't have this yet, and any website or software that has both Gurmukhi and Shahmukhi language options has split sets of translation strings, meaning twice the work is required to provide content in one language. My more ambitious goal is to have it be as easy to switch between scripts in an application as it is now for Serbian and likely some other languages which use multiple scripts.

Good point on ZWNJ, when I use it I will note it that way. I am about to fix the testing / linting issues.

@bhajneet
Member

  • Thank you for the background on the ਲ਼. With this information, it makes more sense and I agree with your approach! Since it's optional, it is better to provide support where it is actually used.
  • I also am thankful for learning about these retroflex pronunciations in combination with the n sound. Very interesting! I also agree that l and r sounds are pronounced very similarly (some languages do not even differentiate between them like Japanese), so this logic tracks pretty well. I appreciate your explanation.
  • The way I was designing gurmukhiutils (the python one) is to differentiate between transliteration (goal of 1-1 reverse mappability) and pronunciation. The first is a script mapping, which does not have to be perfect for pronunciation (it may not make sense to the average reader) but allows for better back and forth. The second allows for many-to-one or one-to-many character mappings for readers/speakers. With this in mind, I do want to start working on a new major version of this JS library at some point which makes the distinction between transliteration (script mapping) and pronunciation (speech/reading). Pronunciation can be a very deep rabbit hole, as it can get to the point where you are parsing syllables for schwa deletion (based on your messages, it seems you are familiar with this concept), and schwa deletion is not a perfect science. All in all, I consider transliteration to be a script change (gurmukhi to latin/roman or gurmukhi to nastaliq) and I am considering pronunciation to be something like gurmukhi to shahmukhi/saraiki or gurmukhi to American English or Spanish. I also agree that pronunciation can be easily fixed if the script change is mostly done properly. Meaning you can do a transliteration, and then do pronunciations for Shahmukhi / Saraiki based on that (since transliteration does not lose any of the original data).
  • I know this was a lot of conversing back and forth. It has definitely been rewarding for me to learn about all this from you. This PR looks good to me. If you can just fix the lint/tests I'm sure @Harjot1Singh will agree we can merge this in. If you are having any trouble with the linting / testing, just let me know I'm happy to work on this branch with you to get it fixed up so we can merge it in!
  • If you want to talk about the next major version of gurmukhi-utils (javascript) let me know! It'll be MIT and will be written from scratch. I think there are definitely some better design decisions we can make for this library. If you want to connect with me on slack, please join via our shared invite. We have a #gurmukhi-utils channel and you can also DM me @bhajneet

@Harjot1Singh Harjot1Singh linked an issue Aug 17, 2022 that may be closed by this pull request
@Harjot1Singh
Member

Harjot1Singh commented Aug 17, 2022

@bgo-eiu apologies I've not been able to reply sooner! Really grateful for you sharing this knowledge and your contribution!

Let me know if you need assistance fixing the test cases or linting

@sarabveer sarabveer self-requested a review August 20, 2022 23:37
@sarabveer
Collaborator

sarabveer commented Aug 20, 2022

Wow, this is great work; this contribution is greatly welcomed! Not being a Shahmukhi reader myself, I was not able to work on the issue at all.

Quick question, I have been thinking for a while to remove the replacement that happens here: https://github.com/bgo-eiu/gurmukhi-utils/blob/patch-1/lib/toShahmukhi.js#L178.

For example, there are words like: ਮੁਖਹੁ or ਸਿਮਰਿ. These are not found in regular Punjabi, but in older texts that employ the Gurmukhi script. Currently, the endings are stripped so they become, for example: ਮੁਖਹ or ਸਿਮਰ.

If the endings are not stripped, would that interfere with the transliteration in any way?

@bgo-eiu
Author

bgo-eiu commented Aug 22, 2022

@sarabveer I was actually going to ask about removing strip endings - it would be helpful to know how words like ਮੁਖਹੁ are meant to be pronounced.

It is conventional to omit vowel diacritics in regular writing in Arabic-based scripts, and if anything Urdu and Shahmukhi writers do this more than Arabic writers. This can become confusing, because in Arabic, it is not typical for a word to end in a short vowel, but in Punjabi, short vowel endings are common and can be important for distinguishing words. For a transliteration function, it makes sense to preserve those vowels because you can always strip them after if you prefer, but you cannot add them back to an ambiguous word without context. So I would lean towards not stripping the endings preliminarily, but if these particular older words are meant to be pronounced differently than written, I could incorporate some rules for those since it will be necessary to account for some exceptions anyway. Some interesting examples from Punjabi University's dictionary:

  • ਸਹਿ <-> سہِ
  • ਆਦਿ <-> آد
  • ਕਿ <-> کہ

In the audio samples, you can hear that the vowel at the end of ਆਦਿ is not really being said, and so there would be no reason to indicate it in Shahmukhi. Words like the first one where the vowel is pronounced are more common though and do benefit from this kind of clarification. ਕਿ is a very common word which has a weird Shahmukhi spelling کہ where the ending vowel is important enough that it is represented by a whole consonant letter choti he ہ so it cannot be omitted. (I had not thought of it before, but this is kind of like the Punjabi equivalent of ta marbouta ة in Arabic where a consonant is used for certain vowel endings.) There is a logic to it and I think it may be as simple as applying to any word ending in ਕ followed by a short vowel but I want to investigate a little bit more to be sure.
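To illustrate the "you can always strip them after" point: if the transliteration keeps the short vowel marks, removing them later is a one-line pass, whereas restoring them is not possible without context. This is a sketch with a hypothetical name (`stripShortVowels`); it removes only zabar, pesh and zer, and leaves other marks such as the geminating ّ in place.

```javascript
// Sketch only: optional post-processing for readers who prefer unvowelled text.
// Covers zabar (U+064E), pesh (U+064F) and zer (U+0650).
const SHORT_VOWEL_MARKS = /[\u064E-\u0650]/g;

function stripShortVowels(shahmukhi) {
  return shahmukhi.replace(SHORT_VOWEL_MARKS, '');
}
```

For example, سہِ (with a final zer) would come out as سہ, while a word spelled with a full letter for its final vowel, such as کہ, is untouched.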

@bgo-eiu
Author

bgo-eiu commented Aug 22, 2022

Sorry to have gone quiet for a bit, I realized I had to go back to the drawing board to address some issues in testing it, but it's closer to where I intended now. I've been feeding in various Gurmukhi strings and noticed some details I had missed before. I intend to update the unit tests with some strings extracted from different sources, like the poems on the Punjabi Kavita site in both Gurmukhi and Shahmukhi, as there are some quirks which may not be obvious when writing the strings manually.

  • Some fixed-width fonts make ہ (Arabic small he), ہ (Punjabi/Urdu choti he) and ھ (Punjabi/Urdu do chashmi he) look nearly identical. So there were some strings that looked just right until I took them out of the terminal, rendered them in a proper Nastaliq font, and saw the letters were not what they seemed. For editing, Noto doesn't have a fixed-width font addressing these concerns as far as I'm aware, but NoName Fixed is really good and even includes the newer extended Shahmukhi letters: https://github.com/aliftype/noname-fixed
  • I had no idea that the JavaScript implementation for regular expressions does not reliably discern word breaks in non-Latin scripts, so \b will not necessarily match the beginning of every Gurmukhi word. So I have replaced \b with a regular expression which matches any character outside of the Gurmukhi Unicode range.
  • ਜ਼ and ਜ਼ look identical and various inputs and renderings treat them identically. The second one is formed from ਜ + ਼ however, so there had to be separate rules to normalize the detached bindis.
  • The geminating diacritic in Gurmukhi is ੱ and in Shahmukhi is ّ . This is quite important as for example ਇੱਕ and ਇਕ are not the same, and ੱ gets dropped in causative verb constructions. What makes this different from everything else though, is you cannot actually swap with a 1-for-1 rule. ਅੱਗ would become اگّ rather than اّگ, with the marker coming after the consonant rather than before. ਅੱਥ would become اتھّ, after the combining ھ and not just the consonant. The workaround now is to make ਅੱਗ into ਅਗੱ before anything else happens.
  • There was a very interesting word in the existing tests that the Gulshaan-e-Urdu Gurmukhi conversion guide has a section about:
    IMG_20220810_172501_975
    ਖ਼ੁ in Gurmukhi always represents the character combination خوِ in Persian loan words. There would be no way to tell that the zer is under the wau وِ following the regular rules but this is just a weird quirk in how Gurmukhi has represented these words.

…mmented out replace endings to see how that looks
@lgtm-com

lgtm-com bot commented Aug 22, 2022

This pull request introduces 1 alert when merging 7829ab4 into 24a2e06 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

@lgtm-com

lgtm-com bot commented Aug 22, 2022

This pull request introduces 1 alert when merging 7c999f7 into 24a2e06 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

@bhajneet
Member

Goal of transliteration: To be able to convert gurmukhi script to a different script and then back to gurmukhi as close as possible. It does not have to make 100% sense to the reader, but where it can be sensical it should be.

Goal of transcription: To help speakers of a certain language pronounce gurmukhi script. There are questions regarding whether some of the short vowels at the end of words are to be pronounced or not. However by common convention there are many areas they are not pronounced. It's actually funny because gurmukhi script is employed for multiple languages, so to use the rules of punjabi for all the languages is a potential pitfall (though very commonly accepted as non-problematic).


Sorry to have gone quiet for a bit, I realized I had to go back to the drawing board to address some issues in testing it, but it's closer to where I intended now. I've been feeding in various Gurmukhi strings and noticed some details I had missed before. I intend to update the unit tests with some strings extracted from different sources, like the poems on the Punjabi Kavita site in both Gurmukhi and Shahmukhi, as there are some quirks which may not be obvious when writing the strings manually.

I would recommend portioning out the unit tests and commenting them with the sources. For example if you're getting some unit test examples from the Punjabi Kavita site, group them together and prepend that block with a comment saying so. This is what I've done here for example:

image


  • Some fixed-width fonts make ہ (Arabic small he), ہ (Punjabi/Urdu choti he) and ھ (Punjabi/Urdu do chashmi he) look nearly identical. So there were some strings that looked just right until I took them out of the terminal, rendered them in a proper Nastaliq font, and saw the letters were not what they seemed. For editing, Noto doesn't have a fixed-width font addressing these concerns as far as I'm aware, but NoName Fixed is really good and even includes the newer extended Shahmukhi letters: https://github.com/aliftype/noname-fixed

If you're using VS Code, you're not limited to using fixed-width fonts in your editor. I personally use the following fonts:

image

This means if the character is not found in SF Pro, it will try to render it with Sant Lipi. If it's not found in Sant Lipi, then it tries to render it with Noto Sans Gurmukhi. And so on and so forth. You can set your main monospace font at the beginning, and for any characters it does not contain (I honestly have never heard of a monospace font with nastaliq characters 😆 ), you can set up a fallback font, which can be proportional width (not required to be monospace).

Just sharing this in case you didn't know; it may help you too.


  • I had no idea that the JavaScript implementation for regular expressions does not reliably discern word breaks in non-Latin scripts, so \b will not necessarily match the beginning of every Gurmukhi word. So I have replaced \b with a regular expression which matches any character outside of the Gurmukhi Unicode range.

That is very interesting, I'll have to keep that in mind, thank you.
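For anyone following along, here is a small sketch of the word-break issue. In JavaScript, `\b` is defined in terms of `\w`, which by default covers only ASCII letters, digits and underscore, so it finds no boundary next to a Gurmukhi letter. The version below uses lookarounds on the Gurmukhi block (U+0A00 to U+0A7F) instead, which is a variation on the approach described in the quote rather than the PR's exact expression; the constant and function names are assumptions, and lookbehind needs a reasonably recent engine.

```javascript
// Sketch only: Gurmukhi-aware word edges via lookarounds instead of \b.
const GURMUKHI = '[\\u0A00-\\u0A7F]';
const INITIAL_U = new RegExp(`(?<!${GURMUKHI})\\u0A09`, 'g'); // ਉ not preceded by Gurmukhi
const FINAL_U = new RegExp(`\\u0A09(?!${GURMUKHI})`, 'g'); // ਉ not followed by Gurmukhi

// \b finds no boundary before ਉ because ਉ is not a \w character.
const asciiBoundaryMatches = /\b\u0A09/.test('\u0A09\u0A38'); // false

function replaceIndependentU(text) {
  return text
    .replace(INITIAL_U, '\u0627\u064F') // اُ at word beginnings
    .replace(FINAL_U, '\u0648'); // و at word endings
}
```

Lookarounds do not consume the neighbouring character, so adjacent words separated by a single space are both matched in one pass.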


  • ਜ਼ and ਜ਼ look identical and various inputs and renderings treat them identically. The second one is formed from ਜ + ਼ however, so there had to be separate rules to normalize the detached bindis.

Avoid writing rules for ਜ + ਼ . The combined character exists as a proper Unicode code point. The input text should be sanitized with a different function to normalize gurmukhi if needed.

In short, assume normalized/proper gurmukhi input for your function.
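A separate sanitizing step of that kind could look roughly like the sketch below. The function name `composeNukta` and the lookup table are hypothetical, not part of this library. An explicit table is used because, as far as I can tell, these precomposed Gurmukhi letters are excluded from Unicode composition, so `String.prototype.normalize('NFC')` would not produce them.

```javascript
// Base letter -> precomposed nukta letter (base + U+0A3C).
const PRECOMPOSED_NUKTA = {
  '\u0A16': '\u0A59', // ਖ + ਼ -> ਖ਼
  '\u0A17': '\u0A5A', // ਗ + ਼ -> ਗ਼
  '\u0A1C': '\u0A5B', // ਜ + ਼ -> ਜ਼
  '\u0A2B': '\u0A5E', // ਫ + ਼ -> ਫ਼
  '\u0A32': '\u0A33', // ਲ + ਼ -> ਲ਼
  '\u0A38': '\u0A36', // ਸ + ਼ -> ਸ਼
};

// Hypothetical pre-processing step: fold detached bindis into the
// single-code-point letters before any transliteration rule runs.
function composeNukta(text) {
  return text.replace(
    /([\u0A16\u0A17\u0A1C\u0A2B\u0A32\u0A38])\u0A3C/g,
    (match, base) => PRECOMPOSED_NUKTA[base],
  );
}
```

With this run first, the transliteration function only ever sees one spelling of each nukta letter.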


  • The geminating diacritic in Gurmukhi is ੱ and in Shahmukhi is ّ . This is quite important as for example ਇੱਕ and ਇਕ are not the same, and ੱ gets dropped in causative verb constructions. What makes this different from everything else though, is you cannot actually swap with a 1-for-1 rule. ਅੱਗ would become اگّ rather than اّگ, with the marker coming after the consonant rather than before. ਅੱਥ would become اتھّ, after the combining ھ and not just the consonant. The workaround now is to make ਅੱਗ into ਅਗੱ before anything else happens.

No need to swap 1-for-1, but rather to be able to go back and forth with whatever rules you come up with. This seems to be a non-issue to me. But a good note to know!
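The reordering workaround described in the quote (ਅੱਗ becoming ਅਗੱ before any other rule runs) can be sketched as below. This is an illustration under assumptions, not the PR's code: `moveAdhakAfterConsonant` is a hypothetical name, and the consonant class is simply taken as U+0A15 to U+0A39 plus U+0A59 to U+0A5E.

```javascript
const ADHAK = '\u0A71'; // ੱ, the Gurmukhi geminating diacritic

// Move each adhak from before the consonant it doubles to after it.
// Because an aspirate such as ਥ is a single Gurmukhi letter, a later
// per-letter mapping (ਥ -> تھ, ੱ -> ّ) then leaves the Shahmukhi mark
// after the combining ھ, as the quote requires.
function moveAdhakAfterConsonant(text) {
  return text.replace(/\u0A71([\u0A15-\u0A39\u0A59-\u0A5E])/g, `$1${ADHAK}`);
}
```

Going back from Shahmukhi would need the inverse swap, which is why the rule stays reversible even though it is not a 1-for-1 character substitution.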

@lgtm-com

lgtm-com bot commented Aug 23, 2022

This pull request introduces 1 alert when merging dc6e851 into 24a2e06 - view on LGTM.com

new alerts:

  • 1 for Unused variable, import, function or class

@sarabveer
Collaborator

@sarabveer I was actually going to ask about removing strip endings - it would be helpful to know how words like ਮੁਖਹੁ are meant to be pronounced.

It is conventional to omit vowel diacritics in regular writing in Arabic-based scripts, and if anything Urdu and Shahmukhi writers do this more than Arabic writers. This can become confusing, because in Arabic, it is not typical for a word to end in a short vowel, but in Punjabi, short vowel endings are common and can be important for distinguishing words. For a transliteration function, it makes sense to preserve those vowels because you can always strip them after if you prefer, but you cannot add them back to an ambiguous word without context. So I would lean towards not stripping the endings preliminarily, but if these particular older words are meant to be pronounced differently than written, I could incorporate some rules for those since it will be necessary to account for some exceptions anyway. Some interesting examples from Punjabi University's dictionary:

  • ਸਹਿ <-> سہِ
  • ਆਦਿ <-> آد
  • ਕਿ <-> کہ

In the audio samples, you can hear that the vowel at the end of ਆਦਿ is not really being said, and so there would be no reason to indicate it in Shahmukhi. Words like the first one where the vowel is pronounced are more common though and do benefit from this kind of clarification. ਕਿ is a very common word which has a weird Shahmukhi spelling کہ where the ending vowel is important enough that it is represented by a whole consonant letter choti he ہ so it cannot be omitted. (I had not thought of it before, but this is kind of like the Punjabi equivalent of ta marbouta ة in Arabic where a consonant is used for certain vowel endings.) There is a logic to it and I think it may be as simple as applying to any word ending in ਕ followed by a short vowel but I want to investigate a little bit more to be sure.

These ending vowels indicate grammar in Sri Guru Granth Sahib. At the time, the information I had available made me implement the omission. But after doing research, these vowels are supposed to be part of pronunciation (even if the masses do not pronounce them).

So ਆਦਿ is pronounced ਆਦਿ, not ਆਦ.

Good to know we agree.
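The "preserve now, strip later" argument in the quote can be sketched as an optional post-processing step. The name `stripShortVowels` is hypothetical, and the sketch assumes the short vowels are written with the Arabic marks U+064E, U+064F and U+0650.

```javascript
// Arabic short-vowel marks: zabar (U+064E), pesh (U+064F), zer (U+0650).
const SHORT_VOWEL_MARKS = /[\u064E\u064F\u0650]/g;

// Hypothetical optional step: drop short vowels from Shahmukhi output
// for readers who prefer conventional unvocalized spelling. The reverse
// is not possible without context, which is why the transliteration
// itself keeps the marks.
function stripShortVowels(shahmukhi) {
  return shahmukhi.replace(SHORT_VOWEL_MARKS, '');
}
```

Vowels carried by a full letter, such as the ہ in کہ, are untouched by this step.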

@bgo-eiu
Author

bgo-eiu commented Aug 23, 2022

Thank you for the helpful feedback, I will factor out the nukta normalization since it makes sense that this would be out of scope for the function. Annotating the tests like that is a good idea as well.

I have been thinking about the transliteration / transcription distinction and while the principle of producing an output that could be converted back to Gurmukhi is important, I am increasingly thinking that we cannot really separate these ideas for Shahmukhi. This really has to do with the relationship between Gurmukhi and Shahmukhi being different than one of simply being two scripts. We can say "Gurmukhi Lipi" and "Shahmukhi Lipi" without being redundant because these writing systems have been metaphorically named as if they were spoken languages, coming from the "mouth." This is fitting especially when we consider what makes a Shahmukhi representation different from any other Arabic script. Gurmukhi makes finer distinctions in representing pronunciation that allow for pronunciation spellings within the same language, but Shahmukhi, for example, may use و to represent any of ੳ ਉ ਊ ਓ ੁ ੂ ਵ ੍ਵ. When Shahmukhi does make a pronunciation clarification, it is not quite for the purpose of telling the reader how to pronounce the word, but which Punjabi word to pronounce. The system assumes that the reader must already know how to speak and pronounce Punjabi in order to read it. To that end, we can only transliterate Punjabi from Gurmukhi to Shahmukhi and back based on Punjabi pronunciation rules, and a readable Shahmukhi output and an output we can convert back to Gurmukhi are the same thing because the underlying logic comes from the language itself and not anything we can tell from the letters alone. I do not think there is really more than one way to convert to Shahmukhi, because the system has no way to indicate pronunciation independent of language, only in the context of language.

If we take the word منڈیر‎, this is either the Hindi/Urdu word मुँडेर, meaning a ridge used in building a retaining wall, or the Punjabi word ਮੁੰਡੀਰ, meaning a group of boys, gang, crew etc. So when we write مُنڈیر, the vowel clarification is just to tell the reader that this is ਮੁੰਡੀਰ, assuming they already know the word. If there was a word ਮੁੰਡੇਰ or ਮੁੰਡੈਰ or even ਮੁੰਡਯਰ, then منڈیر‎ or مُنڈیر would represent these also. ਮੁੰਡਯਰ would be impossible to distinguish from ਮੁੰਡੀਰ even if we wanted to. This is usually not that confusing though, because the language speaker only needs to know which word to say and not all the information about how to say it. What may be helpful to do is put together a little reference table of Shahmukhi -> Gurmukhi pairs to visualize the various one-to-many mappings. There may be some patterns observable which are not obvious when considering combinations of characters.
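A reference table of that kind could be derived by inverting the forward rules, roughly as sketched here. The only data used is the و example given above; `FORWARD_PAIRS` and `invertPairs` are hypothetical names, and mapping every source to a bare و is a simplification, since the real rules are position-dependent.

```javascript
const WAW = '\u0648'; // و

// Gurmukhi -> Shahmukhi pairs, taken only from the و example above.
const FORWARD_PAIRS = [
  ['\u0A73', WAW], // ੳ
  ['\u0A09', WAW], // ਉ
  ['\u0A0A', WAW], // ਊ
  ['\u0A13', WAW], // ਓ
  ['\u0A41', WAW], // ੁ
  ['\u0A42', WAW], // ੂ
  ['\u0A35', WAW], // ਵ
  ['\u0A4D\u0A35', WAW], // ੍ਵ
];

// Group the forward pairs by their Shahmukhi output to expose the
// one-to-many mappings (Shahmukhi -> list of Gurmukhi sources).
function invertPairs(pairs) {
  const table = new Map();
  for (const [gurmukhi, shahmukhi] of pairs) {
    if (!table.has(shahmukhi)) table.set(shahmukhi, []);
    table.get(shahmukhi).push(gurmukhi);
  }
  return table;
}

const reverseTable = invertPairs(FORWARD_PAIRS);
```

Any key in `reverseTable` with more than one entry marks a spot where converting back to Gurmukhi needs knowledge of the word rather than of the letters.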

Re: non-fixed-width fonts in editors: I have tried this, but the problem is that the extended characters used in languages spoken in Pakistan are typically only included in Nastaliq fonts, which don't work as fallbacks from other fonts, and which involve character connections that editors often cannot even render without clipping them. For example, in ب + ے , the first letter goes under and after the second:
image
Trying to edit a regular expression like this is not ideal. So as silly as it sounds, it really is very helpful that NoName Fixed exists with the fixed width version of these characters.

@sarabveer
Collaborator

sarabveer commented Aug 23, 2022

Also, I am not sure if ੵ (U+0A75) and ੍ਵ (U+0A4D U+0A35) are mentioned.

In pronunciation, these are pronounced as follows:

ਰਖੵਾ => ਰਖਿਆ
Essentially, a ਿ (U+0A3F) and ਅ replace it.

ਤ੍ਵ => ਤੁਅ
Essentially, a ੁ (U+0A41) and ਅ replace it.

Not sure how that would work in Shahmukhi.

@bgo-eiu
Author

bgo-eiu commented Aug 27, 2022

These are good examples which I was not aware of. Do you know any text samples which use these words, or describe their meaning? That might help to track down a Shahmukhi source that has used them.

My initial thoughts are:

ਰਖੵਾ = رکھیا
The basis being that -ਗਿਆ endings are spelled گیا rather than گِا because although this is a short vowel, it is too important to allow the writer to omit it for this combination of letters. If someone were to write گا, it would change the meaning if we allowed optional vowels. A speaker of Urdu/Hindi not familiar with Punjabi would pronounce گیا as ਗੀਆ, but a Punjabi speaker would recognize what sound this is supposed to be. This combination follows a similar pattern with کھ in place of گ, and further the fact that ਖ requires two characters in Shahmukhi makes indicating the ending with a combining character awkward.

ਤ੍ਵ = توَ
This is tricky but the only way I can think of to write this at the moment. The short vowel on top in وَ is the rarest to see indicated in Shahmukhi because it is usually redundant. Only where it is truly needed to tell a word apart from others is it used. Many words which end in -ਵ are spelled with وَ wherever plain و could be mistaken for ਊ. This seems similar, because if we wrote تو that would be read as ਤੂ, توا would be ਤਵਾ, and توہ would be ਤੋਹ. So توَ would seem like the most appropriate Shahmukhi spelling. تُء is technically possible, but very unlikely as standalone ء is only ever used in some obscure, usually loaned words.

@sarabveer
Collaborator

sarabveer commented Aug 28, 2022

These are good examples which I was not aware of. Do you know any text samples which use these words, or describe their meaning? That might help to track down a Shahmukhi source that has used them.

My initial thoughts are:

ਰਖੵਾ = رکھیا The basis being that -ਗਿਆ endings are spelled گیا rather than گِا because although this is a short vowel, it is too important to allow the writer to omit it for this combination of letters. If someone were to write گا, it would change the meaning if we allowed optional vowels. A speaker of Urdu/Hindi not familiar with Punjabi would pronounce گیا as ਗੀਆ, but a Punjabi speaker would recognize what sound this is supposed to be. This combination follows a similar pattern with کھ in place of گ, and further the fact that ਖ requires two characters in Shahmukhi makes indicating the ending with a combining character awkward.

ਰਖਿਆ means protection, defense. It can also be spelled out as ਰੱਖਿਆ.

According to the Punjabi University transliterator, they do as follows:

ਰੱਖਿਆ => رکھیا

(However this includes the adhak in the translit, so not sure how ਰਖਿਆ is spelt in Shahmukhi).

Another example is ਆਗੵਿ => ਆਗਿਆ آگیا

What could be done is for this case with ੵ (U+0A75), the word can be transformed into the pronunciation form (ਰਖਿਆ, ਆਗਿਆ, etc) and then put into the Shahmukhi transliteration.

I think it's better that way, as there are edge cases I know of with this character where it may be difficult to implement directly, and it will make it easier for you.
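The pre-processing idea for ੵ could be sketched as follows. Only the two patterns shown in this thread (ਰਖੵਾ => ਰਖਿਆ and ਆਗੵਿ => ਆਗਿਆ) are covered; `expandYakash` and its rule list are hypothetical and make no attempt at the other edge cases mentioned.

```javascript
// Rules inferred from the two examples in this thread only.
const YAKASH_RULES = [
  [/\u0A75\u0A3E/g, '\u0A3F\u0A06'], // ੵ + ਾ -> ਿ + ਆ
  [/\u0A75\u0A3F/g, '\u0A3F\u0A06'], // ੵ + ਿ -> ਿ + ਆ
];

// Hypothetical step: rewrite yakash spellings into their pronunciation
// form so the general Shahmukhi rules can handle them afterwards.
function expandYakash(text) {
  return YAKASH_RULES.reduce(
    (result, [pattern, replacement]) => result.replace(pattern, replacement),
    text,
  );
}
```

A ੵ in any other context is left as-is, so unhandled cases stay visible instead of being silently guessed.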

ਤ੍ਵ = توَ This is tricky but the only way I can think of to write this at the moment. The short vowel on top in وَ is the rarest to see indicated in Shahmukhi because it is usually redundant. Only where it is truly needed to tell a word apart from others is it used. Many words which end in -ਵ are spelled with وَ wherever plain و could be mistaken for ਊ. This seems similar, because if we wrote تو that would be read as ਤੂ, توا would be ਤਵਾ, and توہ would be ਤੋਹ. So توَ would seem like the most appropriate Shahmukhi spelling. تُء is technically possible, but very unlikely as standalone ء is only ever used in some obscure, usually loaned words.

Here is an example:

ਤ੍ਵ ਪ੍ਰਸਾਦਿ - means by your(ਤ੍ਵ) grace(ਪ੍ਰਸਾਦਿ).

According to the Punjabi University transliterator, they do as follows:

ਤ੍ਵ => تو

The phrase "ਤ੍ਵ ਪ੍ਰਸਾਦ" might also be in this photo (3rd line from top), seems like they are using تو.

image

Another example is ਬਿਸ਼੍ਵਾਸ => ਬਿਸ਼ੁਆਸ بشواس

@bgo-eiu
Author

bgo-eiu commented Aug 28, 2022

Adhak would be indicated on رکھیا as رکھّیا. Most writers in practice would omit any indication of it unless absolutely necessary. Punjabi University's dictionary and transliteration tool are oddly inconsistent about this - sometimes they include the ّ character, sometimes they do not, and sometimes they put it in the wrong place (کّھ is incorrect and should be کھّ but they have this swapped often.) It makes sense for a transliteration tool to always include it in the output though, since this information makes it easier to convert back and we can just remove this from the output if we want.
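Both points here, repairing the swapped order and optionally removing the mark from the output, can be sketched as small post-processing helpers. The names `fixShaddaPlacement` and `stripShadda` are hypothetical; the sketch assumes ّ is U+0651 and ھ is U+06BE.

```javascript
const SHADDA = '\u0651'; // ّ
const DO_CHASHMI_HE = '\u06BE'; // ھ

// Repair the swapped order (کّھ -> کھّ): the shadda belongs after the
// combining ھ, not between the consonant and ھ.
function fixShaddaPlacement(text) {
  return text.replace(/\u0651\u06BE/g, DO_CHASHMI_HE + SHADDA);
}

// Optional step for output that follows everyday writing practice,
// where the mark is usually omitted.
function stripShadda(text) {
  return text.replace(/\u0651/g, '');
}
```

Keeping the mark in the default output and stripping it on request preserves the information needed to convert back to Gurmukhi.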

I do agree converting to the Gurmukhi pronunciation spelling would make things easier as then the yakash words would be covered by more general rules. I am interested in making sure the edge cases work, but it will be easier to spend time on those once the function covers the 95% or so of word forms which can be derived from the general rules.

I see تو پرساد there, in both of the ੍ਵ examples I am leaning towards simply transcribing as و because adding anything to it like وَ may be confused for indicating a consonant sound. In context a reader would be able to tell تو is not ਤੂ because of the presence of پرساد.

@sarabveer
Collaborator

Adhak would be indicated on رکھیا as رکھّیا. Most writers in practice would omit any indication of it unless absolutely necessary. Punjabi University's dictionary and transliteration tool are oddly inconsistent about this - sometimes they include the ّ character, sometimes they do not, and sometimes they put it in the wrong place (کّھ is incorrect and should be کھّ but they have this swapped often.) It makes sense for a transliteration tool to always include it in the output though, since this information makes it easier to convert back and we can just remove this from the output if we want.

Good to know.

I do agree converting to the Gurmukhi pronunciation spelling would make things easier as then the yakash words would be covered by more general rules. I am interested in making sure the edge cases work, but it will be easier to spend time on those once the function covers the 95% or so of word forms which can be derived from the general rules.

Yea I can look into implementing this.

I see تو پرساد there, in both of the ੍ਵ examples I am leaning towards simply transcribing as و because adding anything to it like وَ may be confused for indicating a consonant sound. In context a reader would be able to tell تو is not ਤੂ because of the presence of پرساد.

Alright, so و it is.

@bhajneet bhajneet closed this Jul 15, 2023
Development

Successfully merging this pull request may close these issues.

Add more Shahmukhi Rules
4 participants