blutorange/match_kanji_kana


Marvellous Moonglyphs: Match Kana and Kanji

for http://meta.codegolf.stackexchange.com/questions/1847/sandbox-for-proposed-challenges/2123#2123

Dictionary files can also be found in the dic folder.

##Tutorial

Do not read any further if you want to challenge yourself, or do the research yourself.

Let us consider the simple example again:

  • 成田 [Narita, name of a town]
  • なりた [na-ri-ta]
  • 成,田
  • [成,なり],[田,た]

成 can be read なり (nari) and せい (sei), 田 as た (ta) and でん (den). The reading of 成田 is なりた (narita). Thus:

                  / た(TA)   ─ せいた(SEITA)
     / せい(SEI) ─ 田
    /             \ でん(DEN) ─ せいでん(SEIDEN)
   /
  /
成         
  \
   \
    \             / た(TA)   ─ なりた(NARITA)
     \ なり(NARI)─ 田
                  \ でん(DEN) ─ なりでん(NARIDEN)

We conclude that 成 must be read なり (nari), and 田 as た (ta).

A slightly more advanced example:

  • 成田空港 [Narita Airport]
  • なりたくうこう [na-ri-ta-ku-u-ko-u]
  • 成田, 空港
  • [成田,なりた] , [空港,くうこう]

Among others, 成 can be read なり (nari), 田 as た (ta), 空 as くう (kuu), and 港 as こう (kou). Therefore, the correct answer in this case is

  • [成田,なりた], [空港,くうこう]

成田 is read as なりた (narita), and 空港 as くうこう (kuukou)

###Implementing the basic feature

Hint: It is more or less a tree search. Each reading is a branch.

I have broken this feature down into four steps that are intended as a road map. I believe this will make it easier for you to get acquainted with the topic and the problem, but all you need to do is implement step 4. Feel free to implement this however you see fit.

Step 1

Traverse the tree.

Take the input as described above, and determine whether or not there exists at least one combination of each MOONGLYPH's reading, such that the READING may be obtained. Feel free to ignore the PARTs for this step.

Examples:

(1)

  • 山起
  • やまおこし [ya-ma-o-ko-shi]
  • 山起
  • True [because 山 = ya-ma; and 起 = o-ko-shi]

(2)

  • 静岡
  • つき [tsu-ki]
  • 静岡
  • False

The readings of 静 are sei, jou, shizu, and shidzu.

The readings of 岡 are kou and oka.

There is no way to combine these readings and get tsu-ki. Thus, your program should indicate that there is no match.
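As a rough illustration of Step 1, the following sketch walks the tree depth-first and only answers whether a match exists. The `READINGS` table is a hypothetical toy dictionary written out by hand; a real program would build it from KANJIDIC. The function name `has_match` is likewise an assumption, not part of the challenge.

```python
from functools import lru_cache

# Hypothetical toy dictionary (hiragana only); a real one comes from KANJIDIC.
READINGS = {
    "山": ["さん", "やま"],
    "起": ["き", "おこし"],
    "静": ["せい", "じょう", "しず", "しづ"],
    "岡": ["こう", "おか"],
}

def has_match(glyphs: str, reading: str) -> bool:
    """Return True if one reading per glyph can be chosen to spell `reading`."""

    @lru_cache(maxsize=None)
    def search(i: int, pos: int) -> bool:
        # All glyphs consumed: success only if the reading is consumed too.
        if i == len(glyphs):
            return pos == len(reading)
        # Each reading of the current glyph is one branch of the tree.
        return any(
            reading.startswith(r, pos) and search(i + 1, pos + len(r))
            for r in READINGS.get(glyphs[i], [])
        )

    return search(0, 0)

print(has_match("山起", "やまおこし"))  # True
print(has_match("静岡", "つき"))        # False
```

Caching on `(i, pos)` is optional here; it merely avoids exploring the same subtree twice for long inputs.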

Step 2

Remember your position in the tree.

Also output the reading for each MOONGLYPH. You may still ignore PARTs. This means

(1)

  • 山起
  • やまおこし [ya-ma-o-ko-shi]
  • 山,起
  • [山,やま], [起,おこし]

(2)

  • 静岡
  • つき [tsu-ki]
  • 静,岡
  • []

Step 3

Score +15

Traverse the entire tree, and collect all solutions.

Output every possible combination. PARTs may still be ignored.

  • 網岬 [this is not a real word, it's pretty hard to come up with something]
  • あみさき [a-mi-sa-ki]
  • 網,岬
  • [[網,あみ],[岬,さき]], [[網,あ],[岬,みさき]]

Among others, 網 can be read あみ (ami) and あ (a).

Among others, 岬 can be read みさき (misaki) and さき (saki).

Thus, either 網 is read ami and 岬 saki; or 網 is read a and 岬 misaki. Both possibilities would result in the combined reading amisaki for 網岬.
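Collecting every solution, as Steps 2 and 3 ask, could look roughly like the sketch below. It keeps the current path while descending and yields a copy whenever the whole reading has been consumed. `READINGS` is again a hypothetical hand-written table, and `all_matches` is an illustrative name.

```python
# Hypothetical toy dictionary for the made-up word from the example.
READINGS = {
    "網": ["あみ", "あ"],
    "岬": ["みさき", "さき"],
}

def all_matches(glyphs, reading):
    """Yield every list of (glyph, reading) pairs that spells `reading`."""

    def search(i, pos, path):
        if i == len(glyphs):
            if pos == len(reading):
                yield list(path)  # copy, because `path` is mutated below
            return
        for r in READINGS.get(glyphs[i], []):
            if reading.startswith(r, pos):
                path.append((glyphs[i], r))
                yield from search(i + 1, pos + len(r), path)
                path.pop()  # backtrack and try the next branch

    yield from search(0, 0, [])

for solution in all_matches("網岬", "あみさき"):
    print(solution)
```

An empty result plays the role of the `[]` output shown for 静岡 in Step 2.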

Step 4

Score +10

Output the reading for each PART, and not for each kanji. Remember that all PARTs combined, in the order they were given, will result in MOONGLYPH.

(1)

  • 岩手県宮古市
  • いわてけんみやこし [iwatekenmiyakoshi]
  • 岩手県,宮古市
  • [岩手県,いわてけん], [宮古市,みやこし]

岩手県 is read いわてけん (iwateken), and 宮古市 as みやこし (mi-ya-ko-shi).

(2)

A longer example, same principle as above.

  • 鹿児島県熊毛郡屋久島町宮之浦 [Kagoshima, "Bear Fur"-District, City of Yakushima, "Shrine Bay"]
  • かごしまけんくまげぐんやくしまちょうみやのうら
  • 鹿児島県,熊毛郡,屋久島町,宮之浦
  • [鹿児島県,かごしまけん], [熊毛郡,くまげぐん], [屋久島町,やくしまちょう], [宮之浦,みやのうら]

You can implement this by simply using the output from Step 3, joining the MOONGLYPHs corresponding to the parts.
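The joining step might be sketched as follows. The function `group_by_parts` is a hypothetical helper, and the per-glyph split of the example reading is an assumption made for illustration only.

```python
def group_by_parts(solution, parts):
    """Merge per-glyph (glyph, reading) pairs into per-PART pairs."""
    grouped, i = [], 0
    for part in parts:
        chunk = solution[i:i + len(part)]
        # The PARTs, joined in order, must reproduce the MOONGLYPHs.
        if "".join(glyph for glyph, _ in chunk) != part:
            raise ValueError(f"parts do not line up at {part!r}")
        grouped.append((part, "".join(reading for _, reading in chunk)))
        i += len(part)
    return grouped

# Assumed per-glyph result of Step 3 for 岩手県宮古市.
solution = [("岩", "いわ"), ("手", "て"), ("県", "けん"),
            ("宮", "みや"), ("古", "こ"), ("市", "し")]
print(group_by_parts(solution, ["岩手県", "宮古市"]))
```

This assumes every entry in the solution covers exactly one glyph; entries spanning several glyphs would need a length-aware slice.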

###Parsing the dictionary

An entry from KANJIDIC looks like this:

月 376E N2169 [...] L13 Yyue4 Wweol ゲツ ガツ つき T1 おと がっ す ずき もり {moon}

You don't need most of this information. All you need is the first column containing the MOONGLYPH; and all columns with KATAKANA or HIRAGANA - the READINGS. The entry above tells us that 月 can be read as either ゲツ, ガツ, つき, おと, がっ, す, ずき, or もり.

A reading may be specified by either KATAKANAs or HIRAGANAs, they are to be treated as equivalent. You probably want to change the case to either all HIRAGANA or KATAKANA.
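One way to normalize everything to HIRAGANA is to rely on the layout of the Unicode blocks, where each KATAKANA from ァ to ヶ sits exactly 0x60 code points after its HIRAGANA counterpart. The helper name `to_hiragana` is an assumption.

```python
def to_hiragana(text: str) -> str:
    """Map KATAKANA ァ..ヶ to HIRAGANA; leave every other character alone."""
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )

print(to_hiragana("ゲツ"))      # げつ
print(to_hiragana("センター"))  # せんたー
```

The long vowel mark ー lies outside the shifted range and is therefore kept unchanged.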

KATAKANAs correspond to Chinese ON readings, HIRAGANA to native Japanese kun readings. This classification will be used by Feature 8, ignore it for now. You may want to save this meta-information for later, however.

行 U884c Yxing4 [...] Whang コウ ギョウ アン い.く ゆ.く -ゆ.き -ゆき -い.き -いき
おこな.う おこ.なう T1 いく なみ なめ みち ゆき ゆく {going} {journey} {carry out}

Strip the dashes - from each reading. Take only the part before the dot . if one is present. Thus, the set of all possible readings for the above entry is:

コウ, ギョウ, アン, い, ゆ, ゆき, いき, おこな, おこ, いく, なみ, なめ, みち, ゆく

See the documentation linked in the dictionary section for details.
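The parsing rules above could be sketched like this. It is a minimal take that treats any whitespace-separated field made only of kana, dots, and an optional leading or trailing dash as a reading; `parse_entry` and the regular expression are assumptions, and the sample line omits the fields elided as [...] in the text.

```python
import re

# A field counts as a reading if it contains only kana and dots,
# optionally wrapped in dashes (prefix or suffix markers).
KANA_FIELD = re.compile(r"^-?[\u3041-\u3096\u30a1-\u30fa\u30fc.]+-?$")

def parse_entry(line: str):
    """Return (glyph, readings) for one KANJIDIC line."""
    fields = line.split("{")[0].split()  # drop the {meaning} columns
    glyph, readings = fields[0], []
    for field in fields[1:]:
        if not KANA_FIELD.match(field):
            continue
        # Strip dashes, keep only the part before the first dot.
        reading = field.strip("-").split(".")[0]
        if reading and reading not in readings:
            readings.append(reading)
    return glyph, readings

line = "月 376E N2169 L13 Yyue4 Wweol ゲツ ガツ つき T1 おと がっ す ずき もり {moon}"
print(parse_entry(line))
```

Readings are kept in their original script here so that the ON and kun classification is not lost; converting to HIRAGANA can happen afterwards.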


##Feature 1

Explanation of Rendaku (Voicing).

The initial syllable of a MOONGLYPH other than the first one may get voiced, eg T -> D or K -> G. か (ka) becomes が (ga), and と (to) becomes ど (do). This is a complex phenomenon, but to keep it simple, we are going to assume that this voicing may always occur, except for a KANA in initial position.

(1)

  • 東田川郡
  • ひがしたがわぐん [hi-ga-shi-ta-ga-wa-gun]
  • 東,田川郡
  • [東,ひがし], [田川郡,たがわぐん]

In this example, 川 only possesses the reading かわ [kawa]. In combination with other words or parts, it can voice to がわ [gawa]. Thus the reading たがわぐん [ta-ga-wa-gun] becomes possible for 田川郡.

(2)

  • 辺地町
  • へじまち [he-ji-ma-chi]
  • 辺地町
  • [辺地町,へじまち]

The reading ち [chi] for 地 becomes voiced to form じ [ji]. Note that ち can become either じ [ji] or ぢ [dji].

(3)

  • 花見 (flower viewing, especially cherry trees)
  • ばなみ [ba-na-mi]
  • 花,見
  • no_match

花 can be read はな (hana), but not ばな (bana). Voicing (は -> ば) is not possible, because 花 occurs in initial position.

Here is a list of all possible alternations introduced by this Rendaku voicing (JSON):

{
  "か": [
    "が"
  ],
  "き": [
    "ぎ"
  ],
  "く": [
    "ぐ"
  ],
  "け": [
    "げ"
  ],
  "こ": [
    "ご"
  ],
  "さ": [
    "ざ"
  ],
  "し": [
    "じ"
  ],
  "す": [
    "ず"
  ],
  "せ": [
    "ぜ"
  ],
  "そ": [
    "ぞ"
  ],
  "た": [
    "だ"
  ],
  "ち": [
    "ぢ",
    "じ"
  ],
  "つ": [
    "づ",
    "ず"
  ],
  "て": [
    "で"
  ],
  "と": [
    "ど"
  ],
  "は": [
    "ば",
    "ぱ"
  ],
  "ひ": [
    "び",
    "ぴ"
  ],
  "ふ": [
    "ぶ",
    "ぷ"
  ],
  "へ": [
    "べ",
    "ぺ"
  ],
  "ほ": [
    "ぼ",
    "ぽ"
  ]
}
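With this table, the extra branches can be generated per reading. The sketch below stores the same data as a compact Python dict and expands one dictionary reading into itself plus its voiced variants; `reading_variants` and the `is_first` flag are illustrative names.

```python
# Same alternations as the JSON above; each value lists the voiced options.
VOICED = {
    "か": "が", "き": "ぎ", "く": "ぐ", "け": "げ", "こ": "ご",
    "さ": "ざ", "し": "じ", "す": "ず", "せ": "ぜ", "そ": "ぞ",
    "た": "だ", "ち": "ぢじ", "つ": "づず", "て": "で", "と": "ど",
    "は": "ばぱ", "ひ": "びぴ", "ふ": "ぶぷ", "へ": "べぺ", "ほ": "ぼぽ",
}

def reading_variants(reading: str, is_first: bool):
    """Return the reading plus every Rendaku variant of its first KANA."""
    variants = [reading]
    # No voicing for the MOONGLYPH in initial position.
    if not is_first and reading and reading[0] in VOICED:
        variants += [v + reading[1:] for v in VOICED[reading[0]]]
    return variants

print(reading_variants("かわ", is_first=False))  # ['かわ', 'がわ']
print(reading_variants("はな", is_first=True))   # ['はな']
```

In the tree search, each variant simply becomes one more branch for the glyph in question.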

##Feature 2

Various orthographical punctuation symbols.

As I mentioned, your program does not need to handle unmatched punctuation. Thus, examples (2) and (3) are not valid input.

(1)

  • 瑞穂町(川澄)
  • みずほちょう(かわすみ)
  • 瑞穂町,(川澄)
  • [瑞穂町,みずほちょう], [(川澄),(かわすみ)]

( and ) simply get copied over to the output.

(2)

  • 瑞穂町(川澄)
  • みずほちょうかわすみ
  • 瑞穂町,(川澄)
  • undefined

Invalid input, because READING does not contain any parentheses ().

(3)

  • 瑞穂町(川澄)
  • (みずほちょう)かわすみ
  • 瑞穂町,(川澄)
  • undefined

Invalid input, because the parentheses occur at different positions in MOONGLYPHs and READING. Depending on how you implement it, your program may or may not output something meaningful for example (2) and (3).

〒 is the Japanese sign for zip or postal code. 『』 are Japanese quotation marks. 【】 are alternative quotation marks. ・ is a separator, similar to a slash /.


##Feature 3

Add support for the counter and old genitive marker ヶ, ケ, and ヵ.

They are not listed in KANJIDIC.

Consider the following three glyphs: ヶ, ケ, and ヵ. They may possess the following readings (ie branches):

  • か [ka]
  • が [ga]

In addition, ヶ and ケ can also be read as

  • げ [ge]
  • こ [ko]

Note that, although ケ is the KATAKANA ke and thus pronounced ke, it is sometimes used instead of ヶ because of its graphical similarity, and can therefore take the readings listed above as well. Supporting KANAs in general is part of Feature 6.

Example:

(1)

  • 戦場ヶ原 [Senjougahara, a character from BakeMonoGatari]
  • せんじょうがはら
  • 戦場,ヶ,原
  • [戦場,せんじょう],[ヶ,が],[原,はら]

戦場 literally means battlefield, and 原 means plains or field. ヶ is an old genitive or possession marker, and thus Senjōgahara can be translated as 'battlefield' but refers to a mythical battle of Mountain Gods, and not to any historical one.

(2)

  • 金ケ崎町
  • かねがさきちょう
  • 金, ケ, 崎, 町
  • [金,かね], [ケ,が], [崎,さき], [町,ちょう]

(3)

  • 一ヵ月 [a period of one month, 一 = one, 月 = month]
  • いっかげつ
  • 一,ヵ,月
  • [一,いっ], [ヵ,か], [月,げつ]

##Feature 4

Support omitted genitive markers between MOONGLYPHS. In old times, people didn't like to write them down, because MOONGLYPHs-only is way cooler.

(1)

  • 九戸郡
  • くのへぐん
  • 九, 戸, 郡
  • [九,く], [<empty>,の], [戸,へ], [郡,ぐん]

(2)

  • 甲斐守町
  • かいのかみちょう
  • 甲斐, 守, 町
  • [甲斐,かい], [<empty>,の], [守,かみ], [町,ちょう]

Technically, we would expect the spelling to be 九の戸郡. However, の gets omitted at times, especially in proper nouns.

By the same token, が gets omitted at times as well, the mechanics are exactly like の.

(3)

  • 輿岡町 (City of Koshigaoka, "Palanquin Hill")
  • こしがおかちょう [ko-shi-ga-o-ka-cho-u]
  • 輿, 岡, 町
  • [輿,こし], [<empty>,が], [岡,おか], [町,ちょう]
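A possible way to model the omitted markers is an extra branch between two glyphs that consumes の or が from the READING without consuming a glyph. In the sketch below the empty string stands for `<empty>`; the toy `READINGS` table, `OMITTED_MARKERS`, and the rule of at most one marker per gap are assumptions.

```python
# Hypothetical toy dictionary for example (1).
READINGS = {
    "九": ["きゅう", "く"],
    "戸": ["こ", "と", "へ"],
    "郡": ["ぐん"],
}
OMITTED_MARKERS = ["の", "が"]

def all_matches(glyphs, reading):
    """Yield solutions; ("", marker) entries are omitted genitive markers."""

    def search(i, pos, path):
        if i == len(glyphs):
            if pos == len(reading):
                yield list(path)
            return
        # Between two glyphs, try one omitted marker (never two in a row).
        if i > 0 and path[-1][0] != "":
            for marker in OMITTED_MARKERS:
                if reading.startswith(marker, pos):
                    path.append(("", marker))
                    yield from search(i, pos + len(marker), path)
                    path.pop()
        for r in READINGS.get(glyphs[i], []):
            if reading.startswith(r, pos):
                path.append((glyphs[i], r))
                yield from search(i + 1, pos + len(r), path)
                path.pop()

    yield from search(0, 0, [])

print(list(all_matches("九戸郡", "くのへぐん")))
```

Because the marker branch is tried everywhere, it can produce extra solutions for some inputs; ranking them lower than marker-free solutions may be worthwhile.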

##Feature 5

Japanese Numerals

Add support for multi-digit numbers, when the READING includes the pronunciation of the number.

The easiest way to handle this is by converting the number to its MOONGLYPH representation (eg 120 => 百二十), and then proceeding as normal.

  • 2050番地
  • にせんごじゅうばんち
  • 2050,番地
  • [2050,にせんごじゅう], [番地,ばんち]

Convert 2050 to 二千五十; the readings are found in the dictionary.

Just like we can say one-hundred two or hundred-two, 一 (1) is optional before 百 (hundred), 千 (thousand), 万 (1E4), and 億 (1E8). That is, a number like 1111 may be equivalent to any of the following: 千百十一, 一千百十一, 千一百十一, or 一千一百十一.
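The conversion, including the optional 一, might be sketched as below. It handles positive numbers below 10^12 and assumes that 一 is never written before 十, which the text does not state either way; all names are illustrative.

```python
from itertools import product

DIGITS = "〇一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]
BIG_UNITS = ["", "万", "億"]

def group_variants(n: int):
    """All spellings of 0 < n < 10000, with 一 optional before 百 and 千."""
    options = []
    for power in (3, 2, 1, 0):
        digit = n // 10 ** power % 10
        if digit == 0:
            continue
        unit = SMALL_UNITS[power]
        if digit == 1 and power >= 2:
            options.append([unit, "一" + unit])  # 百 or 一百, 千 or 一千
        elif digit == 1 and power == 1:
            options.append([unit])               # assumed: always plain 十
        else:
            options.append([DIGITS[digit] + unit])
    return ["".join(choice) for choice in product(*options)]

def kanji_variants(n: int):
    """All MOONGLYPH spellings of 0 < n < 10**12 under the rules above."""
    if not 0 < n < 10 ** 12:
        raise ValueError("number out of range for this sketch")
    options = []
    for idx in (2, 1, 0):
        group = n // 10 ** (4 * idx) % 10 ** 4
        if group == 0:
            continue
        big = BIG_UNITS[idx]
        spellings = [s + big for s in group_variants(group)]
        if group == 1 and big:
            spellings.append(big)  # 一 is optional before 万 and 億
        options.append(spellings)
    return ["".join(choice) for choice in product(*options)]

print(kanji_variants(2050))  # ['二千五十']
print(kanji_variants(1111))
```

Each variant can then be fed to the normal tree search in place of the digit string.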


##Feature 6

Support KANA, including the now deprecated ゑ, ゐ, ヱ, and ヰ.

Almost all kana can be ignored, like the punctuation symbols from Feature 2.

Or to think about it another way, almost all KANAs represent themselves, and are read that way. に (ni) is read に (ni), and ロ (ro) is read ろ (ro). Please remember that all READINGs are always given in HIRAGANA, but that is a convention for this challenge.

There are a few exceptions that date from before the orthographical reforms.

The four KANA ゑ (we), ゐ (wi), ヱ (we), and ヰ (wi) are read as え (e), い (i), え (e), and い (i), respectively.

Please note that while は (ha) can also be pronounced wa, へ (he) as e, and を (wo) as o, they will always appear as は, へ, and を in the READING, because these three exceptions still occur in contemporary Japanese. This means that you only need to care about the four kana above.

(1)

  • 流通センター
  • りゅうつうせんたー
  • 流通, センター
  • [流通,りゅうつう], [センター,せんたー]

The READING of the KATAKANAs センター is given by the HIRAGANAs せんたー.

(2)

For example, ゑ is read え.

(3)

  • 月への旅 [journey to the moon]
  • つきへのたび
  • 月,への,旅
  • [月,つき], [への,への], [旅,たび]

月への旅 is pronounced as tsu-ki-e-no-ta-bi, but in modern Japanese orthography, this is still written as つきへのたび [tsu-ki-he-no-ta-bi]. Thus, へ does not require any special treatment.


##Feature 7

KANJIDIC2 is EUC-JP encoded. Many of these kanji are rarely used anymore, but they may appear at times, especially in proper names. Make sure your encoding supports them.

  • (archaic word and moonglyph meaning you)
  • なんじ
  • [,なんじ]

##Feature 8

Each reading of a MOONGLYPH is categorized as either on (taken from Chinese) or kun (native Japanese word). As a rule of thumb, words are read either with on or kun readings only (exceptions abound). If there are multiple results, sort them such that those with more on-on and kun-kun readings are listed before mixed on-kun and kun-on readings. This is going to be a heuristic, and requires a metric, see above.

Example:

  • 死糟鈴 [again, made up word because these cases are rare]
  • しぬかすず
  • 死, 糟, 鈴
  • [[死,しぬ], [糟,かす], [鈴,ず]], [[死,し], [糟,ぬか], [鈴,すず]]

Both readings for 糟 and 鈴 are KUN readings. For 死, しぬ is a KUN reading, and し an ON reading. Thus, the likelihood is computed as follows:

  • KUN-KUN-KUN = 3 (しぬ,かす,ず)
  • ON-KUN-KUN = 1 (し,ぬか,すず)

In the dictionary files, on readings are given as KATAKANA, kun readings as HIRAGANA.
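The exact metric is not spelled out here. One metric that happens to reproduce the numbers 3 and 1 from the example is the count of glyph pairs that share the same kind of reading, sketched below; this is an assumption, and counting only adjacent pairs would be another reasonable choice.

```python
from itertools import combinations

def reading_kind(dictionary_reading: str) -> str:
    """Classify a raw dictionary reading: KATAKANA means on, otherwise kun."""
    return "on" if "ァ" <= dictionary_reading[0] <= "ヺ" else "kun"

def same_kind_pairs(kinds) -> int:
    """Count pairs of readings (any two positions) that have the same kind."""
    return sum(a == b for a, b in combinations(kinds, 2))

print(same_kind_pairs(["kun", "kun", "kun"]))  # 3
print(same_kind_pairs(["on", "kun", "kun"]))   # 1

# Higher scores first, so pure on or pure kun solutions are listed earlier.
candidates = [["on", "kun", "kun"], ["kun", "kun", "kun"]]
print(sorted(candidates, key=same_kind_pairs, reverse=True))
```

Note that `reading_kind` has to see the reading as written in the dictionary, before any conversion to HIRAGANA.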


##Feature 9

The MOONGLYPH doubling sign 々. A word such as 木木 (trees) can be written as 木々. It may, but usually doesn't, appear multiple times. 木々 should be treated as equivalent to 木木, 人々 (people) as equivalent to 人人, and 藤原々々 as equivalent to 藤原原原 etc.

When the MOONGLYPH repeater 々 occurs m*n times, this may also be equivalent to the last n MOONGLYPHs being repeated m more times. In real-world examples, cases other than m=1, and especially cases with n=2, are quite rare, but they may occur. The examples 木々 and 人々 above correspond to n=1 and m=1.

Consider 終了々々. The word 終了 consists of two characters and means (I am) finished. By adding two repeaters, 々々, the last n=2 MOONGLYPHs get repeated m=1 times. Thus, 終了々々 may also be equivalent to 終了終了. 終了々々々々々々 may be equivalent to 終了終了終了終了.
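One way to handle the repeater is to expand it before the tree search. The sketch below enumerates, for each run of 々, every n that divides the run length and returns all resulting candidate strings; `expand_repeaters` is a hypothetical helper, and mapping the expansion back onto the original PARTs is left out.

```python
import re

def expand_repeaters(text: str):
    """Return every candidate expansion of the runs of 々 in `text`."""
    match = re.search("々+", text)
    if match is None:
        return [text]
    head, tail = text[:match.start()], text[match.end():]
    run = len(match.group())  # run = m * n repeaters
    results = []
    for n in range(1, min(run, len(head)) + 1):
        if run % n == 0:
            # Repeat the last n glyphs m = run // n more times.
            expanded = head + head[-n:] * (run // n)
            results += expand_repeaters(expanded + tail)
    return results

print(expand_repeaters("木々"))      # ['木木']
print(expand_repeaters("終了々々"))  # ['終了了了', '終了終了']
```

Each candidate is then matched against the READING as usual, and candidates without a match are discarded.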

(1)

  • 鹿島代々木町 ["Deer" Island, City of Yoyogi "Tree of Ages"]
  • かしまよよぎちょう
  • 鹿島, 代々木町
  • [鹿島,かしま], [代々木町,よよぎちょう]

n=1, m=1.

(2)

  • 人々
  • ひとびと
  • 人,々
  • [人,ひと], [々,びと]

n=1, m=1.

Please note that if you did not implement feature 1, the output would be no match.

(3)

  • 藤原々々 [Japanese Illustrator]
  • ふじわらわらわら
  • 藤, 原々々
  • [藤,ふじ], [原々々,わらわらわら]

n=1, m=2.

(4)

  • 終了々々々々 [emphatic repetition of finish]
  • しゅうりょうしゅうりょうしゅうりょう
  • 終了,々々々々
  • [終了,しゅうりょう], [々々々々,しゅうりょうしゅうりょう]

Here, n=2, m=2.

(5)

  • 終了々々
  • しゅうりょうりょうりょう [not a real reading, only for illustration purposes]
  • 終了,々々
  • [終了,しゅうりょう], [々々,りょうりょう]

Here, n=1 and m=2.


##Feature 10

This feature assumes you have implemented KANAs, Feature 6.

Add support for the voiced kana repeater ゞ. MOONGLYPHs shall not contain more than one ゞ in succession, ie they will never contain ゞゞ or ゞゞゞ, but they may contain かゞみかゞみ.

It shall occur only after a KATAKANA or HIRAGANA that can be, or has been, voiced. かゞみ (kagami) and がゞみ (gagami) are valid input strings, みゞ is not, because み (mi) cannot be voiced. See the data given in Feature 1.

かがみ (kagami, mirror) can be written as かゞみ. ゞ should be treated as the voiced version of the preceding KANA.

(1)

  • かゞみ (mirror)
  • かがみ
  • か,ゞ,み
  • [か,か], [ゞ,が], [み,み]

ゞ occurs after か (ka), which can be voiced to が (ga); and stands for this voiced syllable ga.

(2)

  • ジゞ (old man)
  • じじ (jiji)
  • ジ,ゞ
  • [ジ,じ], [ゞ,じ]

ゞ occurs after ジ (ji), which is the voiced version of シ (shi), and stands for this voiced syllable ji.
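A small sketch of this replacement step follows. It uses the plain voiced forms only (が, not the alternatives ぱ or じ for ち), builds the KATAKANA table by shifting code points by 0x60, and also accepts the KATAKANA repeater ヾ; these choices and the name `expand_voiced_repeater` are assumptions.

```python
UNVOICED = "かきくけこさしすせそたちつてとはひふへほ"
VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼ"

# HIRAGANA pairs, plus the KATAKANA pairs located 0x60 code points later.
VOICE = dict(zip(UNVOICED, VOICED))
VOICE.update({chr(ord(u) + 0x60): chr(ord(v) + 0x60)
              for u, v in zip(UNVOICED, VOICED)})
ALREADY_VOICED = set(VOICE.values())

def expand_voiced_repeater(text: str) -> str:
    """Replace each ゞ (or ヾ) with the voiced form of the preceding KANA."""
    out = []
    for ch in text:
        if ch not in "ゞヾ":
            out.append(ch)
        elif out and out[-1] in VOICE:
            out.append(VOICE[out[-1]])   # か + ゞ -> かが
        elif out and out[-1] in ALREADY_VOICED:
            out.append(out[-1])          # ジ + ゞ -> ジジ
        else:
            raise ValueError("repeater after a KANA that cannot be voiced")
    return "".join(out)

print(expand_voiced_repeater("かゞみ"))  # かがみ
print(expand_voiced_repeater("ジゞ"))    # ジジ
```

After this expansion the string contains only ordinary KANA, so the KANA handling from the earlier feature applies unchanged.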


##Final Feature

Support special gikun readings by implementing words spanning multiple MOONGLYPHs. This means you will need a dictionary file.

大和 (an old name for Japan) is read as やまと (yamato), and it is more than the sum of its parts. This reading cannot be represented as a combination of the individual readings of each MOONGLYPH. The compound 大和 itself is read as やまと.

  • 宮崎市大和町
  • みやざきしやまとちょう
  • 宮崎市, 大和町
  • [宮崎市,みやざきし], [大和町,やまとちょう]

Up until now, you only had to consider the next MOONGLYPH when traversing the tree. This feature implies that you will need to take a look at the next n MOONGLYPHs as well, and add some branches if that word exists in the dictionary.

The tree might look like this:

                                               ちょう
                                             /
                  / わ(WA)   ─ だいわ(DAIWA)   ─ 町 ─ まち
     / だい(DAI) ─ 和
    /          |  \ わら(WARA)─ だいわら(DAIWARA)─ 町 ─ まち
   /           |                             \
  /            |                               ちょう
大             和町 NO_SUCH_ENTRY
| \
|  \
|   \               / ちょう(CHOU) ─ やまとちょう(YAMATOCHOU)
|    \ 大和(YAMATO)─ 町
|                   \ まち(MACHI) ─ やまとまち(YAMATOMACHI)
|
大和町 NO_SUCH_ENTRY
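The lookahead shown in the tree could be sketched as follows: at every position, try not only the single MOONGLYPH but also every longer chunk that has an entry in the word dictionary. `READINGS` and `WORDS` are hypothetical toy tables; real data would come from KANJIDIC and EDICT.

```python
# Hypothetical toy dictionaries.
READINGS = {"大": ["だい", "おお"], "和": ["わ"], "町": ["ちょう", "まち"]}
WORDS = {"大和": ["やまと"]}
MAX_WORD = max(map(len, WORDS))  # longest compound worth looking ahead for

def all_matches(glyphs, reading):
    """Yield solutions whose entries may span one or several MOONGLYPHs."""

    def search(i, pos, path):
        if i == len(glyphs):
            if pos == len(reading):
                yield list(path)
            return
        for n in range(1, min(MAX_WORD, len(glyphs) - i) + 1):
            chunk = glyphs[i:i + n]
            table = READINGS if n == 1 else WORDS
            # A missing entry corresponds to NO_SUCH_ENTRY in the tree.
            for r in table.get(chunk, []):
                if reading.startswith(r, pos):
                    path.append((chunk, r))
                    yield from search(i + n, pos + len(r), path)
                    path.pop()

    yield from search(0, 0, [])

print(list(all_matches("大和町", "やまとちょう")))
```

Looking up every chunk in a hash table keeps the extra cost small, since the number of chunks per position is bounded by the longest dictionary word.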

Dictionary File EDict. Use either edict.gz or edict2.gz (custom format); or JMdict.gz or JMdict_e.gz (xml). The download page also contains links to the documentation of the dictionary format.

As with Kanjidic, you can ignore most of the information. The dictionary contains words and their kana reading. You may filter out any words that contain anything other than MOONGLYPHs. You may also want to filter those entries your program can map to their reading already.