Unicode Supplementary Character issues #48

Open
carrielui opened this issue Feb 1, 2017 · 2 comments

carrielui commented Feb 1, 2017

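Some characters are valid Unicode and encode correctly in UTF-8, but after I add a word containing one of them to the user dictionary, loading the dictionary fails with this error:
../inst/include/lib/DictTrie.hpp:130 ERROR Decode 的𠝹刀 failed.
The character that triggers the failure is 𠝹. Its properties are: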

General category: Lo - Letter, other
Canonical combining class: 0 - Spacing, split, enclosing, reordrant, & Tibetan subjoined
Bidirectional category: L - Left-to-right
Unicode version:
As text: 𠝹
Decimal: 132985
HTML escape: &#132985;
URL escape: %F0%A0%9D%B9
Unicode block: CJK Unified Ideographs Extension B
Script group: undefined
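
The properties above can be cross-checked in base R. The following is a minimal sketch that only confirms the arithmetic: the decimal code point 132985 is U+20779, it lies above U+FFFF (so it is a supplementary character), and its UTF-8 form is the four bytes shown in the URL escape. The variable names are illustrative, and nothing here reflects how jiebaR or DictTrie.hpp decodes input internally.

```r
# Code point taken from the "Decimal" field above.
cp <- 132985L

# Build the character from its code point so the example does not
# depend on the encoding of the source file.
ch <- intToUtf8(cp)

# Hexadecimal form of the code point.
hex <- sprintf("U+%X", cp)

# Supplementary characters are those above U+FFFF, outside the
# Basic Multilingual Plane.
is_supplementary <- cp > 0xFFFF

# Raw UTF-8 bytes of the character, and the matching URL escape.
bytes <- charToRaw(ch)
url_escape <- paste0("%", toupper(as.character(bytes)), collapse = "")

print(hex)
print(is_supplementary)
print(bytes)
print(url_escape)
```

A character in this range needs four bytes in UTF-8, whereas the neighbouring characters 的 and 刀 need three each, so a word such as 的𠝹刀 mixes both lengths in one dictionary entry.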

qinwf added the bug label Feb 2, 2017

qinwf (Owner) commented Feb 2, 2017

Thanks for the report. I can reproduce it. Working on it.

qinwf pushed a commit that referenced this issue Feb 2, 2017

qinwf commented Feb 2, 2017

The master branch has been updated. You can verify the fix with this test case:

library(jiebaR)
library(testthat)

test_that("#48", {
  dd = tempfile()
  writeLines("的𠝹刀 n\n\n",con = dd)
  cc = worker(user = dd)
  expect_equal(cc["的𠝹刀"],"的𠝹刀")
})

2 participants