should we refactor for better performance #28

Open
asimkt opened this issue Dec 12, 2022 · 2 comments
Labels
Type: Documentation Documentation only changes

Comments

asimkt commented Dec 12, 2022

Since Intl.Segmenter (link) is available to most users, I think the documentation should state which extra features [sentence-splitter](https://github.com/azu/sentence-splitter) provides. It would also help to recommend one over the other for specific scenarios, so that users can make an informed decision.
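
To make the "available to most users" point concrete, here is a minimal sketch of a runtime check. The helper name `segmenterSupport` is hypothetical and not part of any library. It relies only on the standard `Intl.Segmenter` constructor and its static `supportedLocalesOf` method.

```javascript
// Sketch: report whether Intl.Segmenter exists in this runtime and
// whether it claims support for a given locale.
// `segmenterSupport` is a hypothetical helper name used for illustration.
function segmenterSupport(locale) {
  const available =
    typeof Intl !== "undefined" && typeof Intl.Segmenter === "function";
  if (!available) {
    return { available: false, localeSupported: false };
  }
  // supportedLocalesOf returns the subset of requested locales that
  // the runtime supports without falling back to the default locale.
  const localeSupported =
    Intl.Segmenter.supportedLocalesOf([locale]).length > 0;
  return { available, localeSupported };
}

console.log(segmenterSupport("ja-JP"));
```

A project could use a check like this to decide whether to rely on the built-in segmenter or on a library, though what counts as "supported" depends on the locale data shipped with the runtime.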

@azu azu added the Type: Documentation Documentation only changes label Dec 14, 2022
Member

azu commented Dec 14, 2022

Intl.Segmenter is a tokenizer: it splits text into words (tokens).
So it does not split text into sentences.

segments excalidraw

https://excalidraw.com/#json=PKbVfkc-JwDScvZHN8Ysu,jgqaHTMS03q6BsvQRYO1dg

@danielweck

> Intl.Segmenter is a tokenizer: it splits text into words (tokens).

This statement is correct ... but paints an incomplete picture :)

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter/Segmenter#granularity

The `granularity` option accepts the value `"sentence"` (not just `"word"`).


```javascript
// Word granularity: each segment is a word or a run of punctuation/whitespace.
const text = "吾輩は猫である。名前はたぬき。";
const japaneseWordSegmenter = new Intl.Segmenter("ja-JP", { granularity: "word" });
[...japaneseWordSegmenter.segment(text)].forEach((s) => console.log(JSON.stringify(s, null, 4)));
```

==>

{
    "segment": "吾輩",
    "index": 0,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "は",
    "index": 2,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "猫",
    "index": 3,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "で",
    "index": 4,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "ある",
    "index": 5,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "。",
    "index": 7,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": false
}
{
    "segment": "名前",
    "index": 8,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "は",
    "index": 10,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "たぬき",
    "index": 11,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": true
}
{
    "segment": "。",
    "index": 14,
    "input": "吾輩は猫である。名前はたぬき。",
    "isWordLike": false
}
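
The `isWordLike` flag in the output above can be used to drop punctuation and whitespace segments. A minimal sketch follows, where the helper name `extractWords` is an assumption made for illustration:

```javascript
// Sketch: keep only word-like segments from a word-granularity segmenter.
// `extractWords` is a hypothetical helper, not part of Intl or sentence-splitter.
function extractWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // false for punctuation such as "。"
    .map((s) => s.segment);
}

console.log(extractWords("Hello, world!", "en"));
```

The exact word boundaries for Japanese depend on the dictionary data in the runtime, so results may differ between engines and versions.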

```javascript
// Sentence granularity: each segment is one sentence.
const text = "吾輩は猫である。名前はたぬき。";
const japaneseSentenceSegmenter = new Intl.Segmenter("ja-JP", { granularity: "sentence" });
[...japaneseSentenceSegmenter.segment(text)].forEach((s) => console.log(JSON.stringify(s, null, 4)));
```

==>

{
    "segment": "吾輩は猫である。",
    "index": 0,
    "input": "吾輩は猫である。名前はたぬき。"
}
{
    "segment": "名前はたぬき。",
    "index": 8,
    "input": "吾輩は猫である。名前はたぬき。"
}
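
Building on the sentence output above, here is a rough sketch of a small wrapper that returns trimmed sentences together with their offsets in the input. The name `splitIntoSentences` and the `{ text, start, end }` result shape are assumptions chosen for illustration. They are not the sentence-splitter API.

```javascript
// Sketch: split text into sentences with Intl.Segmenter and record
// where each trimmed sentence sits in the original string.
// `splitIntoSentences` and its result shape are hypothetical.
function splitIntoSentences(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "sentence" });
  const sentences = [];
  for (const { segment, index } of segmenter.segment(text)) {
    const trimmed = segment.trim();
    if (trimmed.length === 0) continue; // skip whitespace-only segments
    // Account for leading whitespace removed by trim().
    const leading = segment.length - segment.trimStart().length;
    const start = index + leading;
    sentences.push({ text: trimmed, start, end: start + trimmed.length });
  }
  return sentences;
}

console.log(splitIntoSentences("吾輩は猫である。名前はたぬき。", "ja-JP"));
```

Sentence segments usually include trailing whitespace, which is why the sketch trims each one. It follows the runtime's default sentence-break rules, so abbreviations and similar edge cases may still be split in ways a dedicated library handles differently.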


3 participants