Tokenizer doesn't parse Volume/issue typeset with no space. #212

Open
EmmanuelCharpentier opened this issue May 17, 2023 · 3 comments


@EmmanuelCharpentier

Related to #23: given a reference such as:

1.	Felson DT. Epidemiology of hip and knee osteoarthritis. Epidemiol Rev. 1988;10:1‑28. 

the current parser treats 1988;10:1‑28 as a single token and assigns it to Volume/Issue. It should be approximately:

| Token | Value |
|-------|-------|
| Year | 1988 |
| Volume | 10 |
| Pages | 1-28 |

A worse case:

2.	Heijink A, Gomoll AH, Madry H, Drobnič M, Filardo G, Espregueira-Mendes J, et al. Biomechanical considerations in the pathogenesis of osteoarthritis of the knee. Knee Surg Sports Traumatol Arthrosc. mars 2012;20(3):423‑35. 

is parsed as:

| Token | Value |
|-------|-------|
| Citation number | 2 |
| Author | Heijink A, Gomoll AH, Madry H, Drobnič M, Filardo G, Espregueira-Mendes J, et al. |
| Title | Biomechanical considerations in the pathogenesis of osteoarthritis of the knee |
| Journal | Knee Surg Sports Traumatol Arthrosc mars |
| Date | 2012 |
| Volume/Issue | 20(3):423‑35 |

Again, the whole Volume/Issue token isn't split on its punctuation. I would expect:

| Token | Value |
|-------|-------|
| Citation number | 2 |
| Author | Heijink A, Gomoll AH, Madry H, Drobnič M, Filardo G, Espregueira-Mendes J, et al. |
| Title | Biomechanical considerations in the pathogenesis of osteoarthritis of the knee |
| Journal | Knee Surg Sports Traumatol Arthrosc |
| Date | mars 2012 |
| Volume/Issue | 20(3) |
| Pages | 423‑35 |

Recognizing "mars 2012" (French for March 2012) as a date is probably harder...
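A minimal sketch of one way such a date could be recognized, assuming a small hand-written table of French month names. `FRENCH_MONTHS` and `month_year_to_iso` are hypothetical names used for illustration only; they are not part of the parser, and the table is not exhaustive:

```ruby
# Hypothetical lookup table (an assumption, not the parser's API):
# a few French month names and abbreviations mapped to month numbers.
FRENCH_MONTHS = {
  'janvier' => 1, 'janv' => 1,
  'février' => 2, 'fevrier' => 2, 'févr' => 2,
  'mars' => 3,
  'avril' => 4, 'avr' => 4,
  'mai' => 5,
  'juin' => 6,
  'juillet' => 7, 'juill' => 7, 'juil' => 7,
  'août' => 8, 'aout' => 8,
  'septembre' => 9, 'sept' => 9,
  'octobre' => 10, 'oct' => 10,
  'novembre' => 11, 'nov' => 11,
  'décembre' => 12, 'decembre' => 12, 'déc' => 12
}.freeze

# Returns a "YYYY-MM" string for input shaped like "mars 2012",
# or nil when the shape does not match or the month is unknown.
def month_year_to_iso(text)
  m = /\A\s*(?<month>\p{L}+)\.?\s+(?<year>\d{4})\s*\z/.match(text)
  return nil unless m

  month = FRENCH_MONTHS[m[:month].downcase]
  month && format('%s-%02d', m[:year], month)
end

puts month_year_to_iso('mars 2012')
```

A fixed table like this only covers the languages someone remembers to add, which is part of why this case is harder than splitting on punctuation.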

HTH,

@inukshuk
Owner

Thanks. We should add the examples to the volume normalizer test cases.

@EmmanuelCharpentier
Author

> Thanks. We should add the examples to the volume normalizer test cases.

Are you interested in a larger set of dubious test cases? I have some on hand ;-]...

@inukshuk
Owner

inukshuk commented May 17, 2023

Definitely. Especially if you could provide them in the test case format.

Basically you'd want something like:

'2012;20(3):423‑35.' => { volume: ['20'], issue: ['3'], date: ['2012'], pages: ['423-35'] }
'1988;10:1‑28.' => { volume: ['10'], date: ['1988'], pages: ['1-28'] }
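As a rough illustration of the split these cases describe, here is a minimal sketch using a single regular expression. `FUSED_VOLUME` and `split_fused_volume` are hypothetical names, not the actual normalizer; the sketch assumes the token always has the shape year;volume(issue):pages, with the issue optional:

```ruby
# Hypothetical helper (an assumption, not the real volume normalizer):
# split a fused "year;volume(issue):pages" token into separate fields.
FUSED_VOLUME = /
  \A
  (?<date>\d{4}) \s* ; \s*
  (?<volume>\d+)
  (?: \s* \( (?<issue>[^)]+) \) )?
  \s* : \s*
  (?<pages>\d+ (?: \s* [-\u2010-\u2015] \s* \d+ )? )
  \.? \s*
  \z
/x

def split_fused_volume(token)
  m = FUSED_VOLUME.match(token)
  return nil unless m

  fields = {
    date: [m[:date]],
    volume: [m[:volume]],
    # Normalize Unicode hyphens and dashes in the page range to ASCII "-".
    pages: [m[:pages].gsub(/\s*[-\u2010-\u2015]\s*/, '-')]
  }
  # Only report an issue when one was actually present in parentheses.
  fields[:issue] = [m[:issue]] if m[:issue]
  fields
end

p split_fused_volume("2012;20(3):423\u201135.")
p split_fused_volume("1988;10:1\u201128.")
```

The inputs use the non-breaking hyphen (U+2011) found in the original references, which is why the page range is normalized before being returned.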

And similarly:

'mars 2012;20(3):423‑35.' => { volume: ['20'], issue: ['3'], date: ['2012-03'], pages: ['423-35'] }

This one has another aspect to it, though: we should also add some samples in this style to the core training data.

For example:

<citation-number>2.</citation-number>
<author>Heijink A, Gomoll AH, Madry H, Drobnič M, Filardo G, Espregueira-Mendes J, et al.</author>
<title>Biomechanical considerations in the pathogenesis of osteoarthritis of the knee.</title>
<journal>Knee Surg Sports Traumatol Arthrosc.</journal>
<volume>mars 2012;20(3):423‑35.</volume> 
