
More parser gold, "the same author/work" references #191

Open
cboulanger opened this issue Aug 11, 2022 · 13 comments
Comments

@cboulanger
Contributor

Here is some more parser gold which needs some more love, because the source references are VERY messy and therefore the manual annotations were not always correct. I did quite a bit of manual correction after converting it from the EXparser format.

zfrsoz-footnotes-corrected.xml.txt

If you spot any obvious mislabelings that could confuse the parser, please let me know. I am happy to repost the material after some more cleaning & correcting.

But here's my question: in German footnote references (and also sometimes in bibliographies), it is common to use backreferences to the previous footnote in the form of "ders." (the same author, male) or "dies." (the same author, female). In bibliographies, this sometimes appears in the form of "______". The previously cited work may also be referred to with "op. cit.", "a.a.O.", etc.

Do you have any opinion on whether/how AnyStyle could handle these cases, or should this be left to post-processing of the CSL data?
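One possible post-processing approach (a sketch, not an AnyStyle feature): if the parser labels "ders."/"dies." as an author token, the resulting CSL item would carry it as a literal name, which could then be replaced with the previous item's authors. The placeholder patterns and the helper name below are assumptions for illustration.

  SAME_AUTHOR = /\A\s*(ders\.|dies\.|_{3,})\s*\z/i

  def resolve_same_author!(items)
    # items: an array of CSL-JSON hashes in citation order
    items.each_cons(2) do |prev, curr|
      authors = curr['author']
      next unless authors && authors.length == 1
      next unless authors.first['literal'].to_s.match?(SAME_AUTHOR)
      curr['author'] = prev['author'] if prev['author'] # copy the previous entry's author(s)
    end
    items
  end

Backreferences like "a.a.O." or "op. cit." point to the whole previous work, not just the author, so they would need more context to resolve generically.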

@cboulanger
Contributor Author

Trying to train a model with this gold, I am getting

INFO [2022-08-11 18:37:56 +0200] wapiti: load patterns
INFO [2022-08-11 18:37:57 +0200] wapiti: initialize model
INFO [2022-08-11 18:37:57 +0200] wapiti: nb train:    1865
INFO [2022-08-11 18:37:57 +0200] wapiti: nb labels:   13
INFO [2022-08-11 18:37:57 +0200] wapiti: nb blocks:   97424
INFO [2022-08-11 18:37:57 +0200] wapiti: nb features: 1274624
INFO [2022-08-11 18:37:57 +0200] wapiti: training model with l-bfgs
ruby: vmath.c:281: xvm_expma: Assertion `r != NULL && ((uintptr_t)r % 16) == 0' failed.

@cboulanger
Contributor Author

Another question: for training, where should the token "in: " go, as in:

  <sequence>
    <author>N. Dimmel: </author>
    <title>Armutspotential zwischen Nichtinanspruchnahmeund Repression, </title>
    <editor>in: R. Teichmann (Hrsg.): </editor>
    <container-title>Sozialhilfe in Österreich, Wien </container-title>
    <date>1989</date>
  </sequence>
  <sequence>
    <author>V. Gessner: </author>
    <title>Rechtssoziologie und Rechtspraxis. Zur Rezeption empirischer Rechtsforschung, </title>
    <journal>in: Soziale Welt </journal>
    <volume>35 (</volume>
    <date>1984)</date>
  </sequence>

I assume it belongs in <editor> and <journal> and not as a suffix to <title>, but please let me know if that's a wrong assumption. Will it be removed by the normalizers?

@cboulanger
Contributor Author

I posted the current version (cleanup is still ongoing) to a gist: https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555

@inukshuk
Owner

Yes, 'in' should definitely go with editors (it's a good marker!). The editor normalizer will strip it off. I'm not sure I've seen it often in the context of journals but we'd obviously follow the same approach there (would have to check if the journal normalizer already strips it though).
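Just to make the stripping explicit, here is a rough sketch of that kind of cleanup (the actual AnyStyle normalizer does more than this, e.g. parsing the names themselves):

  value = 'in: R. Teichmann (Hrsg.): '
  value.sub(/\A\s*in[:,]?\s+/i, '').strip
  # => "R. Teichmann (Hrsg.):"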

@cboulanger
Contributor Author

cboulanger commented Aug 13, 2022

Any idea about the ruby: vmath.c:281: xvm_expma: Assertion 'r != NULL && ((uintptr_t)r % 16) == 0' failed. error?

@inukshuk
Owner

Maybe an empty tag somewhere?
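A quick way to check for that (a sketch using Ruby's bundled REXML library; the file name is just an example):

  require 'rexml/document'

  doc = REXML::Document.new(File.read('zfrsoz-footnotes.xml'))
  doc.get_elements('//sequence').each_with_index do |seq, i|
    if seq.elements.empty? && seq.text.to_s.strip.empty?
      puts "empty <sequence/> at position #{i + 1}"
    end
  end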

@cboulanger
Contributor Author

Is there a chance you could try to train a parser model with https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555 to see if you get the error as well or if it is just my setup?

@cboulanger
Contributor Author

cboulanger commented Aug 15, 2022

Trying to train a model with this gold, I am getting

INFO [2022-08-11 18:37:56 +0200] wapiti: load patterns
INFO [2022-08-11 18:37:57 +0200] wapiti: initialize model
INFO [2022-08-11 18:37:57 +0200] wapiti: nb train:    1865
INFO [2022-08-11 18:37:57 +0200] wapiti: nb labels:   13
INFO [2022-08-11 18:37:57 +0200] wapiti: nb blocks:   97424
INFO [2022-08-11 18:37:57 +0200] wapiti: nb features: 1274624
INFO [2022-08-11 18:37:57 +0200] wapiti: training model with l-bfgs
ruby: vmath.c:281: xvm_expma: Assertion `r != NULL && ((uintptr_t)r % 16) == 0' failed.

Any idea how I could debug this? I tried to get an extended stack trace, but to no avail. It would be so nice if I could get these two new XML training docs (1, 2) working with AnyStyle.

@inukshuk
Owner

Looking only at the first of the linked datasets above, there are a few issues that cause wapiti to bail out. If you want to debug the native module you need to attach gdb; however, if a NULL assertion fails, it's almost always because you have an empty tag somewhere. In your dataset there are two empty <sequence/> tags, and the file also includes two <dataset> elements, which is not supported.

Here's a diff to fix the first dataset:

*** /home/dupin/Downloads/zfrsoz-footnotes.xml	2022-08-17 11:05:56.104535376 +0200
--- zfrsoz-footnotes.xml	2022-08-17 11:36:27.720096975 +0200
***************
*** 6290,6296 ****
      <note>Mainz</note>
      <date>1982</date>
    </sequence>
-   <sequence/>
    <sequence>
      <author>Ministerium für Arbeit, Gesundheit und Sozialordnung:</author>
      <title>Die Situation der Frau in Baden-Württemberg,</title>
--- 6290,6295 ----
***************
*** 12850,12857 ****
      <volume>23/März</volume>
      <date>1990</date>
    </sequence>
- </dataset><?xml version='1.0' encoding='UTF-8'?>
- <dataset>
    <sequence>
      <editor>Armer/Grimshaw (Hrsg.), </editor>
      <title>Comparative Social Research Methodological Problems and Strategies (New York, London, Sydney, Tokio </title>
--- 12849,12854 ----
***************
*** 19142,19148 ****
      <note>Mainz </note>
      <date>1982</date>
    </sequence>
-   <sequence/>
    <sequence>
      <author>Ministerium für Arbeit, Gesundheit und Sozialordnung: </author>
      <title>Die Situation der Frau in Baden-Württemberg, </title>
--- 19139,19144 ----
***************
*** 25702,25705 ****
      <volume>23/März </volume>
      <date>1990</date>
    </sequence>
! </dataset>
\ No newline at end of file
--- 25698,25701 ----
      <volume>23/März </volume>
      <date>1990</date>
    </sequence>
! </dataset>

As a general observation, those datasets are very large. It's my feeling that it's better to have a smaller set with fewer inconsistencies than a larger set with more errors, though I don't have hard evidence to back this up. Smaller datasets make for quicker training, so that's definitely a point in favor of a smaller model. What I'd suggest, if you have such large sets, is to train on only a small subset first, then use that model to check the rest of the data. If there's a high error rate I'd make the training set larger. Once the error rate is low I'd only pick out those sequences that produce errors and add only those to the training set (or review them first, because errors often point to inconsistencies in the marked-up data).

Finally, as a general tip, you can usually spot errors in large datasets quickly by using a binary search approach: keep training with one half of the dataset until there's no error. This way you can usually limit the faulty section to a small set that's easily reviewable.
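A sketch of one halving step (standard-library REXML; the file names are examples) that writes each half of the <sequence> elements to its own file, so each half can be trained separately:

  require 'rexml/document'

  doc  = REXML::Document.new(File.read('zfrsoz-footnotes.xml'))
  seqs = doc.get_elements('//sequence')
  half = seqs.size / 2

  { 'half-1.xml' => seqs[0...half], 'half-2.xml' => seqs[half..-1] }.each do |name, part|
    out = REXML::Document.new('<?xml version="1.0" encoding="UTF-8"?><dataset/>')
    part.each { |seq| out.root.add_element(seq.deep_clone) }
    File.open(name, 'w') { |f| out.write(f) }
  end

Repeating this on whichever half still triggers the assertion quickly narrows the problem down to a handful of sequences.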

@cboulanger
Contributor Author

Thanks so much for looking into it, and I am embarrassed that the XML contained junk - I did check for empty tags (but not on the <sequence> node) and I did try to validate, but I must have used the wrong tool for it! Maybe in some future version a validation step could be added that would immediately raise an error about invalid XML.
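For what it's worth, a bare-bones well-formedness check is possible with the standard library alone (a sketch; the file name is an example):

  require 'rexml/document'

  begin
    REXML::Document.new(File.read('zfrsoz-footnotes.xml'))
    puts 'XML is well-formed'
  rescue REXML::ParseException => e
    puts "invalid XML: #{e.message}"
  end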

I'll break up the large xml into smaller parts based on the discipline (there's computer science, natural sciences, and social sciences in it), which might allow some interesting tests of the performance of a domain-specific vs. general-purpose dataset.

@cboulanger
Contributor Author

The multiple root problem was actually a copy/paste error when uploading the data as a gist, sorry. But removing the empty <sequence/> node and splitting up the big xml into three smaller ones did the trick! Thank you very much. All models are now trained!

I've put the individual parser training files in here:

I've put a lot of work into cleaning up and fixing the annotations, throwing out a large number of sequences which were poorly annotated. So at least in theory, the annotations should be of fairly high quality.

@cboulanger
Contributor Author

OK, the performance of this material, at least measured against gold.xml, isn't that great:

Model file test/models/parser-excite-computer-science.mod:
Checking gold.xml.................1252 seq  75.01%   5524 tok 15.26%  4s
Checking excite-computer-science.x  54 seq   1.48%    127 tok  0.13% 11s
Model file test/models/parser-excite-natural-science.mod:
Checking gold.xml.................1275 seq  76.39%   5958 tok 16.46%  4s
Checking excite-natural-science.xm   6 seq   0.79%     20 tok  0.09%  2s
Model file test/models/parser-excite-social-science.mod:
Checking gold.xml................. 945 seq  56.62%   3437 tok  9.49%  4s
Checking excite-social-science.xml 139 seq   2.82%    271 tok  0.26% 12s
Model file test/models/parser-zfrsoz-footnotes.mod:
Checking gold.xml.................1620 seq  97.06%   9073 tok 25.06%  4s
Checking zfrsoz-footnotes.xml..... 113 seq   5.97%    232 tok  0.84%  3s

The consistency of the annotations seems to be quite good, as seen when the model is checked against its own training material.

@inukshuk
Owner

Well, those datasets differ considerably from the data in gold.xml, so I wouldn't expect them to match it very well. I'd definitely check out the inconsistencies (by creating a delta dataset), because you have a few hundred inconsistently labeled references there. For comparison, between gold.xml and core.xml we usually have only a handful (and those are often difficult cases like container-title vs. journal or director vs. editor, etc.).

That said, if you're looking for a combined model that gives good results for both datasets, I'd add somewhere between 50 and 250 footnote references (aiming for a representative sample, of course) to the core set and use that to train the model, adding more footnote references as necessary.
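A sketch of how such a combined set could be put together (the file names and sample size are assumptions, and a hand-picked sample may be more representative than a purely random one):

  require 'rexml/document'

  core = REXML::Document.new(File.read('core.xml'))
  foot = REXML::Document.new(File.read('zfrsoz-footnotes.xml'))

  foot.get_elements('//sequence').sample(150).each do |seq|
    core.root.add_element(seq.deep_clone)
  end

  File.open('core-combined.xml', 'w') { |f| core.write(f) }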
