
More parser gold, "the same author/work" references #191

Open
cboulanger opened this issue Aug 11, 2022 · 13 comments
Comments

@cboulanger
Contributor

Here is some more parser gold which needs some more love, because the source references are VERY messy and therefore the manual annotations were not always correct. I did quite a bit of manual correction after converting it from the EXparser format.

zfrsoz-footnotes-corrected.xml.txt

If you spot any obvious mislabelings that could confuse the parser, please let me know. I am happy to repost the material after some more cleaning & correcting.

But here's my question: in German footnote references (and also sometimes in bibliographies), it is common to use backreferences to the previous footnote in the form of "ders." (the same author, male) or "dies." (the same author, female). In bibliographies, this sometimes appears in the form of "______". The previously cited work may also be referred to with "op. cit.", "a.a.O.", etc.

Do you have any opinion on whether/how AnyStyle could handle these cases, or should this be left to post-processing of the CSL data?
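One possible post-processing approach (a sketch, not an AnyStyle feature): if the parser labels "ders."/"dies." as an author token, the resulting CSL item would carry it as a literal name, which could then be replaced with the previous item's authors. The placeholder patterns and the helper name below are assumptions for illustration.

  SAME_AUTHOR = /\A\s*(ders\.|dies\.|_{3,})\s*\z/i

  def resolve_same_author!(items)
    # items: an array of CSL-JSON hashes in citation order
    items.each_cons(2) do |prev, curr|
      authors = curr['author']
      next unless authors && authors.length == 1
      next unless authors.first['literal'].to_s.match?(SAME_AUTHOR)
      curr['author'] = prev['author'] if prev['author'] # copy the previous entry's author(s)
    end
    items
  end

Backreferences like "a.a.O." or "op. cit." point to the whole previous work, not just the author, so they would need more context to resolve generically.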

@cboulanger
Contributor Author

Trying to train a model with this gold, I am getting

INFO [2022-08-11 18:37:56 +0200] wapiti: load patterns
INFO [2022-08-11 18:37:57 +0200] wapiti: initialize model
INFO [2022-08-11 18:37:57 +0200] wapiti: nb train:    1865
INFO [2022-08-11 18:37:57 +0200] wapiti: nb labels:   13
INFO [2022-08-11 18:37:57 +0200] wapiti: nb blocks:   97424
INFO [2022-08-11 18:37:57 +0200] wapiti: nb features: 1274624
INFO [2022-08-11 18:37:57 +0200] wapiti: training model with l-bfgs
ruby: vmath.c:281: xvm_expma: Assertion `r != NULL && ((uintptr_t)r % 16) == 0' failed.

@cboulanger
Contributor Author

Another question: for training, where should the token "in: " go, as in:

  <sequence>
    <author>N. Dimmel: </author>
    <title>Armutspotential zwischen Nichtinanspruchnahmeund Repression, </title>
    <editor>in: R. Teichmann (Hrsg.): </editor>
    <container-title>Sozialhilfe in Österreich, Wien </container-title>
    <date>1989</date>
  </sequence>
  <sequence>
    <author>V. Gessner: </author>
    <title>Rechtssoziologie und Rechtspraxis. Zur Rezeption empirischer Rechtsforschung, </title>
    <journal>in: Soziale Welt </journal>
    <volume>35 (</volume>
    <date>1984)</date>
  </sequence>

I assume it belongs in <editor> and <journal> and not as a suffix to <title>, but please let me know if that's a wrong assumption. Will it be removed by the normalizers?

@cboulanger
Contributor Author

I posted the current version (cleanup is still ongoing) to a gist: https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555

@inukshuk
Owner

Yes, 'in' should definitely go with editors (it's a good marker!). The editor normalizer will strip it off. I'm not sure I've seen it often in the context of journals but we'd obviously follow the same approach there (would have to check if the journal normalizer already strips it though).
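Just to make the stripping explicit, here is a rough sketch of that kind of cleanup (the actual AnyStyle normalizer does more than this, e.g. parsing the names themselves):

  value = 'in: R. Teichmann (Hrsg.): '
  value.sub(/\A\s*in[:,]?\s+/i, '').strip
  # => "R. Teichmann (Hrsg.):"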

@cboulanger
Contributor Author

cboulanger commented Aug 13, 2022

Any idea about the ruby: vmath.c:281: xvm_expma: Assertion 'r != NULL && ((uintptr_t)r % 16) == 0' failed. error?

@inukshuk
Owner

Maybe an empty tag somewhere?
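A quick way to check for that (a sketch using Ruby's bundled REXML library; the file name is just an example):

  require 'rexml/document'

  doc = REXML::Document.new(File.read('zfrsoz-footnotes.xml'))
  doc.get_elements('//sequence').each_with_index do |seq, i|
    if seq.elements.empty? && seq.text.to_s.strip.empty?
      puts "empty <sequence/> at position #{i + 1}"
    end
  end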

@cboulanger
Contributor Author

Is there a chance you could try to train a parser model with https://gist.github.com/cboulanger/9417648552d775d523d6961d575bc555 to see if you get the error as well or if it is just my setup?

@cboulanger
Contributor Author

cboulanger commented Aug 15, 2022

Trying to train a model with this gold, I am getting

INFO [2022-08-11 18:37:56 +0200] wapiti: load patterns
INFO [2022-08-11 18:37:57 +0200] wapiti: initialize model
INFO [2022-08-11 18:37:57 +0200] wapiti: nb train:    1865
INFO [2022-08-11 18:37:57 +0200] wapiti: nb labels:   13
INFO [2022-08-11 18:37:57 +0200] wapiti: nb blocks:   97424
INFO [2022-08-11 18:37:57 +0200] wapiti: nb features: 1274624
INFO [2022-08-11 18:37:57 +0200] wapiti: training model with l-bfgs
ruby: vmath.c:281: xvm_expma: Assertion `r != NULL && ((uintptr_t)r % 16) == 0' failed.

Any idea how I could debug this? I tried to get an extended stack trace, but to no avail. It would be so nice if I could get these two new XML training docs (1, 2) working with AnyStyle.

@inukshuk
Owner

Looking only at the first of the linked datasets above, there are a few issues that cause wapiti to bail out. If you want to debug the native module you need to attach gdb; however, if a NULL assertion fails, it's almost always because you have an empty tag somewhere. In your dataset there are two empty <sequence/> tags, and the file also includes two <dataset> elements, which is not supported.

Here's a diff to fix the first dataset:

*** /home/dupin/Downloads/zfrsoz-footnotes.xml	2022-08-17 11:05:56.104535376 +0200
--- zfrsoz-footnotes.xml	2022-08-17 11:36:27.720096975 +0200
***************
*** 6290,6296 ****
      <note>Mainz</note>
      <date>1982</date>
    </sequence>
-   <sequence/>
    <sequence>
      <author>Ministerium für Arbeit, Gesundheit und Sozialordnung:</author>
      <title>Die Situation der Frau in Baden-Württemberg,</title>
--- 6290,6295 ----
***************
*** 12850,12857 ****
      <volume>23/März</volume>
      <date>1990</date>
    </sequence>
- </dataset><?xml version='1.0' encoding='UTF-8'?>
- <dataset>
    <sequence>
      <editor>Armer/Grimshaw (Hrsg.), </editor>
      <title>Comparative Social Research Methodological Problems and Strategies (New York, London, Sydney, Tokio </title>
--- 12849,12854 ----
***************
*** 19142,19148 ****
      <note>Mainz </note>
      <date>1982</date>
    </sequence>
-   <sequence/>
    <sequence>
      <author>Ministerium für Arbeit, Gesundheit und Sozialordnung: </author>
      <title>Die Situation der Frau in Baden-Württemberg, </title>
--- 19139,19144 ----
***************
*** 25702,25705 ****
      <volume>23/März </volume>
      <date>1990</date>
    </sequence>
! </dataset>
\ No newline at end of file
--- 25698,25701 ----
      <volume>23/März </volume>
      <date>1990</date>
    </sequence>
! </dataset>

As a general observation, those datasets are very large. It's my feeling that it's better to have a smaller set with fewer inconsistencies than a larger set with more errors, though I don't have hard evidence to back this up. Smaller datasets make for quicker training, so that's definitely a point in favor of a smaller model. What I'd suggest, if you have such large sets, is to train on only a small subset first, then use that model to check the rest of the data. If there's a high error rate I'd make the training set larger. Once the error rate is low I'd only pick out those sequences that produce errors and add only those to the training set (or review them first, because errors often point to inconsistencies in the marked-up data).

Finally, as a general tip, you can usually spot errors in large datasets quickly by using a binary search approach: keep training with one half of the dataset until there's no error. This way you can usually limit the faulty section to a small set that's easily reviewable.
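A sketch of one halving step (standard-library REXML; the file names are examples) that writes each half of the <sequence> elements to its own file, so each half can be trained separately:

  require 'rexml/document'

  doc  = REXML::Document.new(File.read('zfrsoz-footnotes.xml'))
  seqs = doc.get_elements('//sequence')
  half = seqs.size / 2

  { 'half-1.xml' => seqs[0...half], 'half-2.xml' => seqs[half..-1] }.each do |name, part|
    out = REXML::Document.new('<?xml version="1.0" encoding="UTF-8"?><dataset/>')
    part.each { |seq| out.root.add_element(seq.deep_clone) }
    File.open(name, 'w') { |f| out.write(f) }
  end

Repeating this on whichever half still triggers the assertion quickly narrows the problem down to a handful of sequences.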

@cboulanger
Contributor Author

Thanks so much for looking into it, and I am embarrassed that the XML contained junk - I did check for empty tags (but not on the <sequence> node) and I did try to validate, but I must have used the wrong tool for it! Maybe in some future version a validation step could be added that would immediately raise an error about invalid XML.
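For what it's worth, a bare-bones well-formedness check is possible with the standard library alone (a sketch; the file name is an example):

  require 'rexml/document'

  begin
    REXML::Document.new(File.read('zfrsoz-footnotes.xml'))
    puts 'XML is well-formed'
  rescue REXML::ParseException => e
    puts "invalid XML: #{e.message}"
  end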

I'll break up the large xml into smaller parts based on the discipline (there's computer science, natural sciences, and social sciences in it), which might allow some interesting tests of the performance of a domain-specific vs. general-purpose dataset.

@cboulanger
Contributor Author

The multiple root problem was actually a copy/paste error when uploading the data as a gist, sorry. But removing the empty <sequence/> node and splitting up the big xml into three smaller ones did the trick! Thank you very much. All models are now trained!

I've put the individual parser training files in here:

I've put a lot of work into cleaning up and fixing the annotations, throwing out a large number of sequences which were poorly annotated. So at least in theory, the annotations should be of fairly high quality.

@cboulanger
Contributor Author

OK, the performance of this material, at least measured against gold.xml, isn't that great:

Model file test/models/parser-excite-computer-science.mod:
Checking gold.xml.................1252 seq  75.01%   5524 tok 15.26%  4s
Checking excite-computer-science.x  54 seq   1.48%    127 tok  0.13% 11s
Model file test/models/parser-excite-natural-science.mod:
Checking gold.xml.................1275 seq  76.39%   5958 tok 16.46%  4s
Checking excite-natural-science.xm   6 seq   0.79%     20 tok  0.09%  2s
Model file test/models/parser-excite-social-science.mod:
Checking gold.xml................. 945 seq  56.62%   3437 tok  9.49%  4s
Checking excite-social-science.xml 139 seq   2.82%    271 tok  0.26% 12s
Model file test/models/parser-zfrsoz-footnotes.mod:
Checking gold.xml.................1620 seq  97.06%   9073 tok 25.06%  4s
Checking zfrsoz-footnotes.xml..... 113 seq   5.97%    232 tok  0.84%  3s

The consistency of the annotations seems to be quite good, as seen when the model is checked against its own training material.

@inukshuk
Owner

Well, those datasets differ considerably from the data in gold.xml, so I wouldn't expect them to match it very well. I'd definitely check out the inconsistencies (by creating a delta dataset), because you have a few hundred inconsistently labeled references there. For comparison, between gold.xml and core.xml we usually have only a handful (and those are often difficult cases like container-title vs. journal or director vs. editor, etc.).

That said, if you're looking for a combined model that gives good results for both datasets, I'd add somewhere between 50 and 250 footnote references (aiming for a representative sample, of course) to the core set and use that to train the model, adding more footnote references as necessary.
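A sketch of how such a combined set could be put together (the file names and sample size are assumptions, and a hand-picked sample may be more representative than a purely random one):

  require 'rexml/document'

  core = REXML::Document.new(File.read('core.xml'))
  foot = REXML::Document.new(File.read('zfrsoz-footnotes.xml'))

  foot.get_elements('//sequence').sample(150).each do |seq|
    core.root.add_element(seq.deep_clone)
  end

  File.open('core-combined.xml', 'w') { |f| core.write(f) }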
