How to best deal with signal words to separate concatenated references #194

Open
cboulanger opened this issue Sep 1, 2022 · 7 comments

@cboulanger
Contributor

I work with references in footnotes, which often contain several references on one line, i.e. they are not cleanly separated like the entries in a bibliography. A number of signal words and punctuation marks make it very clear to a human reader where one citation starts and ends, but it is hard to work out exact rules for separating them so that the AnyStyle parser can do its magic.

The semicolon as a separator already goes a long way because it is normally not part of any citation style; however, semicolons can also be found in titles. I came up with a couple of regular expressions, but as always with regexes, you have to cover each and every case and there will always be false positives:

{
  // Words that typically introduce a new citation (English, then German)
  START_CITATION: [
    /(see |cf\.? |e\.g\. |accord )(also )?/gi,
    /(siehe |vgl\. |näher |etwa |beispielsweise )(dazu |hierzu )?(etwa |näher )?(auch)?/gi,
    /(dazu |hierzu )(etwa |näher )?(auch)?/gi,
    /(anders etwa |ähnlich auch )/gi,
    /(sowie )(bei )?/gi
  ],
  // Page ranges ending in "f.", "ff." or "passim", followed by ";" or "."
  END_CITATION: [
    /(\d+\s*)(-\s*)?(\d+\s*)?(f\.?|ff\.?| (et |and )?passim)\s*([;.]\s*)/
  ]
}
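
As a rough illustration of how patterns like these might be applied, the sketch below splits a footnote only at semicolons that are directly followed by a signal word, so that semicolons inside titles are left alone. The function name `splitFootnote` and the shortened word list are assumptions made for this example; nothing here is part of AnyStyle.

```javascript
// Hypothetical helper, not part of AnyStyle. The word list is a shortened,
// illustrative subset of the START_CITATION patterns above.
const SIGNAL = "(?:see|cf\\.?|e\\.g\\.|accord|siehe|vgl\\.|sowie)(?: also| auch| bei)?\\s+";

// A boundary is a semicolon plus whitespace, but only if a signal word follows.
const BOUNDARY = new RegExp(`;\\s+(?=${SIGNAL})`, "i");
// Matches a signal word at the very start of a candidate reference.
const LEADING_SIGNAL = new RegExp(`^${SIGNAL}`, "i");

function splitFootnote(footnote) {
  return footnote
    .split(BOUNDARY)
    .map((part) => part.replace(LEADING_SIGNAL, "").trim())
    .filter((part) => part.length > 0);
}

const example =
  "See Smith, Law; Order, 2001, 12 ff.; cf. also Müller, Recht, 1999; vgl. Weber, Staat, 1922.";
console.log(splitFootnote(example));
```

The semicolon in "Law; Order" is not treated as a boundary because no signal word follows it. The obvious weakness remains: concatenated references without any signal word are not separated at all.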

I wondered if it would make sense to train a separate model just with these words to preprocess the raw reference lines. What would be your approach to dealing with this problem?

Thank you.

@inukshuk
Owner

inukshuk commented Sep 1, 2022

I think, in general, footnotes should probably be parsed with a separate parser model that supports multiple references per sequence. I think the current parser model only has one or two features which take into account the position of a token in the input sequence (this way, for example, words towards the beginning could be weighted more towards being part of author names etc.). I doubt that these features are extremely important, and they could be dropped in a multi-reference model.

The other thing that would have to change are some normalizers that currently combine or re-label segments if there are multiple segments with the same label. And of course the decision about which segments constitute a single reference would have to be made -- but compared to your current situation you'd have all the labels to work with, which should make it much easier (e.g., if you encounter one of the stop words above, you already have author, title, and year, and the thing after the stop word is again an author, it's easy to make the call to treat them as two separate references).

Of course you could also try to use the existing parser for this: just train it with footnotes containing multiple references and use the XML output and then group the segments yourself. Once you've separated the references you can just pass them to the normalizer. In other words, you'd use the existing parser (but trained on multiple references per sequence) and instead of letting it label and normalize everything, you let it apply the labels, then you separate the segments into groups belonging to a single reference and pass each group to the normalizers. You'd still need to solve the same problem, but in addition you'd have a label for each word in your footnote applied by the CRF model.
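
The grouping step described above could be sketched roughly as follows. This is only an illustration of the decision rule from the previous paragraphs: the segment shape (`{ label, text }`), the `signal` label, and the function name `groupSegments` are assumptions for the example and not AnyStyle's own data model or API.

```javascript
// Hypothetical post-processing step, not part of AnyStyle: takes the labelled
// segments of one footnote and groups them into separate references.
function groupSegments(segments) {
  const groups = [];
  let current = [];
  let afterSignal = false;
  const has = (label) => current.some((s) => s.label === label);

  for (const segment of segments) {
    if (segment.label === "signal") {
      // Remember that a signal word was seen, but do not keep its text.
      afterSignal = true;
      continue;
    }
    // Rule from the thread: signal word + author, title and date already
    // collected + an author coming next => a new reference starts here.
    const complete = has("author") && has("title") && has("date");
    if (afterSignal && complete && segment.label === "author") {
      groups.push(current);
      current = [];
    }
    afterSignal = false;
    current.push(segment);
  }
  if (current.length > 0) groups.push(current);
  return groups;
}

const labelled = [
  { label: "signal", text: "Vgl." },
  { label: "author", text: "Müller, Heinz" },
  { label: "title", text: "Recht" },
  { label: "date", text: "1999" },
  { label: "signal", text: "siehe auch" },
  { label: "author", text: "Weber, Max" },
  { label: "title", text: "Staat" },
  { label: "date", text: "1922" },
];
console.log(groupSegments(labelled).length);
```

Each resulting group would then be handed to the normalizers on its own. The rule is deliberately strict; a real footnote corpus would likely need looser conditions, for example for references without a date.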

@cboulanger
Contributor Author

Thank you for your thoughts. It would probably make sense to introduce a new tag for these stop/signal words, since that, if correctly tagged by the parser, would help to find the beginning of new references. BTW, how does the learning work in this respect: would it suffice to have a list of tagged stop words in the training material (i.e. is it enough for each to occur once), or does frequency matter, i.e. would they have to appear, say, 20 times in the sequences to be correctly tagged as stop words?

@inukshuk
Owner

inukshuk commented Sep 1, 2022

If you train a model it will know about all the labels in the set, even if a label occurs just once.

For footnotes I would expect a lot of text unrelated to references; in the current core model I think we normally use note if there's some non-pertinent (or otherwise difficult to label) text -- using note mainly because it's also a convenient way to preserve it, since many citation formats have a note field. However, for footnotes you might really want to be able to drop unrelated text, so I'd start by adding one extra label and using it for the stop words and other unrelated sections.

@cboulanger
Contributor Author

Thanks, I'll try that. BTW to see what I am working on, I made a little screencast that showcases a web frontend to AnyStyle. If you are interested: https://owncloud.gwdg.de/index.php/s/u8AcKYwTn1F9PkL The end shows that there are still bugs.

@inukshuk
Owner

inukshuk commented Sep 4, 2022

Pretty cool, thanks for sharing!

@cboulanger
Contributor Author

I am experimenting with synthetic training data (with these signal words randomly inserted before the main reference), but the results aren't very good. Even though I have hundreds of training sequences such as, for example, <signal>Vgl.</signal><author>Müller, Heinz</author>..., the model will still often output {:author:[{:family:"Vgl. Müller", :given:"Heinz".... I wonder whether I will be able to avoid adding a dictionary feature for this... But I'll keep experimenting; maybe manually annotated material will work better than the synthetic one...

@inukshuk
Owner

inukshuk commented Sep 9, 2022

Like I said, I don't think that an additional feature would help in this case. In general, if a word like Vgl. is tagged consistently in the training data I'd be extremely surprised if the predictor would mislabel it.

One thing that may be happening: if you always insert the word at the beginning of your synthetic references when training, the model may recognize it only when it's at the start of a reference (this might be even more likely if you have other occurrences of 'vgl.' labelled differently elsewhere in the training set).
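
One way to test this hypothesis would be to generate synthetic sequences in which signal words also occur in the middle of a sequence, by concatenating several tagged references. The sketch below is only an illustration: the `<signal>` tag follows the example earlier in the thread, while the `<sequence>` wrapper, the word list, the function names, and the `signalRate` parameter are assumptions.

```javascript
// Hypothetical generator for synthetic training sequences, not part of AnyStyle.
const SIGNALS = ["Vgl.", "Siehe auch", "See also", "Cf."];

// Picks one item using the supplied random function (values in [0, 1)).
function pick(items, random) {
  return items[Math.floor(random() * items.length)];
}

// Joins several already tagged references into one sequence. Each reference
// is preceded by a signal word with probability signalRate, so signal words
// end up at the start, in the middle, or absent.
function buildSequence(references, random = Math.random, signalRate = 0.7) {
  const parts = references.map((reference) =>
    random() < signalRate
      ? `<signal>${pick(SIGNALS, random)}</signal>${reference}`
      : reference
  );
  return `<sequence>${parts.join("")}</sequence>`;
}

console.log(
  buildSequence([
    "<author>Müller, Heinz</author><title>Recht</title><date>1999</date>",
    "<author>Weber, Max</author><title>Staat</title><date>1922</date>",
  ])
);
```

Passing the random function as a parameter keeps the generator reproducible in tests. It would also be worth checking that no signal word carries a different label anywhere else in the training set.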

2 participants