Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What are the rules for putting spaces around prefixes and suffixes? #13

Open
ndw opened this issue May 24, 2020 · 23 comments
Open

What are the rules for putting spaces around prefixes and suffixes? #13

ndw opened this issue May 24, 2020 · 23 comments

Comments

@ndw
Copy link

ndw commented May 24, 2020

I can't find any discussion in the specification of when spaces are added before and/or after prefixes and/or suffixes. Indeed some of the examples in the spec appear to explicitly make spaces part of the prefix and/or suffix. (Apologies if I've overlooked the relevant part of the spec.)

In affix_PrefixFullCitationTextOnly.txt, the expected result is

(As written in Smith 2000)

By my naive interpretation of the test, I think the output should be

(As written inSmith 2000)

There is no trailing space in the item prefix.

@fbennett
Copy link
Member

Yes, this is beyond the CSL specification. The citeproc-js processor adds implicit trailing space to cite prefixes when joining on "romanesque" characters, and similarly adds implicit leading space on cite suffixes unless the leading character is a punctuation mark. (If there are objections, tests that exercise this behaviour could be removed from the CSL test suite.)

@ndw
Copy link
Author

ndw commented May 24, 2020

For what it's worth, I think any behavior that is "beyond the specification" should cause the test to fail. The specification should demand a trailing space or forbid a trailing space. If the specification is ambiguous, that's a bug in the spec.

@bdarcus
Copy link
Member

bdarcus commented May 24, 2020

I agree. If the spec is unclear now, but it's straightforward to fix, we should do that.

If there's some debate, then we should remove from the test suite, or maybe put it in some "pending" directory or something.

Flagging this for a few people, since it bears on the clarity of the spec. Would be nice if we could plug these holes are Norm is identifying them, and we're working on resolving lingering issues on the schema.

@rmzelle @adam3smith @bwiernik @denismaier

@bwiernik
Copy link
Member

@fbennett I know that citeproc-js has a fair amount of these sorts of behaviors. Do you have a suggestion for how we might go through and identify them systematically?

@adam3smith
Copy link
Member

For what it's worth, I think any behavior that is "beyond the specification" should cause the test to fail. The specification should demand a trailing space or forbid a trailing space. If the specification is ambiguous, that's a bug in the spec.

For the general issue: we're well aware that there are many such cases. Given the amount of attention to detail and fixes for specific issues have gone into citeproc-js and thus the test suite, I'd be surprised if there were fewer than 50 such tests that cover issues not made explicit in the specifications and I think we actually have some language somewhere that says the tests are right where the specification doesn't cover something. Translating these into specifications would definitely be welcome (and something Rintze and I have talked about for a while), but it's a massive effort and is probably best be done piecemeal, lest it delays a spec update by another couple of years.

For the specific issue, though, I don't believe the specification even addresses prefix/suffix on cites. We could decide to change that (although for this particular case I'm not certain we have to*), but unless we do, the correct interpretation is that the CSL spec sees this as a downstream design implementation and citeproc-js made a specific call.

  • because the way affixes are assigned to cites is typically integrated into the data model to communicate citation information to a specific processor, the case for us to standardize this is weaker than for issues where identical CSL-JSON may produce different outputs.

@bdarcus
Copy link
Member

bdarcus commented May 24, 2020

For the specific issue, though, I don't believe the specification even addresses prefix/suffix on cites. We could decide to change that ...

Indeed, there's been recent discussion of just this.

@ndw
Copy link
Author

ndw commented May 24, 2020

If CSL is defined by the citeproc-js implementation, the spec should say that. If it isn't, then the test suite should reflect what the spec says, not what the citeproc-js implementation does.

I want to generate citations and bibliographic entries from tagged data. Hand coding every possible format is impractical. Having some kind of abstraction seems like a much better approach.

CSL was suggested so I started doing a little implementation work, just to get a sense for how much effort was involved and whether any sort of incremental approach would be useful. If the challenge is "implement the spec", that might be hard work and might require clarifications in the spec (there's nothing like a new implementor to reveal where the spec isn't clear). If the challenge is "implement a processor that's compatible with the output of citeproc-js", that's a whole different level of effort and complexity.

If the spec doesn't completely and independently define the language, then it's not a spec, it's a collection of notes that go with the reference implementation. I'm not saying that's bad or wrong or not what you intended, but it's not a solid foundation for new implementations.

@adam3smith
Copy link
Member

The specification is a complete specification of the syntax of CSL styles, i.e. (mostly) exhaustive for people interesting in coding styles.
The specification is indeed not an even remotely complete specification of the expected output to be produced by CSL processors for all input, i.e. not close to exhaustive for people coming from your perspective. I'm sorry if you found the labeling here misleading.

To facilitate implementation we have assumed what was indeed originally the citeproc-js test-suite into the CSL set of repositories, since citeproc-js is the oldest citeproc and the vast majority of CSL styles have been tested against it; so passing the test suite will guarantee expected outcome for existing CSL styles. For now, that's the best we can offer. As per the above, it'd be great to have the spec be complete, but the amount of work that would entail is massive and I'd be (pleasantly) surprised if it could happen in the next 12 month. At a minimum, it would require sustained effort by multiple people beyond labor that's been available to CSL over the last ~5 years.
There are multiple citeprocs that have been developed and continue to be developed, so this isn't impossible, but it'd certainly be understandable if it's not feasible given other existing priorities.

@bdarcus
Copy link
Member

bdarcus commented May 24, 2020

To facilitate implementation we have assumed what was indeed originally the citeproc-js test-suite into the CSL set of repositories, since citeproc-js is the oldest citeproc and the vast majority of CSL styles have been tested against it; so passing the test suite will guarantee expected outcome for existing CSL styles. For now, that's the best we can offer. As per the above, it'd be great to have the spec be complete, but the amount of work that would entail is massive ...

Maybe, then, we need explicitly say this in the right places; at the top of the README for the test suite, for example?

It does seems like a strategy to address this over time is to first identify those tests that are inconsistent with the spec, so we at least know where the current gaps are, can note them so implementers are aware, and then fix them as we can?

@adam3smith
Copy link
Member

Maybe, then, we need explicitly say this in the right places; at the top of the README for the test suite, for example?

Yes, that seems right to me. We say something along those lines on the website already: https://citationstyles.org/developers/ and I think similar or even more explicit text in the introduction of the test-suite makes a lot of sense.

It does seems like a strategy to address this over time is to first identify those tests that are inconsistent with the spec, so we at least know where the current gaps are, can note them so implementers are aware, and then fix them as we can?

That also seems right, though I think "inconsistent" really isn't the right term for a lot of them. "Aren't current addressed in..." makes more sense.

I would also at least consider whether this should all go in the same document. A lot of these tests are descriptions of the kind "in situation XYZ, a citeproc should..." -- the current specifications are a document that principally describes "to code a citation style, you should...". As I suggest above, the two aren't designed for the same user group and I'm somewhat concerned that, by making all processor behavior explicit in the specs, they become virtually unusable for style authors (they're intimidating enough for that purpose already).

@bdarcus
Copy link
Member

bdarcus commented May 25, 2020 via email

@fbennett
Copy link
Member

fbennett commented May 25, 2020

Two observations that I hope will not further complicate the discussion.

  1. While it doesn't have any bearing on decisions over content of the CSL specification, adding a trailing space to the prefix in affix_PrefixFullCitationTextOnly.txt test is fine as far as citeproc-js goes: it passes the test either way. The same goes for the prefix attribute in any similar tests.
  2. An important item not covered in the CSL specification, but very much relevant to any implementation of the language is that styles in the CSL repository very often set joining punctuation in affixes. When elements are omitted, this can result in duplicate punctuation or spaces, when the suffix on one element is joined directly to the prefix on another. When the problem was raised by Andrea Rossato, one of two choices had to be made: to control for such "accidental" duplicate punctuation in the implementations; or to address the issue in the styles by refactoring them to set joining punctuation exclusively as delimiters. At that point in time, there were a large number of styles in circulation, and insufficient resources to refactor all of them (since that would involve rewriting portions of each). So code was introduced into citeproc-js to eliminate such accidental duplicate punctuation and spaces at render-time. Any implementation that processes existing styles faces the same choice between cleanup in the output or refactoring of the entire style repository. (The CSL-M variant of the language, which was spun off soon after, did not have the weight of existing styles to content with, and so had the luxury of prohibiting trailing space on suffixes and leading space on prefixes in its syntax definition.)

@ndw
Copy link
Author

ndw commented May 25, 2020

I completely understand the necessity of making pragmatic choices. However, at the risk of being perceived as argumentative, the behavior of citeproc-js is either conformant to the specification or it is not. The specification must make that clear or it isn't useful to prospective implementors.

I don't care (I do in fact care, but my personal opinions aren't relevant to this discussion) how this is resolved, but I think there are two choices:

  1. The citeproc-js implementation is conformant. The spec must be changed to describe in precise detail how adjacent "duplicate" punctuation is to be removed.
  2. The implementation is not conformant. The test results must be changed to reflect conformant results and implementations should fail tests where their behavior is non-conformant.

If the corpus is too large to refactor, put a version number on the style element and say that removing adjacent punctuation is manditory on "old" styles and forbidden on "new" ones.

@bwiernik
Copy link
Member

bwiernik commented May 25, 2020

I think that ultimately producing style-authoring and citeproc-implementing spec documents will be valuable.

In the interim, explicitly stating that the test suite results should considered canonical in cases where the spec is ambiguous would address much of this issue. (For other cases where Norman notes CSL-M-specific content in tests, these should be explicitly labeled as such or removed.)

@denismaier
Copy link
Member

I think that ultimately producing style-authoring and citeproc-implementing spec documents will be valuable.

In the interim, explicitly stating that the test suite results should considered canonical in cases where the spec is ambiguous would address much of this issue. (For other cases where Norman notes CSL-M-specific content in tests, these should be explicitly labeled as such or removed.)

Yes.

Concerning one or two docs: Having two distinct documents is a good idea, but I guess there will be some overlap though. Where the document for style authors will say "do X if you want Y output" the document for citeproc implementers will say "citeprocs are expected to process X in a Y way". I don't know if there is an elegant way to solve this and to keep both documents in sync.

@bwiernik
Copy link
Member

I think the citeproc spec would be supplemental to the main spec.

@denismaier
Copy link
Member

denismaier commented May 25, 2020

2. An important item not covered in the CSL specification, but very much relevant to any implementation of the language is that styles in the CSL repository very often set joining punctuation in affixes. When elements are omitted, this can result in duplicate punctuation or spaces, when the suffix on one element is joined directly to the prefix on another.

Just for the record: Biblatex has a very nice mechanism for this problem: you add \newunit or \newblock to add delimiters, but those get only printed with the next field. And there will always only be one new unit delimiter. It's like adding punctuation on a stack, but only if the stack is currently empty or if you add a higher value delimiter.

@bdarcus
Copy link
Member

bdarcus commented May 25, 2020

If the corpus is too large to refactor, put a version number on the style element and say that removing adjacent punctuation is manditory on "old" styles and forbidden on "new" ones.

Numbers:

  • 871 tests here.
  • over 8,500 CSL styles in our repo (though most are dependent, so basically aliases that contain no processing instructions)
  • dozens of implementations
  • hundreds of thousands of users (I think; not sure on the specifics here)

We already have a version number on the style element. All current styles are "1.0".

So it would be feasible to do a "1.1" release that did this; cleaned up and simplified the currently implicit processing expectations, and so also made clear new styles need to account for duplicate punctuation, etc. We've been discussing new releases recently elsewhere, and we can include this.

But if we did, we'd obviously want to identify any other similar examples.

I really think the best immediate path on the test-suite is to create a separate subdirectory and start to move tests like this to that, and while doing it, add some metadata to these that tag them as such, maybe with some description. We could start with this one.

With 871 tests, this will take time.

@fbennett
Copy link
Member

To simplify things for the present, I've added explicit trailing space on all cite prefixes, and explicit leading space on all suffixes that require it. (264f3bf), retaining a copy of the original fixtures in the citeproc-js source. Hope this helps.

@bdarcus
Copy link
Member

bdarcus commented May 29, 2020

To simplify things for the present, I've added explicit trailing space on all cite prefixes, and explicit leading space on all suffixes that require it. (264f3bf), retaining a copy of the original fixtures in the citeproc-js source. Hope this helps.

@ndw - does this indeed help, so I can close this?

I'll leave the broader #17 issue for addressing your broader point longer term, though at this point most of the work needs to happen in documentation.

@bwiernik
Copy link
Member

bwiernik commented May 29, 2020

We should discuss spacing behavior, perhaps in a different issue. I think that citeproc-js's behavior here is correct, especially for GUI clients like Zotero or Mendeley, where whether a space if provided at the end of a prefix is often not clear to the user.

@bdarcus
Copy link
Member

bdarcus commented May 29, 2020

We should discussion spacing behavior, perhaps in a different issue. I think that citeproc-js's behavior here is correct, especially for GUI clients like Zotero or Mendeley, where whether a space if provided at the end of a prefix is often not clear to the user.

So maybe open an issue on the documentation repo to clarify the spec on this, link to this issue, and then revise the test to confirm to the new spec language?

@bdarcus
Copy link
Member

bdarcus commented Jun 5, 2020

We should discussion spacing behavior, perhaps in a different issue. I think that citeproc-js's behavior here is correct, especially for GUI clients like Zotero or Mendeley, where whether a space if provided at the end of a prefix is often not clear to the user.

I opened a new linked issue. But CSL is not for GUI clients only, so we can't base decisions on that. I have a hunch they can tweak their code to help users here.

Whatever we settle on, though, we need to make it explicit in the spec, per the point of this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants