More complete EDTF data model support #145

Open
cormacrelf opened this issue Oct 16, 2021 · 10 comments

@cormacrelf
Contributor

cormacrelf commented Oct 16, 2021

Upcoming CSL-JSON changes include support for EDTF as a date input format. I recently implemented EDTF, and I have some thoughts about how we can make use of its features in CSL.

What EDTF has that we don't

EDTF is a great format for CSL, because we have supported date ranges since forever, and some of the unofficial date formats we use resemble EDTF already. However, it adds four new things we did not have before.

  1. Unspecified parts of dates, using the X character to blot them out.
  2. A flag for "approximate" in addition to "uncertain"
  3. A datetime representation, e.g. 2019-07-16T01:57:29Z.
  4. A defined calendar.

Unspecified date parts / 1999-XX and friends

You might think that we could just add terms for month-unspecified and day-unspecified and call it a day. But I think we'd be missing out -- the spec doesn't advertise it very well, but the feature is more expressive than that.

There are a few different variations on the XX in EDTF level 1. In my opinion the spec should have named them like so: 19XX => century, 199X => decade, 1999-XX => month of year, 1999-XX-XX => day of year, 1999-07-XX => day of month. Styles/locales could render 19XX as "20th century" or "1900s" if they so wished! However, given this is academic citation, I'm not sure how useful that would be. If anyone can point to a style that might want special rendering for any of these forms, then it's something we can definitely do.
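
To make the naming above concrete, here is a minimal Python sketch that classifies the level 1 unspecified forms. The kind names (`century`, `day-of-month`, and so on) and both function names are hypothetical and follow the suggestion in this comment, not the EDTF spec. Negative years are ignored for simplicity.

```python
import re

# Hypothetical names for the level 1 "unspecified digit" forms, following
# the naming suggested in this comment (the EDTF spec does not name them).
_PATTERNS = [
    (re.compile(r"\d{2}XX"), "century"),
    (re.compile(r"\d{3}X"), "decade"),
    (re.compile(r"\d{4}-XX"), "month-of-year"),
    (re.compile(r"\d{4}-XX-XX"), "day-of-year"),
    (re.compile(r"\d{4}-\d{2}-XX"), "day-of-month"),
]


def unspecified_kind(edtf: str):
    """Return which part of a level 1 EDTF date is unspecified, or None."""
    for pattern, kind in _PATTERNS:
        if pattern.fullmatch(edtf):
            return kind
    return None


def render_century(edtf: str) -> str:
    """Render a form like 19XX as '1900s' (one possible style choice)."""
    if unspecified_kind(edtf) != "century":
        raise ValueError(f"not a century form: {edtf!r}")
    return edtf[:2] + "00s"


print(unspecified_kind("1999-07-XX"))  # day-of-month
print(render_century("19XX"))          # 1900s
```

A style or locale could then map each kind to its own term, which is where the "20th century" versus "1900s" choice would live.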

Approximate

We currently have is-uncertain-date, the circa term, and "circa": true in CSL-JSON. For reference, EDTF encodes its uncertainties as ? => uncertain, ~ => approximate, % => both.

On a basic level, you could add terms for approximate and approximate-uncertain, and also add is-approximate-date="issued" as a conditional test.

One complication is that EDTF makes approx/uncertain a property of each end of a date range, i.e. you can have 1999?/2003 meaning (uncertain 1999) to 2003. Our current model is insufficient for that: it can only work with a date as a whole. You could therefore add a certainty date part as well, which simply renders one of the three terms or nothing, in either the single date or on each end of the range. This would be an improvement over the existing syntax even ignoring the approximate addition.
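
As a rough illustration of the per-end model, the sketch below splits a range and records the qualifier on each end separately. `QualifiedDate`, `split_qualifier`, `parse_range` and the term names are all hypothetical. This is not a full EDTF parser: it only looks at a trailing qualifier and does not handle open-ended ranges such as `1985/..`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# EDTF level 1 qualifier characters mapped to hypothetical CSL term names.
_QUALIFIERS = {
    "?": "uncertain",
    "~": "approximate",
    "%": "approximate-uncertain",
}


@dataclass
class QualifiedDate:
    date: str                 # the date text with any qualifier stripped
    certainty: Optional[str]  # term name, or None if the date is unqualified


def split_qualifier(part: str) -> QualifiedDate:
    """Strip a trailing ?, ~ or % and record which one it was."""
    if part and part[-1] in _QUALIFIERS:
        return QualifiedDate(part[:-1], _QUALIFIERS[part[-1]])
    return QualifiedDate(part, None)


def parse_range(edtf: str) -> Tuple[QualifiedDate, Optional[QualifiedDate]]:
    """Return (start, end); end is None for a single date."""
    start, sep, end = edtf.partition("/")
    return split_qualifier(start), (split_qualifier(end) if sep else None)


start, end = parse_range("1999?/2003")
print(start.certainty, end.certainty)  # uncertain None
```

A certainty date part would then render `start.certainty` and `end.certainty` independently, rather than one flag for the whole date.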

Date time representation

My favourite citation style, AGLC4, now supports citing tweets/forum posts/videos, and requires a timestamp as well as a date. It renders them like so:

Social media posts, forum posts and online videos uploaded to sites such as YouTube may be cited as follows:

Username, Title (Social Media Platform, Full Date, Time) <URL>.

... The time zone from which the post is accessed (eg ‘AEDT’) should be included if the social media platform adjusts the time based on the local time zone.

@s_m_stephenson (Scott Stephenson) (Twitter, 17 July 2017, 9:37pm AEST) <https://twitter.com/s_m_stephenson/status/887169425551441921>, ....

I don't think this will be the only one out there. We don't currently support times at all, and I think we should.

A couple of notes about this:

  • You might want a bunch of new <date-part>s for each one, but alternatively you could have only one new <date-part name="time" format="..." /> and just tell styles/locales to supply a time format string and reference one of the popular encodings for that.
  • I'd say <date-part name="timezone" /> as well.
  • EDTF's timezones are optional but can be set to Z (= UTC) or a +/- UTC offset in hours or hours:minutes. They are really just offsets, not zones.
  • That's not enough information to render "AEDT". If you add tz database entries like Australia/Melbourne that's probably enough info to query a list of known abbreviations for that tz at that time of year, DST-wise (but the abbreviations are not nearly as standardised as the tz names).
  • The Temporal API's datetime format allows people to store the zone in tz database entry format in addition to the offset, i.e. +03:00[Africa/Nairobi]. Not sure if we'd want that (complicates edtf parsing, is technically a completely new format if we bolt it on after a valid EDTF, so no thanks) but maybe some JSON way of specifying this would help.
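
To show why an offset alone is not enough for "AEST", here is a small sketch using only the Python standard library. `parse_offset` is a hypothetical helper, and the `+10:00` offset for the tweet above is an assumption based on AEST being UTC+10.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches "+10", "-03", "+10:00", "-03:30" (hours or hours:minutes).
_OFFSET = re.compile(r"(?P<sign>[+-])(?P<hours>\d{2})(?::(?P<minutes>\d{2}))?")


def parse_offset(text: str) -> timezone:
    """Parse an EDTF-style zone suffix: 'Z', '+10' or '+10:00'."""
    if text == "Z":
        return timezone.utc
    match = _OFFSET.fullmatch(text)
    if match is None:
        raise ValueError(f"not a UTC offset: {text!r}")
    delta = timedelta(
        hours=int(match["hours"]),
        minutes=int(match["minutes"] or 0),
    )
    return timezone(-delta if match["sign"] == "-" else delta)


# The tweet time from the AGLC4 example, assuming an offset of +10:00.
posted = datetime(2017, 7, 17, 21, 37, tzinfo=parse_offset("+10:00"))
print(posted.isoformat())  # 2017-07-17T21:37:00+10:00
print(posted.tzname())     # UTC+10:00
```

The fixed offset only yields a generic name such as `UTC+10:00`. Getting an abbreviation like "AEST" needs a tz database entry such as Australia/Melbourne, which the EDTF string does not carry.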

A defined calendar

AFAIK CSL has never operated within a specific calendar; it just renders what you put in. EDTF uses the ISO 8601 calendar, see my notes here on what that means: https://docs.rs/edtf/0.2.0/edtf/#notes-on-edtf-and-the-iso-8601-calendar-system. (Obviously you would render these in Gregorian style generally, i.e. 0000 renders as 1BC, -0099 as 100BC.) For modern dates, that's the same as we would normally write them, but in some places dates weren't written in the modern Gregorian calendar until the early 1900s (e.g. Russia, 1918). The UK only switched in 1752. That's really not that long ago, especially since some case law/legislation from before then is still cited fairly frequently.
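
The year mapping mentioned in parentheses above can be sketched in a few lines. `render_year` is a hypothetical helper, and the "BC" suffix with a space is just one possible term text.

```python
def render_year(iso_year: int) -> str:
    """Render an ISO 8601 year number in BC/AD style.

    ISO 8601 has a year 0, which corresponds to 1 BC, so years at or
    below zero shift by one when written as BC.
    """
    if iso_year <= 0:
        return f"{1 - iso_year} BC"
    return str(iso_year)


print(render_year(0))     # 1 BC
print(render_year(-99))   # 100 BC
print(render_year(1752))  # 1752
```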

Idea 1: Accuracy of old dates

I don't think you'll find any citation styles which dictate what calendar to write dates in, but that isn't to say that the problem doesn't exist; in fact it is probably part of the problem for historians, since nobody is forcing anyone else to write what kind of date something is. We could tip the scales with a very simple feature: a configuration in a style or a locale (?) which sets the start of the modern era for dates. Any date before this could be rendered with a term for new style dates (e.g. (n.s.)), thus forcing people to check that it actually is a new style date.
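
A minimal sketch of that "start of the modern era" setting, assuming it is a single cutoff date. `MODERN_ERA_START`, `render_with_style_marker` and the default term text are hypothetical. The cutoff of 14 September 1752 (the first Gregorian day in Britain) is only an illustration of what a style or locale might choose.

```python
from datetime import date

# Hypothetical style/locale setting: the first day of the "modern era".
# 1752-09-14 is used purely as an example value.
MODERN_ERA_START = date(1752, 9, 14)


def render_with_style_marker(d: date, rendered: str, term: str = "(n.s.)") -> str:
    """Append the new-style term to dates that fall before the cutoff."""
    if d < MODERN_ERA_START:
        return f"{rendered} {term}"
    return rendered


print(render_with_style_marker(date(1700, 3, 1), "1 March 1700"))
print(render_with_style_marker(date(1999, 1, 1), "1 January 1999"))
```

Only the first call gets the marker, since 1700 is before the cutoff and 1999 is not.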

A much more complex feature would be the configurable rendering of dates in other calendars. I'm pretty sure @fbennett had a feature for rendering the oddities of Japanese calendars, but I'm not sure we should require every CSL implementation to do complex calendar maths. It could be an optional thing. If we wanted such a feature, we could make the Unicode CLDR calendars optional. (Although, CLDR does not include Julian! How did they manage to omit it???)

Idea 2: Days of the week

Again, I don't know if any styles demand this, but until now it has not been technically possible to know which day of the week something is, because CSL didn't define a calendar. If you make CSL calendar aware, you get days of the week for free.
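
As a quick illustration, once a date is pinned to a defined calendar the weekday is a simple lookup. The sketch below uses Python's `datetime.date`, which is proleptic Gregorian and limited to years 1 to 9999. `weekday_name` and the English day names are hypothetical stand-ins for locale terms.

```python
from datetime import date

# English names as stand-ins for what would be locale terms.
WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def weekday_name(iso_date: str) -> str:
    """Return the weekday name for a full YYYY-MM-DD date."""
    return WEEKDAYS[date.fromisoformat(iso_date).weekday()]


print(weekday_name("2017-07-17"))  # Monday
```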

In summary

EDTF opens up a couple of new opportunities that are worth considering. The most obviously valuable one appears to be datetimes, but there are a lot of possibilities.

@bdarcus
Member

bdarcus commented Oct 16, 2021

> EDTF opens up a couple of new opportunities that are worth considering. The most obviously valuable one appears to be datetimes, but there are a lot of possibilities.

Yes, and basic 8601 dates are still valid.

Little thing: I've never understood the uncertain/approximate distinction, at least as it applies here. Do you?

@cormacrelf
Contributor Author

I think it boils down to the words themselves:

  • Approximate is an approximation, a guess. You don't know exactly when it was but you estimate that it was X. A little bit bell curve shaped.
  • Uncertain is when you are getting your figures from somewhere but you are not sure if they are right. Conflicting information would also do it, both of the same apparent precision but since you have two different suggested dates when something happened, you can't be certain.
  • A date would be both approximate and uncertain when (e.g.) you got your estimate from someone else, and you aren't sure if they are right. Etc.

@cormacrelf
Contributor Author

If anything "circa" should be for approximation, not uncertainty.

@bdarcus
Member

bdarcus commented Oct 16, 2021

> If anything "circa" should be for approximation, not uncertainty.

So then what should a CSL processor do with an uncertain date?

I had wondered if it should treat both as circa, but I guess we can treat them separately in the spec as well, so that a style could output "1521?" or "c. 1521", or even "c. 1521?"?

@denismaier
Member

Some styles might treat it as meaning the same, but in general I think ca. vs ? sounds reasonable.

@bdarcus
Member

bdarcus commented Oct 16, 2021

> Some styles might treat it as meaning the same, but in general I think ca. vs ? sounds reasonable.

Right; so we definitely need to support both explicitly for input (as in edtf) and styles, and of course feature edtf in general prominently in the documentation, once we figure out our plan.

@cormacrelf
Contributor Author

The issue with circa is if you make it synonymous with "approximate", you are left to deal with is-uncertain-date having to be backwards.

@bdarcus
Member

bdarcus commented Oct 16, 2021

So just to make sure I understand, @cormacrelf:

> The issue with circa is if you make it synonymous with "approximate", you are left to deal with is-uncertain-date having to be backwards.

You are saying:

  1. edtf approximate = csl circa
  2. but that conflicts with the current csl is-uncertain-date
  3. therefore, the implication is we should change the meaning of is-uncertain-date and add is-approximate-date to csl, and update all existing styles to use the latter instead?

Obviously that could be a little painful, but not that big a problem (to convert the styles is just a simple replacement).

@bwiernik
Member

I had thought we had already implemented approximate and uncertain?

@bdarcus
Member

bdarcus commented Nov 12, 2021

Would be good to clarify. @cormacrelf?


4 participants