Enumerations #43

Open
gregsdennis opened this issue Jun 7, 2023 · 14 comments

@gregsdennis
Member

https://github.com/Crell/enum-comparison

  • C/C++ - named integers
  • C# - named integers, but fields on static classes can support more complex objects (more of a pattern than language support), e.g. System.Drawing.SystemColors, also flag support allows bitwise operations
  • Java - explicit values, can be complex objects with private data
  • Python - named strings or integers
  • Typescript - named strings or integers
  • Haskell - string constants, kinda?
  • F# - named integers (but also supports unions)
  • Swift - string constants, kinda? also can contain simple or complex values, but not required
  • Rust - basically Swift
  • Kotlin - named strings or integers
  • Scala - explicit values (simple or complex)

The link above has a good summary, grouping these into three categories.

I think for JSON Schema, the primary takeaway is that they are all lists of values. Some languages allow more nuanced and powerful behaviors, but JSON Schema is more concerned with the data aspect than anything else. As such, I think the collection of names is the important part here, which all of these languages support.

The enum keyword could work, but it may not be sufficient if underlying values are desired. For example, in C#, an enum can support bitwise operations, but to enable that, it needs to generate a [Flags] attribute and set all of the underlying integer values to powers of 2. Then it can also create named bitwise combinations. If just using a list of names, there's no way to describe this intent for proper code generation.
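To make the flags case concrete, here is a minimal Python sketch using `enum.IntFlag`, which is roughly Python's analogue of a C# `[Flags]` enum. The `Permissions` type and its member names are invented for illustration; the point is that a bare list of names carries neither the power-of-two values nor the named combination.

```python
from enum import IntFlag


class Permissions(IntFlag):
    """Illustrative flags enum; the names here are hypothetical."""
    NONE = 0
    READ = 1      # 2**0
    WRITE = 2     # 2**1
    EXECUTE = 4   # 2**2
    # A named bitwise combination, similar to what a C# [Flags] enum allows.
    READ_WRITE = READ | WRITE


# Combining members with | yields the named combination.
combined = Permissions.READ | Permissions.WRITE
has_read = Permissions.READ in combined
has_execute = Permissions.EXECUTE in combined
```

A generator would need the underlying integers and the "this is a flags enum" intent from the schema in order to emit something like this.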

The "descriptive enum" approach using the anyOf keyword could work for this because we're just defining names and annotations for those names. However, the subschemas would be required to be uniform, and we'd probably still need another keyword to tell the codegen engine that we're defining an enum.

I recommend a new keyword (e.g. enumeration) to support this. It's still an array, but the items must either all be

  • just a string, in which case a simple set of names is generated (which seems to be universally supported)
  • an object with name and data properties which give more explicit information for more complex support

{
  "enumeration": [
    "HEARTS",
    "DIAMONDS",
    "CLUBS",
    "SPADES"
  ]
}

{
  "enumeration": [
    { "name": "HEARTS", "data": 1 },
    { "name": "DIAMONDS", "data": 2 },
    { "name": "CLUBS", "data": 3 },
    { "name": "SPADES", "data": 4 }
  ]
}

The second case becomes more complicated because of the different support among languages (even just the ones surveyed) for underlying data. Most support integer values, but not all, while some only support integer values. Some support more complex underlying data, while others don't support any underlying data.

I think the only resolution to this is that the schema can provide support for more complex needs, and those languages that don't support it can do what they deem appropriate, most likely just creating a list of names.
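As a rough illustration of what a generator could do with the two forms, here is a Python sketch that builds an `Enum` from either a list of names or a list of name/data objects. The `build_enum` helper is hypothetical and not part of any existing tool; it assumes the items are uniform, as proposed above.

```python
from enum import Enum


def build_enum(type_name, enumeration):
    """Build a Python Enum from a hypothetical `enumeration` keyword value.

    Accepts a list of strings or a list of {"name", "data"} objects.
    Mixed lists are rejected because the proposal requires uniform items.
    """
    if all(isinstance(item, str) for item in enumeration):
        # Names only: use each name as its own value.
        members = {name: name for name in enumeration}
    elif all(isinstance(item, dict) for item in enumeration):
        # Object form: carry the underlying data through as the value.
        members = {item["name"]: item["data"] for item in enumeration}
    else:
        raise ValueError("enumeration items must be all strings or all objects")
    return Enum(type_name, members)


Suit = build_enum("Suit", ["HEARTS", "DIAMONDS", "CLUBS", "SPADES"])
NumberedSuit = build_enum("NumberedSuit", [
    {"name": "HEARTS", "data": 1},
    {"name": "DIAMONDS", "data": 2},
    {"name": "CLUBS", "data": 3},
    {"name": "SPADES", "data": 4},
])
```

A language without underlying values could ignore `data` entirely and fall back to the first branch.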

I also recommend the best practice of generating an "unknown" or "unset" enum value as the default.

In a validation context, the new enumeration keyword validates that the instance is either the string value of the item or the string value of the name of the item, whichever is defined.
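A minimal sketch of that validation rule in Python, assuming the keyword value is the array shown above. The function name `validate_enumeration` is made up for this example.

```python
def validate_enumeration(instance, enumeration):
    """Return True if `instance` is one of the names the keyword defines.

    For string items the allowed value is the string itself; for object
    items it is the string under "name". The "data" value is not consulted.
    """
    allowed = {
        item if isinstance(item, str) else item["name"]
        for item in enumeration
    }
    # Check the type first so unhashable instances never reach the set lookup.
    return isinstance(instance, str) and instance in allowed


names_only = ["HEARTS", "DIAMONDS", "CLUBS", "SPADES"]
with_data = [{"name": "HEARTS", "data": 1}, {"name": "SPADES", "data": 4}]
```

Under this reading, `"HEARTS"` passes against both forms, while the integer `1` fails even though it appears as `data`.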

Serialization

Another aspect to consider for enumerations is how they are serialized.

In C# circles there is often a debate as to whether an enum should be serialized by name or by the underlying integer. Historically, by integer is the default, which inevitably leads to someone adding a value in the middle of an enum, thereby changing the numbering for all the values that come after and screwing up deserialization of previously-serialized data.

The proposed solution to this is serializing by name, but that comes with its own risks, like name changes. Once a name is serialized somewhere, you pretty much need to support deserializing that name. As a result, spelling and other errors are forever persisted.
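The trade-off can be sketched in Python. The two classes below stand in for two versions of the same enum, with explicit values mimicking implicit sequential numbering; `SuitV1` and `SuitV2` are illustrative names only.

```python
import json
from enum import IntEnum


class SuitV1(IntEnum):
    HEARTS = 0
    DIAMONDS = 1
    CLUBS = 2


class SuitV2(IntEnum):
    HEARTS = 0
    SPADES = 1     # inserted in the middle, shifting everything after it
    DIAMONDS = 2
    CLUBS = 3


# Data written by the old version, once by integer and once by name.
by_value = json.dumps(SuitV1.DIAMONDS.value)
by_name = json.dumps(SuitV1.DIAMONDS.name)

# Read back with the new version.
restored_by_value = SuitV2(json.loads(by_value))  # silently the wrong member
restored_by_name = SuitV2[json.loads(by_name)]    # still the right member
```

Serializing by name survives the insertion, but a later rename of `DIAMONDS` would make the name lookup fail instead.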

Do we want to provide guidance on this topic since we're effectively using schemas to define the serialized format?

@karenetheridge
Member

I believe there was a proposal elsewhere to add a new annotation-only keyword that would sit adjacent to enum, to provide descriptions for use by code generators, document generators etc:

{
  "enum": ["foo", "bar", "baz"],
  "enumDescriptions": [
    "a foo thing",
    "a bar thing"
  ]
}

That way we wouldn't have to change the enum keyword, which has remained largely unchanged for the entire lifespan of JSON Schema.

@gregsdennis
Member Author

I think the problem with enum, specifically in this context, is that its values can be of any type. For this purpose, we specifically want to limit the values to strings because that's the common support among languages. The object form I proposed for enumeration is allowed simply to provide additional definition for those languages which support it.

That said, I wouldn't be opposed to adding a string restriction to enum (a vocabulary is allowed to add constraints to existing keywords) alongside a new "enum metadata" keyword.

@jdesrosiers
Member

I think JSON Schema's enum is a fundamentally different concept than what we find in most programming languages. Although it might sometimes be possible, I don't think codegen tooling should be trying to map one to another. I would expect codegen to create a type that represents a JSON Schema enum and use that rather than a native enum.

@gregsdennis
Member Author

I would expect codegen to create a type that represents a JSON Schema enum and use that rather than a native enum.

Can you expand on this?

I'm leaning more toward just not using enum in a codegen setting in favor of enumeration. I think it's a mistake to not support such a ubiquitous language feature.

@jdesrosiers
Member

Can you expand on this?

As a similar but simpler case, consider null. The JSON concept of null is not the same as the Java concept of null even though they're spelled the same. The solution I've proposed to that problem is for a code generator to provide a constant that represents a JSON null (maybe JSON.null) and use that where the schema says null is allowed. Trying to map the JSON null to the Java null doesn't work, so we have to model the JSON version of null separately from the Java version of null.

I'm suggesting the same kind of thing for enums. They aren't the same thing and should be modeled differently. Ideally, it would be modeled as a union of values, but most type systems don't support that. For those that don't, I might expect a class JsonSchemaEnum whose constructor takes an array of allowed values and whose setter rejects any values that aren't in that array. It can only do runtime value checks, but that's the best that can be done considering the limitations of the type system.
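One possible shape for such a class, sketched in Python. Only the name `JsonSchemaEnum` comes from the description above; the `value` property and the `_json_equal` helper are assumptions made for this example.

```python
def _json_equal(a, b):
    """Shallow JSON-style equality.

    Python treats True == 1, but JSON booleans and numbers are distinct,
    so bool-ness is compared first. Nested arrays and objects fall back
    to plain Python equality in this sketch.
    """
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    return a == b


class JsonSchemaEnum:
    """Holds a value restricted to a fixed list of allowed JSON values."""

    def __init__(self, allowed):
        self._allowed = list(allowed)
        self._value = None  # nothing assigned yet

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # Runtime check only: reject anything not in the allowed list.
        if not any(_json_equal(new_value, item) for item in self._allowed):
            raise ValueError(f"{new_value!r} is not one of {self._allowed!r}")
        self._value = new_value


field = JsonSchemaEnum(["foo", 1, None])
field.value = "foo"
```

Because `enum` values can be of any JSON type, the allowed list here mixes a string, a number, and null, which a native enum in most languages could not represent.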

I'm leaning more toward just not using enum in a codegen setting in favor of enumeration.

I'd very much like to avoid people having to define things differently when in a codegen context than when in a validation context. Ideally, people should be able to use the same schemas for both. If we need to provide more context for code generation it should be in the form of annotations that provide additional context to a standard schema.

I think it's a mistake to not support such a ubiquitous language feature.

I think about this differently. JSON Schema describes JSON. JSON doesn't have a concept of enums, so I don't expect enums to necessarily be generated by a code generator. It would be fine if it fit, but it doesn't. For example, it would be fine to generate a Date type for a string with format: "date" instead of a string. Those concepts fit nicely so the mapping is fine.

However, I can see how that breaks down if the scope of this project is to also support generating schemas from types. Is that in scope here? Not having a way to represent enums would mean you couldn't generate a schema from a class that uses an enum.

@yordis

yordis commented Jun 9, 2023

After many years of messing around with code-gen tools and writing one myself, I found two things that I would definitely love to have:

  1. Identity is given to the values of the enums.
  2. Colocated descriptions for a given value of the enums.

@gregsdennis
Member Author

gregsdennis commented Jun 9, 2023

@jdesrosiers you're still coming at this from a JSON-Schema-first approach: trying to fit concepts that exist in JSON Schema into a programming language.

What we need to do is the opposite: try to find how to represent known programming language concepts in JSON Schema. Then this vocabulary only needs to support that subset of JSON Schema.

The problem with JSON-Schema-first is that we already know there is a multitude of things that JSON Schema can represent that don't make sense in programming languages. Trying to invent schema constructs and then jamming them into a programming language is wrong.

Start with the languages. Enumerations are a thing that just about every language does. We need to support that. The question is how we support it.

JSON Schema describes JSON.

While this is true, it hasn't stopped people and entire specifications (*cough* OpenAPI) from using it as model definition and code generation. The entire purpose of this vocabulary is to fill that gap. Stating that this isn't what JSON Schema is designed for doesn't help.

However, I can see how that breaks down if the scope of this project is to also support generating schemas from types. Is that in scope here?

Yes, I expect this vocab would define (or at least inform) the interface between JSON Schema and languages, both ways. I would love to see round-trip functionality where you start with a schema or a type, generate the other, then generate back to get the original.

I'd very much like to avoid people having to define things differently when in a codegen context than when in a validation context.

As proposed, enumeration does have validation behavior, and it's pretty much exactly the same as enum. It just expresses the constraint differently in order to support codegen. If we need, additional linting rules can be created to encourage people to use enumeration when this vocabulary is included. If the vocabulary isn't included, then enumeration isn't defined, and their schema will be invalid.

@jdesrosiers
Member

you're still coming at this from a JSON-Schema-first approach: trying to fit concepts that exist in JSON Schema into a programming language. What we need to do is the opposite: try to find how to represent known programming language concepts in JSON Schema. Then this vocabulary only needs to support that subset of JSON Schema.

I'm still unconvinced that what you're describing is the right way to approach this. I am very strongly against defining a subset of JSON Schema, or worse, an alternative dialect. I think people should be able to use the same schemas for codegen as they do for validation, and they should not be limited in what validation features they can use because they also want to use the schema for codegen. I think the result of this effort should be a vocabulary of annotations that sits on top of full-featured JSON Schema where the annotations inform the codegen process.

The problem with JSON-Schema-first is that we already know there is a multitude of things that JSON Schema can represent that don't make sense in programming languages. Trying to invent schema constructs and then jamming them into a programming language is wrong.

I'm very much not arguing for jamming JSON Schema concepts into static type system features. JSON Schema does a lot of things that don't fit in a static type system. That's ok. I expect generated types to include only the things a type system can express. I don't expect the parts that don't fit to be jerry-rigged in somehow.

As an example, by default, I'd expect a code generator that encounters an allOf to combine the constraints defined in the sub-schemas and produce one type rather than one for each sub-schema and combining them somehow with inheritance or whatever. That's because allOf is a JSON Schema concept that doesn't map to a static type system and we shouldn't force it using some convention that might not always be right.

However, sometimes we do intend an inheritance-like relationship with an allOf. That's where this vocabulary comes in. We would introduce an annotation that tells the code generator that this allOf should be generated as inheritance.

@yordis

yordis commented Jun 9, 2023

I would recommend messing around with the most popular OpenAPI code-gen tool (I did that 😄) to realize that they are creating extensions to compensate for the shortcomings of the JSON Schema spec.

Honestly, @jdesrosiers's responses are going over my head. Still, there is a need for identities and improved docs in order to use JSON Schemas as the source of truth, or to code-gen properly for most programming languages out there.

The same as having some sort of schema_id instead of using title, but that is something else.

@gregsdennis
Member Author

I would recommend messing around with the most popular OpenAPI code-gen tool

I've actually just this week reached out to a couple to invite them to this conversation.

I'm still unconvinced that what you're describing is the right way to approach this.

We need another issue to discuss this. I'll open one.

@xiaoxiangmoe

In our company's swagger tool, we may use

{
  "type": "number",
  "enum": [1, 2, 3, 4],
  "x-enumeration": [
    { "name": "HEARTS", "data": 1, "description": "" },
    { "name": "DIAMONDS", "data": 2, "description": "" },
    { "name": "CLUBS", "data": 3, "description": "" },
    { "name": "SPADES", "data": 4, "description": "" }
  ]
}

@gregsdennis
Member Author

@xiaoxiangmoe thanks for the info. x-enumeration is defined by swagger and other pre-OpenAPI-3.1 specs. However, it's not going to work going forward.

One of the decisions we've recently made is how ad-hoc annotation-only keywords are handled. The new spec will allow any unknown keyword that starts with x-. That may sound like it will work, but part of that decision was disallowing x- keywords in vocabularies in order to avoid collisions between vocab keywords and custom annotations.

Since we're building a vocabulary here, we can't use x-.

That said, using the data content or something similar is likely what we'll go for. The difficulty is trying to align with what languages support so that we can go back and forth between languages and schemas. Imagine generating a schema from Typescript, then trying to generate C code from the schema.

@gregsdennis
Member Author

gregsdennis commented Sep 14, 2023

Related: json-schema-org/json-schema-spec#1386 (deprecating individual enum values)

@to-miz

to-miz commented Dec 30, 2023

For my uses of JSON Schema, I don't use enum at all, because it is too restrictive (changing enums is a breaking change and annotating them is a pain). I do something similar to how the schemas for glTF define their extensible enums:

"anyOf": [
    {
        "const": 34962,
        "description": "ARRAY_BUFFER",
        "type": "integer"
    },
    {
        "const": 34963,
        "description": "ELEMENT_ARRAY_BUFFER",
        "type": "integer"
    },
    {
        "type": "integer"
    }
]

Although I would prefer using title instead of description here.

I think JSON Schema already has good descriptive patterns for enums, if you explicitly enforce not actually using the enum keyword. I would prefer general guidelines for writing code-gen friendly JSON Schemas over introducing another way to define enums.
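To show how a generator might consume this pattern, here is a small Python sketch that pulls name/value pairs out of an anyOf of const subschemas. The `named_constants` helper is hypothetical; it prefers `title` and falls back to `description` for the name, and it skips the open-ended branch that has no `const`.

```python
def named_constants(schema):
    """Collect {name: const} pairs from an anyOf-of-const schema."""
    names = {}
    for sub in schema.get("anyOf", []):
        if "const" not in sub:
            # The catch-all branch keeps the enum extensible; it has no name.
            continue
        label = sub.get("title") or sub.get("description")
        if label is not None:
            names[label] = sub["const"]
    return names


buffer_target = {
    "anyOf": [
        {"const": 34962, "description": "ARRAY_BUFFER", "type": "integer"},
        {"const": 34963, "description": "ELEMENT_ARRAY_BUFFER", "type": "integer"},
        {"type": "integer"},
    ]
}
constants = named_constants(buffer_target)
```

A generator would still need some convention or annotation to know that this anyOf is meant to be read as an enum at all.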
