Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cases API + 4 implementations (Pascal, Camel, Kebab, Snake) #450

Open
wants to merge 54 commits into
base: master
Choose a base branch
from

Conversation

theshoeshiner
Copy link

This is a more formal API for parsing and formatting strings of various "Cases".

Similar logic is present in org.apache.commons.text.CaseUtils as well as #360 to add two additional case functions, but that code is based around the initial string having delimiters, and cannot translate between cases. This API aims to be slightly more formal, expecting the inputs to the parse method to abide by the syntax of the case. No assumptions (other than the Case syntax) are made about the inputs, and removal of unwanted characters is left to the user. Case classes can be sub classed as necessary to support that.

The Case interface exposes two methods...

String format(Iterable<String> tokens)
List<String> parse(String string)

with parse() returning a List of String tokens, which can then be passed to another Case instance format() method. Allowing users to change the case of a string e.g. CamelCase.INSTANCE.format(KebabCase.INSTANCE.parse("my-kebab-string")). I followed the pattern of other commons-text classes in using code points rather than primitive chars (I assume to support UTF32).

@theshoeshiner
Copy link
Author

theshoeshiner commented Aug 18, 2023

Also meant to add... As an experiment I implemented a Case that uses the CaseUtils logic to parse tokens such that one could theoretically deprecate that class and use this api like CamelCase.INSTANCE.format(CaseUtilsCase.INSTANCE.parse("...")) ) But I have omitted that from the PR unless feedback indicates that deprecation is warranted. The CaseUtilsCase would throw an Unsupported exception upon calling format(...) because the logic of that code inherently has no implied format, and can throw away significant amounts of input such that reproducing any given input format would be impossible. i.e. it acts more like a "cleanup" function than a true "Case".

src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
src/test/java/org/apache/commons/text/cases/CasesTest.java Outdated Show resolved Hide resolved
src/test/java/org/apache/commons/text/cases/CasesTest.java Outdated Show resolved Hide resolved
src/test/java/org/apache/commons/text/cases/CasesTest.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/SnakeCase.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/SnakeCase.java Outdated Show resolved Hide resolved
@theshoeshiner
Copy link
Author

@elharo
In regards to a couple of the above changes, is there a reason we want to prevent these classes from being extended? I can imagine many scenarios where one would want to add additional logic to them.

@garydgregory
Copy link
Member

IMO: You should follow the YAGNI principle and only make public and protected what you must, especially in a first cut.

Copy link

@elharo elharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I like this proposal in general. I'd love to get some beta feedback before we lock everything in though. Does Apache commons have any sort of beta or incubation phase for APIs we can use?

  2. Is there any good source of typical strings in other formats we could use for more extensive test coverage? E.g. test cases for a similar library in a different language?

src/main/java/org/apache/commons/text/cases/Case.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved
@garydgregory
Copy link
Member

garydgregory commented Sep 10, 2023

"Does Apache commons have any sort of beta or incubation phase for APIs we can use?"

No. The best we can do is to issue a release with a -beta1 or -alpha1 or -M1 version suffix like 1.12.0-beta1. See the Commons FileUpload and Commons Imaging components for recent examples.

@elharo
Copy link

elharo commented Sep 10, 2023

Guava has @Beta for APIs they want to push but not commit to yet. Perhaps we can add something similar?

@garydgregory
Copy link
Member

Guava has @Beta for APIs they want to push but not commit to yet. Perhaps we can add something similar?

This feels unwise to me and I would discourage it. Based on what I've seen in Commons, once something is public or protected, it will be used and complained about when changed.

This would also make JApiCmp more complex to configure: How do you tell it to ignore classes, methods, and fields annotated with @foo?. I certainly would not want to individually list each in-flux element. Same issue with using a ".beta." package name.

If you put an Alpha, Beta or M label on the whole release then you make it obvious from the onset that you are offering an unstable platform, which is the approach Commons has already taken with Commons FileUpload and Commons Imaging. Using a Beta release requires buy-in from the user that there are unstable elements, as opposed to using a normal full release that happens to have unstable parts in it. Commons has been very good at maintaining binary compatibility in normal releases, so I think it wise to keep maintaining this expectation.

@elharo
Copy link

elharo commented Sep 10, 2023

Yes people will complain, but if it was explicitly marked beta, you tell them to pound sand (more politely).

The problem with a complete beta release in a case like this is there's so much more in commons that isn't beta. We need a way to ship stable features while also experimenting with less stable APIs.

Binary compatibility would be maintained for anyone who doesn't use explicitly Beta APIs.

I'm not sure what JApiCmp is, but this approach has been tried and proven in Guava and other Google open source Java libraries.

Probably worth moving this discussion to the list.

@theshoeshiner
Copy link
Author

@elharo @garydgregory

I converted the cases API to use StringTokenizer as suggested by @elharo. In general this works out fine and greatly simplifies those classes, but necessitates a couple of other changes that I'd request your comments on...

  1. org.apache.commons.text.TokenStringifier - This is essentially the inverse of the StringTokenizer - to allow the Case implementations to control output formatting. It's a fairly simple class but I don't see anything else that really fits the bill. It includes the TokenFormatter interface, which cases can implement to customize the formatting:
public interface TokenFormatter {
    String format(char[] prior, int tokenIndex, char[] token);
}

If there's an alternative to using these new classes I'm all for it, just tried to keep it simple but clear.

  1. org.apache.commons.text.matcher.AbstractStringMatcher.UppercaseMatcher - This new matcher powers the Pascal/Camel case implementations. The wrinkle is that the Matcher API doesn't really support matching on dynamic length patterns. i.e. the size() method is expected to return the same value regardless of the "match". This doesn't work with unicode characters because not every code point that fits the matcher has the same width. For right now im simply throwing UnsupportedOperationException from that method, since it's not needed for the Cases API. This means that matcher wouldn't work if used with the StringSubstitutorReader (which I think uses the size() method to control buffering?).

@speters33w
Copy link

Honestly, I'm not sure I'm in favor of splitting CaseUtils into a bunch of separate classes. Actually, I'm sure I'm not in favor of this.
This will be confusing to a user.
import CaseUtils;
CaseUtils.toKebabCase(str)
CaseUtils.toCamelCase(str)
That makes it easy.

Just sayin'

@theshoeshiner
Copy link
Author

theshoeshiner commented Apr 16, 2024

@speters33w

That suffers from all the same problems of the original implementation. This PR wasn't just to refactor the existing features, it was to enhance them to allow for other cases. The old feature can't handle certain scenarios, and doesn't allow the user to parse the string back into it tokens.

e.g. How would one convert from a camel cased string to a kebab case string using the methods you mentioned?

@speters33w
Copy link

@theshoeshiner

Funny... I was just starting to play with my clone of your project, I want to run some tests using:
𝔗𝔥𝔢 𝔮𝔲𝔦𝔠𝔨 𝔟𝔯𝔬𝔴𝔫 𝔣𝔬𝔵
𝕋𝕙𝕖 𝕢𝕦𝕚𝕔𝕜 𝕓𝕣𝕠𝕨𝕟 𝕗𝕠𝕩
𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱
𝑇ℎ𝑒 𝑞𝑢𝑖𝑐𝑘 𝑏𝑟𝑜𝑤𝑛 𝑓𝑜𝑥
𝒯𝒽ℯ 𝓆𝓊𝒾𝒸𝓀 𝒷𝓇ℴ𝓌𝓃 𝒻ℴ𝓍
𝚃𝚑𝚎 𝚚𝚞𝚒𝚌𝚔 𝚋𝚛𝚘𝚠𝚗 𝚏𝚘𝚡

But no, you can't convert back and forth to and from camelCase without tokens.

@speters33w
Copy link

Well, at least this one fails, it's the first test I ran

The text is 𝑡ℎ𝑒_𝑞𝑢𝑖𝑐𝑘_𝑏𝑟𝑜𝑤𝑛_𝑓𝑜𝑥

        assertFormatAndParse(SnakeCase.INSTANCE, "\uD835\uDC61\uD835\uDC55\uD835\uDC52_\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58_\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B_\uD835\uDC53\uD835\uDC5C\uD835\uDC65",
                Arrays.asList("\uD835\uDC61\uD835\uDC55\uD835\uDC52", "\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58", "\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B", "\uD835\uDC53\uD835\uDC5C\uD835\uDC65"));

AlphabetConverter returns java: illegal character: '\ufffd'

@theshoeshiner
Copy link
Author

theshoeshiner commented Apr 16, 2024

That line of code runs fine for me. Can you send the stacktrace and/or the full unit test method?

@speters33w
Copy link

It's not giving me a stack trace, it's giving a build failure. I used Zulu 21 then Temurin 21 under IntelliJ Idea Ultimate

Here is the method:

    @Test
    public void testUtf32() {
        assertFormatAndParse(KebabCase.INSTANCE, "\uD800\uDF00-\uD800\uDF01\uD800\uDF14-\uD800\uDF02\uD800\uDF03",
                Arrays.asList("\uD800\uDF00", "\uD800\uDF01\uD800\uDF14", "\uD800\uDF02\uD800\uDF03"));
        assertFormatAndParse(SnakeCase.INSTANCE, "\uD800\uDF00_\uD800\uDF01\uD800\uDF14_\uD800\uDF02\uD800\uDF03",
                Arrays.asList("\uD800\uDF00", "\uD800\uDF01\uD800\uDF14", "\uD800\uDF02\uD800\uDF03"));

        assertFormatAndParse(PascalCase.INSTANCE, "A\uD800\uDF00B\uD800\uDF01\uD800\uDF14C\uD800\uDF02\uD800\uDF03",
                Arrays.asList("A\uD800\uDF00", "B\uD800\uDF01\uD800\uDF14", "C\uD800\uDF02\uD800\uDF03"));
        assertFormatAndParse(CamelCase.INSTANCE, "a\uD800\uDF00B\uD800\uDF01\uD800\uDF14C\uD800\uDF02\uD800\uDF03",
                Arrays.asList("a\uD800\uDF00", "B\uD800\uDF01\uD800\uDF14", "C\uD800\uDF02\uD800\uDF03"));

        assertFormatAndParse(SnakeCase.INSTANCE, "\uD835\uDC61\uD835\uDC55\uD835\uDC52_\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58_\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B_\uD835\uDC53\uD835\uDC5C\uD835\uDC65",
                Arrays.asList("\uD835\uDC61\uD835\uDC55\uD835\uDC52", "\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58", "\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B", "\uD835\uDC53\uD835\uDC5C\uD835\uDC65"));

    }
C:\OneDrive\Java\projects\FromGitHub\commons-text\src\main\java\org\apache\commons\text\AlphabetConverter.java
java: illegal character: '\ufffd'

Maybe I fat-fingered something wrong into the method

@theshoeshiner
Copy link
Author

Yea that's something wrong with your build. The test can't be failing because its not being run to begin with. That class wasn't changed as part of this PR so I'm not sure what is causing the illegal character issue.

@speters33w
Copy link

Yeah, I don't know what I did, but it works, now, including my new test.

Process finished with exit code 0

You might need to rebase your project and do a push, I don't know if it has to do with 1.12RC, or something, but I tried re-cloning from this PR, and none of your work was in it. Then I tried cloning directly from your repository, same thing, then I tried downloading the zip file from your repository and all your work was gone.

I had to manually copy cases directory, the matcher files you edited, all the tokenizer stuff, etc. from the one I cloned a couple days ago into the new clone, and git thinks they are new and changed files. The release notes in what I downloaded say 1.11.0 Weird.

This PR and my PR 528 are not incompatible with each other. They serve different purposes. Mine is designed to get everything into lower ASCII and spit out the string, while yours is designed to keep the Unicode characters intact as tokens for future manipulation.

I don't think mine can handle UTF32 characters, so 𝑡ℎ𝑒_𝑞𝑢𝑖𝑐𝑘_𝑏𝑟𝑜𝑤𝑛_𝑓𝑜𝑥 would fail as an input on 528. I'll have to try it. I haven't, yet.

I frittered away my morning playing with this pull (most of it trying to re-clone) so I will have to revisit it because I have a bunch of other stuff to do today. I'll probably re-visit Sunday.

@speters33w
Copy link

@theshoeshiner

I got everything re-merged into the initial clone, and git likes it. It finds differences in StringTokenizerTest, but that's it.

I think I must have accidentally pasted utf32 characters directly into that AlphabetConverter (which I wasn't editing, but I was nosing around in there), and Java didn't know what to do with it, so the compile burped.

OK, I'm impressed. This passes:

        assertParse(CamelCase.INSTANCE, "\uD835\uDE96\uD835\uDEA2\uD835\uDE72\uD835\uDE8A\uD835\uDE96\uD835\uDE8E\uD835\uDE95\uD835\uDE85\uD835\uDE8A\uD835\uDE9B\uD835\uDE92\uD835\uDE8A\uD835\uDE8B\uD835\uDE95\uD835\uDE8E",
                Arrays.asList("\uD835\uDE96\uD835\uDEA2", "\uD835\uDE72\uD835\uDE8A\uD835\uDE96\uD835\uDE8E\uD835\uDE95", "\uD835\uDE85\uD835\uDE8A\uD835\uDE9B\uD835\uDE92\uD835\uDE8A\uD835\uDE8B\uD835\uDE95\uD835\uDE8E"));

The string is: "𝚖𝚢𝙲𝚊𝚖𝚎𝚕𝚅𝚊𝚛𝚒𝚊𝚋𝚕𝚎". The tokens are "𝚖𝚢", "𝙲𝚊𝚖𝚎𝚕", "𝚅𝚊𝚛𝚒𝚊𝚋𝚕𝚎".

You'd only see a variable like this in something like Julia, but danged if your API didn't work like a charm on it.

I'm liking this.

@theshoeshiner
Copy link
Author

theshoeshiner commented Apr 16, 2024

The provided implementations are fairly simple (even more so after relying on StringTokenizer) and aim to impose as few restrictions as possible, relying on the java.lang.Character class to make most case determinations.

Using the API is a bit more verbose than a single util method call, but it comes with more than a few advantages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants