Cases API + 4 implementations (Pascal, Camel, Kebab, Snake) #450

theshoeshiner · 2023-08-15T18:10:55Z

This is a more formal API for parsing and formatting strings of various "Cases".

Similar logic is present in org.apache.commons.text.CaseUtils as well as #360 to add two additional case functions, but that code is based around the initial string having delimiters, and cannot translate between cases. This API aims to be slightly more formal, expecting the inputs to the parse method to abide by the syntax of the case. No assumptions (other than the Case syntax) are made about the inputs, and removal of unwanted characters is left to the user. Case classes can be sub classed as necessary to support that.

The Case interface exposes two methods...

String format(Iterable<String> tokens)
List<String> parse(String string)

with parse() returning a List of String tokens, which can then be passed to another Case instance format() method. Allowing users to change the case of a string e.g. CamelCase.INSTANCE.format(KebabCase.INSTANCE.parse("my-kebab-string")). I followed the pattern of other commons-text classes in using code points rather than primitive chars (I assume to support UTF32).

theshoeshiner · 2023-08-18T12:52:55Z

Also meant to add... As an experiment I implemented a Case that uses the CaseUtils logic to parse tokens such that one could theoretically deprecate that class and use this api like CamelCase.INSTANCE.format(CaseUtilsCase.INSTANCE.parse("...")) ) But I have omitted that from the PR unless feedback indicates that deprecation is warranted. The CaseUtilsCase would throw an Unsupported exception upon calling format(...) because the logic of that code inherently has no implied format, and can throw away significant amounts of input such that reproducing any given input format would be impossible. i.e. it acts more like a "cleanup" function than a true "Case".

src/main/java/org/apache/commons/text/cases/CamelCase.java

src/test/java/org/apache/commons/text/cases/CasesTest.java

src/main/java/org/apache/commons/text/cases/SnakeCase.java

theshoeshiner · 2023-08-21T11:06:18Z

@elharo
In regards to a couple of the above changes, is there a reason we want to prevent these classes from being extended? I can imagine many scenarios where one would want to add additional logic to them.

garydgregory · 2023-08-21T11:28:24Z

IMO: You should follow the YAGNI principle and only make public and protected what you must, especially in a first cut.

elharo

I like this proposal in general. I'd love to get some beta feedback before we lock everything in though. Does Apache commons have any sort of beta or incubation phase for APIs we can use?
Is there any good source of typical strings in other formats we could use for more extensive test coverage? E.g. test cases for a similar library in a different language?

src/main/java/org/apache/commons/text/cases/Case.java

src/main/java/org/apache/commons/text/cases/CamelCase.java

src/main/java/org/apache/commons/text/cases/PascalCase.java

src/main/java/org/apache/commons/text/cases/CamelCase.java

garydgregory · 2023-09-10T11:23:15Z

"Does Apache commons have any sort of beta or incubation phase for APIs we can use?"

No. The best we can do is to issue a release with a -beta1 or -alpha1 or -M1 version suffix like 1.12.0-beta1. See the Commons FileUpload and Commons Imaging components for recent examples.

elharo · 2023-09-10T12:11:06Z

Guava has @Beta for APIs they want to push but not commit to yet. Perhaps we can add something similar?

garydgregory · 2023-09-10T12:34:53Z

Guava has @Beta for APIs they want to push but not commit to yet. Perhaps we can add something similar?

This feels unwise to me and I would discourage it. Based on what I've seen in Commons, once something is public or protected, it will be used and complained about when changed.

This would also make JApiCmp more complex to configure: How do you tell it to ignore classes, methods, and fields annotated with @foo?. I certainly would not want to individually list each in-flux element. Same issue with using a ".beta." package name.

If you put an Alpha, Beta or M label on the whole release then you make it obvious from the onset that you are offering an unstable platform, which is the approach Commons has already taken with Commons FileUpload and Commons Imaging. Using a Beta release requires buy-in from the user that there are unstable elements, as opposed to using a normal full release that happens to have unstable parts in it. Commons has been very good at maintaining binary compatibility in normal releases, so I think it wise to keep maintaining this expectation.

elharo · 2023-09-10T12:55:26Z

Yes people will complain, but if it was explicitly marked beta, you tell them to pound sand (more politely).

The problem with a complete beta release in a case like this is there's so much more in commons that isn't beta. We need a way to ship stable features while also experimenting with less stable APIs.

Binary compatibility would be maintained for anyone who doesn't use explicitly Beta APIs.

I'm not sure what JApiCmp is, but this approach has been tried and proven in Guava and other Google open source Java libraries.

Probably worth moving this discussion to the list.

src/main/java/org/apache/commons/text/cases/CamelCase.java

This reverts commit 8833ae2.

theshoeshiner · 2023-11-28T23:22:40Z

@elharo @garydgregory

I converted the cases API to use StringTokenizer as suggested by @elharo. In general this works out fine and greatly simplifies those classes, but necessitates a couple of other changes that I'd request your comments on...

org.apache.commons.text.TokenStringifier - This is essentially the inverse of the StringTokenizer - to allow the Case implementations to control output formatting. It's a fairly simple class but I don't see anything else that really fits the bill. It includes the TokenFormatter interface, which cases can implement to customize the formatting:

public interface TokenFormatter {
    String format(char[] prior, int tokenIndex, char[] token);
}

If there's an alternative to using these new classes I'm all for it, just tried to keep it simple but clear.

org.apache.commons.text.matcher.AbstractStringMatcher.UppercaseMatcher - This new matcher powers the Pascal/Camel case implementations. The wrinkle is that the Matcher API doesn't really support matching on dynamic length patterns. i.e. the size() method is expected to return the same value regardless of the "match". This doesn't work with unicode characters because not every code point that fits the matcher has the same width. For right now im simply throwing UnsupportedOperationException from that method, since it's not needed for the Cases API. This means that matcher wouldn't work if used with the StringSubstitutorReader (which I think uses the size() method to control buffering?).

src/main/java/org/apache/commons/text/TokenFormatterFactory.java

src/main/java/org/apache/commons/text/cases/CamelCase.java

src/main/java/org/apache/commons/text/cases/Cases.java

src/main/java/org/apache/commons/text/matcher/AbstractStringMatcher.java

https://github.com/theshoeshiner/commons-text.git into cases

speters33w · 2024-04-16T04:35:27Z

Honestly, I'm not sure I'm in favor of splitting CaseUtils into a bunch of separate classes. Actually, I'm sure I'm not in favor of this.
This will be confusing to a user.
import CaseUtils;
CaseUtils.toKebabCase(str)
CaseUtils.toCamelCase(str)
That makes it easy.

Just sayin'

theshoeshiner · 2024-04-16T13:24:33Z

@speters33w

That suffers from all the same problems of the original implementation. This PR wasn't just to refactor the existing features, it was to enhance them to allow for other cases. The old feature can't handle certain scenarios, and doesn't allow the user to parse the string back into it tokens.

e.g. How would one convert from a camel cased string to a kebab case string using the methods you mentioned?

speters33w · 2024-04-16T13:32:59Z

@theshoeshiner

Funny... I was just starting to play with my clone of your project, I want to run some tests using:
𝔗𝔥𝔢 𝔮𝔲𝔦𝔠𝔨 𝔟𝔯𝔬𝔴𝔫 𝔣𝔬𝔵
𝕋𝕙𝕖 𝕢𝕦𝕚𝕔𝕜 𝕓𝕣𝕠𝕨𝕟 𝕗𝕠𝕩
𝐓𝐡𝐞 𝐪𝐮𝐢𝐜𝐤 𝐛𝐫𝐨𝐰𝐧 𝐟𝐨𝐱
𝑇ℎ𝑒 𝑞𝑢𝑖𝑐𝑘 𝑏𝑟𝑜𝑤𝑛 𝑓𝑜𝑥
𝒯𝒽ℯ 𝓆𝓊𝒾𝒸𝓀 𝒷𝓇ℴ𝓌𝓃 𝒻ℴ𝓍
𝚃𝚑𝚎 𝚚𝚞𝚒𝚌𝚔 𝚋𝚛𝚘𝚠𝚗 𝚏𝚘𝚡

But no, you can't convert back and forth to and from camelCase without tokens.

speters33w · 2024-04-16T14:00:11Z

Well, at least this one fails, it's the first test I ran

The text is 𝑡ℎ𝑒_𝑞𝑢𝑖𝑐𝑘_𝑏𝑟𝑜𝑤𝑛_𝑓𝑜𝑥

        assertFormatAndParse(SnakeCase.INSTANCE, "\uD835\uDC61\uD835\uDC55\uD835\uDC52_\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58_\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B_\uD835\uDC53\uD835\uDC5C\uD835\uDC65",
                Arrays.asList("\uD835\uDC61\uD835\uDC55\uD835\uDC52", "\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58", "\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B", "\uD835\uDC53\uD835\uDC5C\uD835\uDC65"));

AlphabetConverter returns java: illegal character: '\ufffd'

theshoeshiner · 2024-04-16T14:04:47Z

That line of code runs fine for me. Can you send the stacktrace and/or the full unit test method?

speters33w · 2024-04-16T14:14:43Z

It's not giving me a stack trace, it's giving a build failure. I used Zulu 21 then Temurin 21 under IntelliJ Idea Ultimate

Here is the method:

    @Test
    public void testUtf32() {
        assertFormatAndParse(KebabCase.INSTANCE, "\uD800\uDF00-\uD800\uDF01\uD800\uDF14-\uD800\uDF02\uD800\uDF03",
                Arrays.asList("\uD800\uDF00", "\uD800\uDF01\uD800\uDF14", "\uD800\uDF02\uD800\uDF03"));
        assertFormatAndParse(SnakeCase.INSTANCE, "\uD800\uDF00_\uD800\uDF01\uD800\uDF14_\uD800\uDF02\uD800\uDF03",
                Arrays.asList("\uD800\uDF00", "\uD800\uDF01\uD800\uDF14", "\uD800\uDF02\uD800\uDF03"));

        assertFormatAndParse(PascalCase.INSTANCE, "A\uD800\uDF00B\uD800\uDF01\uD800\uDF14C\uD800\uDF02\uD800\uDF03",
                Arrays.asList("A\uD800\uDF00", "B\uD800\uDF01\uD800\uDF14", "C\uD800\uDF02\uD800\uDF03"));
        assertFormatAndParse(CamelCase.INSTANCE, "a\uD800\uDF00B\uD800\uDF01\uD800\uDF14C\uD800\uDF02\uD800\uDF03",
                Arrays.asList("a\uD800\uDF00", "B\uD800\uDF01\uD800\uDF14", "C\uD800\uDF02\uD800\uDF03"));

        assertFormatAndParse(SnakeCase.INSTANCE, "\uD835\uDC61\uD835\uDC55\uD835\uDC52_\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58_\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B_\uD835\uDC53\uD835\uDC5C\uD835\uDC65",
                Arrays.asList("\uD835\uDC61\uD835\uDC55\uD835\uDC52", "\uD835\uDC5E\uD835\uDC62\uD835\uDC56\uD835\uDC50\uD835\uDC58", "\uD835\uDC4F\uD835\uDC5F\uD835\uDC5C\uD835\uDC64\uD835\uDC5B", "\uD835\uDC53\uD835\uDC5C\uD835\uDC65"));

    }

C:\OneDrive\Java\projects\FromGitHub\commons-text\src\main\java\org\apache\commons\text\AlphabetConverter.java
java: illegal character: '\ufffd'

Maybe I fat-fingered something wrong into the method

theshoeshiner · 2024-04-16T14:17:44Z

Yea that's something wrong with your build. The test can't be failing because its not being run to begin with. That class wasn't changed as part of this PR so I'm not sure what is causing the illegal character issue.

speters33w · 2024-04-16T15:19:27Z

Yeah, I don't know what I did, but it works, now, including my new test.

Process finished with exit code 0

You might need to rebase your project and do a push, I don't know if it has to do with 1.12RC, or something, but I tried re-cloning from this PR, and none of your work was in it. Then I tried cloning directly from your repository, same thing, then I tried downloading the zip file from your repository and all your work was gone.

I had to manually copy cases directory, the matcher files you edited, all the tokenizer stuff, etc. from the one I cloned a couple days ago into the new clone, and git thinks they are new and changed files. The release notes in what I downloaded say 1.11.0 Weird.

This PR and my PR 528 are not incompatible with each other. They serve different purposes. Mine is designed to get everything into lower ASCII and spit out the string, while yours is designed to keep the Unicode characters intact as tokens for future manipulation.

I don't think mine can handle UTF32 characters, so 𝑡ℎ𝑒_𝑞𝑢𝑖𝑐𝑘_𝑏𝑟𝑜𝑤𝑛_𝑓𝑜𝑥 would fail as an input on 528. I'll have to try it. I haven't, yet.

I frittered away my morning playing with this pull (most of it trying to re-clone) so I will have to revisit it because I have a bunch of other stuff to do today. I'll probably re-visit Sunday.

speters33w · 2024-04-16T19:30:13Z

@theshoeshiner

I got everything re-merged into the initial clone, and git likes it. It finds differences in StringTokenizerTest, but that's it.

I think I must have accidentally pasted utf32 characters directly into that AlphabetConverter (which I wasn't editing, but I was nosing around in there), and Java didn't know what to do with it, so the compile burped.

OK, I'm impressed. This passes:

        assertParse(CamelCase.INSTANCE, "\uD835\uDE96\uD835\uDEA2\uD835\uDE72\uD835\uDE8A\uD835\uDE96\uD835\uDE8E\uD835\uDE95\uD835\uDE85\uD835\uDE8A\uD835\uDE9B\uD835\uDE92\uD835\uDE8A\uD835\uDE8B\uD835\uDE95\uD835\uDE8E",
                Arrays.asList("\uD835\uDE96\uD835\uDEA2", "\uD835\uDE72\uD835\uDE8A\uD835\uDE96\uD835\uDE8E\uD835\uDE95", "\uD835\uDE85\uD835\uDE8A\uD835\uDE9B\uD835\uDE92\uD835\uDE8A\uD835\uDE8B\uD835\uDE95\uD835\uDE8E"));

The string is: "𝚖𝚢𝙲𝚊𝚖𝚎𝚕𝚅𝚊𝚛𝚒𝚊𝚋𝚕𝚎". The tokens are "𝚖𝚢", "𝙲𝚊𝚖𝚎𝚕", "𝚅𝚊𝚛𝚒𝚊𝚋𝚕𝚎".

You'd only see a variable like this in something like Julia, but danged if your API didn't work like a charm on it.

I'm liking this.

theshoeshiner · 2024-04-16T19:52:33Z

The provided implementations are fairly simple (even more so after relying on StringTokenizer) and aim to impose as few restrictions as possible, relying on the java.lang.Character class to make most case determinations.

Using the API is a bit more verbose than a single util method call, but it comes with more than a few advantages.

initial implementation of Cases api

b64a0e0

elharo suggested changes Aug 20, 2023

View reviewed changes

daniel.watson added 5 commits August 21, 2023 07:09

spelling

0de50a2

spelling

47c244f

private util method, spelling

ff8c791

class description

31bd896

loop

f8cbd57

daniel.watson added 7 commits August 21, 2023 08:04

private/protected methods, spelling

d741a00

use ArrayList

67ccbb8

doc formatting

b3a3368

doc spelling

f98d277

@throws docs, spelling

76de865

checkstyle violations

de03d81

doc formatting

dd0394b

theshoeshiner requested a review from elharo September 6, 2023 11:13

elharo reviewed Sep 10, 2023

View reviewed changes

javadoc formatting

1695051

elharo suggested changes Sep 13, 2023

View reviewed changes

src/main/java/org/apache/commons/text/cases/CamelCase.java Outdated Show resolved Hide resolved

daniel.watson added 5 commits October 2, 2023 10:52

implemented unicode casing logic

1cb481f

order of if/else

8833ae2

Revert "order of if/else"

c328233

This reverts commit 8833ae2.

spelling

eff0529

grammar

b32035f

daniel.watson and others added 8 commits October 31, 2023 07:07

javadocs

7fc8770

renamed class

89eca00

added string matcher that matches on any Unicode upper case tokens

8aa6edb

allow setting whether to omit the delimiter from the token

8b2ced4

test omit delimiter flag

992ac13

converted cases api to use StringTokenizer and TokenStringifier logic

a50975c

throw unsupported from uppercase matcher

1406d30

licenses

6747348

theshoeshiner requested a review from elharo November 28, 2023 23:23

elharo reviewed Nov 29, 2023

View reviewed changes

theshoeshiner and others added 8 commits November 29, 2023 11:42

dont need to check when there is not enough room left for constant

61bf2e5

tests for token factory and stringifier

05422f1

formatting

58f05e3

spelling

5481f1d

licenses

25236f1

Merge branch 'apache:master' into cases

9e2bcd1

Merge branch 'master' of

ce5d2e2

https://github.com/theshoeshiner/commons-text.git into cases

corrected merge

f91a1a2

speters33w mentioned this pull request May 26, 2024

TEXT-217: Add SnakeCase Parsing #552

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cases API + 4 implementations (Pascal, Camel, Kebab, Snake) #450

Cases API + 4 implementations (Pascal, Camel, Kebab, Snake) #450

theshoeshiner commented Aug 15, 2023

theshoeshiner commented Aug 18, 2023 •

edited

theshoeshiner commented Aug 21, 2023

garydgregory commented Aug 21, 2023

elharo left a comment

garydgregory commented Sep 10, 2023 •

edited

elharo commented Sep 10, 2023

garydgregory commented Sep 10, 2023

elharo commented Sep 10, 2023

theshoeshiner commented Nov 28, 2023

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024 •

edited

speters33w commented Apr 16, 2024

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024 •

edited

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024

speters33w commented Apr 16, 2024

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024 •

edited

Cases API + 4 implementations (Pascal, Camel, Kebab, Snake) #450

Are you sure you want to change the base?

Cases API + 4 implementations (Pascal, Camel, Kebab, Snake) #450

Conversation

theshoeshiner commented Aug 15, 2023

theshoeshiner commented Aug 18, 2023 • edited

theshoeshiner commented Aug 21, 2023

garydgregory commented Aug 21, 2023

elharo left a comment

Choose a reason for hiding this comment

garydgregory commented Sep 10, 2023 • edited

elharo commented Sep 10, 2023

garydgregory commented Sep 10, 2023

elharo commented Sep 10, 2023

theshoeshiner commented Nov 28, 2023

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024 • edited

speters33w commented Apr 16, 2024

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024 • edited

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024

speters33w commented Apr 16, 2024

speters33w commented Apr 16, 2024

theshoeshiner commented Apr 16, 2024 • edited

theshoeshiner commented Aug 18, 2023 •

edited

garydgregory commented Sep 10, 2023 •

edited

theshoeshiner commented Apr 16, 2024 •

edited

theshoeshiner commented Apr 16, 2024 •

edited

theshoeshiner commented Apr 16, 2024 •

edited