Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behaviour of string(:printable) with min_length opt given #99

Open
volnyvolnyvolny opened this issue Apr 18, 2018 · 5 comments
Open
Labels

Comments

@volnyvolnyvolny
Copy link

volnyvolnyvolny commented Apr 18, 2018

Hi! I'm using stream_data to test my Ecto schemas, but I ran into a small problem concerning min_length opt given to StreamData.string(:printable).

property "string(:printable) min_length" do
  # While this works fine, giving 41,41,41,...,44,45,42,43,44,47,44,...
  check all s <- string(:alphanumeric, min_length: 41) do
    IO.inspect(String.length(s))
    assert String.length(s) >= 41
  end

  # This returns 41,41,40,40,...39,40.
  check all s <- string(:printable, min_length: 41) do
    IO.inspect(String.length(s))
    assert String.length(s) >= 41
  end
end

Error:

1) property string string(:printable) min_length (StringTest)
   test/string_test.exs:6
   Failed with generated values (after 4 successful runs):
 
       * Clause:    s <- string(:printable, min_length: 41)
         Generated: "𐀀𐀀𐀀𐀀𐀀ꢴ񭺍񅂤𜚸򫱞򑐑񄧸󫋎󣪣򵧸򙠘𯜬󽳷󁎫򰹡󑳃򭲠񗪙񀌾𐖜󠱺񅖬󔬺񽇾𬆟񘂑񦪥򖀣򍿖񪲲󙭠𛢢񉗒􊠮񏱉򍄓"
 
   Assertion with >= failed
   code:  assert String.length(s) >= 41
   left:  40
   right: 41
   stacktrace:
     test/string_test.exs:15: anonymous fn/2 in StringTest."property string string(:printable) min_length"/1
     (stream_data) lib/stream_data.ex:2063: StreamData.shrink_failure/6
     (stream_data) lib/stream_data.ex:2027: StreamData.check_all/7
     test/string_test.exs:13: (test)

PS> Off topic. Don't want to spam with feature requests, but is it possible to add invert function that reverts generators? So invert(string(:alphanumeric, min_length: 10)) gives us a one_of([string(:alphanumeric, max_length: 9), string(:printable) |> filter(&(not String.match?(&1, ~r/^[0-9A-Za-z]+$/)))])? At the first glance It is always set-theoretically possible and may be very handy.

@whatyouhide
Copy link
Owner

I'll investigate but my guess is String.length/1 which counts codepoints (while :min_length is probably for bytes). If you want you can try to see if byte_size instead of String.length behaves correctly.

PS> Off topic

Please open another issue for discussing this.

@volnyvolnyvolny
Copy link
Author

volnyvolnyvolny commented Apr 18, 2018

I thought so too at first, but here is what I got:

string_length: 40
byte_size: 160

string_length: 39
byte_size: 160

string_length: 41
byte_size: 163

string_length: 41
byte_size: 161

string_length: 40
byte_size: 160

So, for 40-symbol codepoints string it is around 160 bytes.

property "string(:printable) min_length" do
   # While this works fine, giving 41,41,41,...,44,45,42,43,44,47,44,...
   check all s <- string(:alphanumeric, min_length: 41) do
     IO.inspect(String.length(s))
     assert String.length(s) >= 41
   end

   check all s <- string(:printable, min_length: 41) do
      IO.inspect(String.length(s), label: :string_length)
      IO.inspect(byte_size(s), label: :byte_size)
      assert String.length(s) >= 41
   end
end

@josevalim
Copy link
Contributor

josevalim commented Apr 19, 2018 via email

@whatyouhide
Copy link
Owner

The thing is that we are generating a list of :min_length elements, so that part is correct, the problem is that each element in the list can be multiple bytes long. For example, 0xFFFD is 3 bytes long. So our strategy is already "off" in some way, because we're not guaranteeing that String.length/1 will be more than :min_length (because of graphemes I suspect, as pointed out by @josevalim) but also we're saying that byte_size/1 will probably be much bigger because on average the codepoints we use are several bytes long. Not sure how to approach the problem yet.

@volnyvolnyvolny
Copy link
Author

volnyvolnyvolny commented Apr 19, 2018

Yes, as the @josevalim pointed it's because of the difference between counting codepoints and counting graphems.

codepoints_length: 41
string_length: 41
byte_size: 159

codepoints_length: 41
string_length: 41
byte_size: 162

(!)    codepoints_length: 41
(!)    string_length: 40
byte_size: 163


defp codepoints_length(<<_::utf8, rest::binary>>, acc), do: codepoints_length(rest, acc + 1)
defp codepoints_length(<<_, rest::binary>>, acc), do: codepoints_length(rest, acc + 1)
defp codepoints_length(<<>>, acc), do: acc

property "string(:printable) min_length" do
  check all s <- string(:printable, min_length: 41) do
    IO.inspect(codepoints_length(s, 0), label: :codepoints_length)
    IO.inspect(String.length(s), label: :string_length)
    IO.inspect(byte_size(s), label: :byte_size)
    assert String.length(s) >= 41
  end
end

For me, personally, this problem arose when I tested my Ecto schemas (similar to the example in #100). In validator I had validate_length(max: 40) and I was very-very surprised when string(:printable, min_length: 41) gave me an Ecto-valid schema. But the thing is that Ecto.validate_length has a special option count that defaults to :graphems (but could be :codepoints). So for me now the bypass is to provide count: :codepoints option to all the validators :)

Maybe we could add count option to the &StreamData.string/2 similar to the one used in &Ecto.Changeset.validate_length/2? This way user could generate codepoints-counted and graphems-counted strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants