yul literal value as struct of u256 + optional formatting hint #15112

clonker · 2024-05-16T10:34:02Z

Changed the definition of the solidity::yul::Literal to carry its value not as YulString but as LiteralValue struct, consisting of a u256 and an optional string representation hint. Upon converting from its data back to a string representation, it is first checked if the hint is not empty and in that case, whether value == parseValue(hint).

libyul/AsmJsonImporter.cpp

libyul/Utilities.cpp

Wanted to comment and not approve.

nikola-matic · 2024-05-22T12:10:43Z

libyul/AsmJsonImporter.cpp

 		);
+		unlimited = member(_node, "unlimited").get<bool>();


You'll need a changelog entry for this, since it's a visible change (i.e. user facing).

fair point, i postponed doing anything changelog related before it's clear that everyone agrees this is the right approach

cameel

Not a complete review yet. Halfway through reading it I started realizing that I thought I knew how the hint and the unlimited flag worked but I actually didn't. I'm more and more convinced that this flag is not exactly the right solution here and we need to change it a bit. It's not really a property of the literal itself. But see my comments for details.

And some style nitpicking :)

libyul/AST.h

libyul/Utilities.cpp

test/tools/yulInterpreter/Interpreter.cpp

libyul/AsmJsonConverter.cpp

libyul/AsmParser.h

libyul/AsmJsonImporter.cpp

cameel · 2024-05-27T19:38:19Z

libyul/AST.h

+	LiteralValue(Data const& _data, std::optional<std::string> const& _hint = std::nullopt, bool _unlimited = false):
+		m_data(_data), m_hint(_hint ? std::make_shared<std::string>(*_hint) : nullptr), m_unlimited(_unlimited) {}


Can we assert here that the hint matches the data if present?

formatLiteral() ignores m_data when m_hint is present so we must ensure that these two are always consistent.

And I'd recommend this style:

Suggested change

LiteralValue(Data const& _data, std::optional<std::string> const& _hint = std::nullopt, bool _unlimited = false):

m_data(_data), m_hint(_hint ? std::make_shared<std::string>(*_hint) : nullptr), m_unlimited(_unlimited) {}

LiteralValue(Data const& _data, std::optional<std::string> const& _hint = std::nullopt, bool _unlimited = false):

m_data(_data),

m_hint(_hint ? std::make_shared<std::string>(*_hint) : nullptr),

m_unlimited(_unlimited)

{}

Hmm asserting it in this place is going to be difficult without knowing the LiteralKind. That is what determines the validity in the end. I have added a check to formatLiteral in the utilities, though, so that the hint isn't blindly taken but it is compared to the stored u256 value.

Alternatively, we could think about storing the LiteralKind in the LiteralValue as well. Then an assert in the constructor is possible and failure would occur earlier.

Yeah, it really feels like LiteralKind should be inside LiteralValue, because the value is ambiguous without it. If would let us get rid of a lot of redundant asserts that we now have to do at every step to ensure the values are valid.

But I think it would require changing a lot more code so perhaps better to leave at is for now and just finish the PR in the current state.

libyul/AsmAnalysis.cpp

libyul/AST.h

cameel · 2024-05-28T11:57:40Z

libyul/Utilities.cpp

+		if (_literal.kind == LiteralKind::Boolean)
+		{
+			yulAssert(_literal.value() == 0 || _literal.value() == 1, "Could not format boolean literal");
+			result = _literal.value() == 1 ? "true" : "false";
+		}
+		else
+			result = _literal.value().str();


This looks like you're formatting strings and integers the same way. How can that work? I.e. how does it preserve the distinction between 1234 and "1234"?

Actually, does the hint include the quotes? Because if it does not, then I can't see how we can distinguish values like '1234' and "1234" or '1234' and hex'1234'.

Also, if there's no distinction, the value cannot be unambiguously recovered from the hint (which matters if we want to enforce that they match).

By the way, it would be good to have those examples as test cases if we don't have them somewhere already.

I'd also add something like this:

object "A" { code { pop(datasize(hex'616263')) } data 'abc' "1234" }

I actually only now realized that we're allowing anything other than plain strings for literal arguments, not sure we even have it covered (only for arguments though, not in data).

The quoting and escaping of string literals is done in AsmPrinter, so the LiteralKind is also always required to unambiguously print and/or recover a literal. In your example you'd get

object "A" { code { pop(datasize(hex'616263')) } data 'abc' "1234" } // ---- // step: disambiguator // // object "A" { // code { pop(datasize("abc")) } // data "abc" hex"31323334" // }

as for almost ambiguous literals

{ let a := 1234 let a_hex := 0x4d2 let b := "1234" let b_hex := "0x4d2" let c := '1234' let c_hex := '0x4d2' let d := hex"1234" let d_hex := hex"04d2" } // ---- // step: disambiguator // // { // let a := 1234 // let a_hex := 0x4d2 // let b := "1234" // let b_hex := "0x4d2" // let c := "1234" // let c_hex := "0x4d2" // let d := "\x124" // let d_hex := "\x04\xd2" // }

Oh, so we're actually not preserving the representation perfectly? That's surprising. I had the impression that the AST allowed us to mostly reproduce the original source aside from comments and whitespace and that the representation hint was meant to help with that, but apparently not. I see from your output that it doesn't even keep hex numbers as hex so it seems kinda useless.

I wonder what we'd really lose by removing it. I wouldn't do it in this PR, but maybe we should consider that later?

…limited' - unlimited meaning an argument of a builtin function with corresponding entry in

cameel · 2024-06-07T17:45:40Z

libyul/AST.cpp

+LiteralValue::LiteralValue(std::string _builtinStringLiteralValue):
+	m_numericValue(0),


You don't seem to use the fact that it's an optional:

Suggested change

LiteralValue::LiteralValue(std::string _builtinStringLiteralValue):

m_numericValue(0),

LiteralValue::LiteralValue(std::string _builtinStringLiteralValue):

m_numericValue(std::nullopt),

With that you also don't really need the m_builtinStringLiteralValue flag.

BTW, class docstring says builtins - exceed the u256 word (32 bytes), in which case the value is set to `unlimited` . Did you meant that m_builtinStringLiteralValue is set to true?

Also, in SyntacticallyEqual::expressionEqual() you compare value() of normal and unlimited strings this way:

return valid && _lhs.value.value() == _rhs.value.value();

If you set the value to zero for unlimited they will compare as equal to a string "", which does not seem like it would represent actual syntactical equality.

cameel · 2024-06-07T17:55:53Z

libyul/AST.cpp

+bool LiteralValue::operator==(LiteralValue const& _rhs) const
+{
+	if (!unlimited() && !_rhs.unlimited())
+		return value() == _rhs.value();
+	else
+		return m_numericValue == _rhs.m_numericValue;
+}


Is this really correct? Shouldn't you be comparing m_stringValue for unlimited strings? m_numericValue is 0 for them (though it should really be nullptr) so they'll all compare equal.

cameel · 2024-06-07T20:09:32Z

libyul/Utilities.cpp

 	}
-	return it->second;
+	yulAssert(false);


You can use this helper to mark the rest of a function as unreachable:

Suggested change

yulAssert(false);

util::unreachable();

We shouldn't even need to do that, but at some point GCC broke its unreachability heuristic and they haven't fixed that since (see #13102/#13087). Hopefully we'll be able to purge all those unreachable() calls some day.

cameel · 2024-06-07T20:23:43Z

libyul/Utilities.cpp

-	if (_lhs.kind == LiteralKind::Number)
-		return valueOfNumberLiteral(_lhs) < valueOfNumberLiteral(_rhs);
+	if (_lhs.value.unlimited() != _rhs.value.unlimited())
+		return _lhs.value.unlimited() < _rhs.value.unlimited();
+
+	if (_lhs.value.unlimited())
+	{
+		yulAssert(
+			_lhs.kind == LiteralKind::String && _rhs.kind == LiteralKind::String,
+			"Cannot have unlimited value that is not of String kind."
+		);
+		return _lhs.value.builtinStringLiteralValue() < _rhs.value.builtinStringLiteralValue();
+	}
 	else
-		return _lhs.value < _rhs.value;
+		return _lhs.value.value() < _rhs.value.value();


Looks like the comparison of values can now be done without knowing the kind. In that case I think it would make sense to have an operator < in LiteralValue defined like this.

The only thing that needs the kind is the assert, but we could leave it here and after it just do _lhs.value < _rhs.value.

cameel · 2024-06-07T20:39:22Z

libyul/Utilities.cpp

+	if (_literal.kind == LiteralKind::Number && !_literal.value.unlimited())
+	{
+		if (_literal.value.hint())
+		{
+			auto const repr = *_literal.value.hint();
+			return (isValidDecimal(repr) || isValidHex(repr)) && bigint(repr) <= u256(-1);
+		}
+		return true;
+	}
+	return false;


Just a general style suggestion for these functions: I think the code is easier to process if you deal with special cases first rather than leave them hanging until you reach the else. It makes the flow more linear. For me it makes reading it easier:

Suggested change

if (_literal.kind == LiteralKind::Number && !_literal.value.unlimited())

{

if (_literal.value.hint())

{

auto const repr = *_literal.value.hint();

return (isValidDecimal(repr) || isValidHex(repr)) && bigint(repr) <= u256(-1);

}

return true;

}

return false;

if (!_literal.kind == LiteralKind::Number || _literal.value.unlimited())

return false;

if (!_literal.value.hint())

return true;

auto const repr = *_literal.value.hint();

return (isValidDecimal(repr) || isValidHex(repr)) && bigint(repr) <= u256(-1);

Similarly, in formatLiteral(), you could make it much less nested if the conditions returned immediately instead of storing the value in result only to return it at the end.

cameel · 2024-06-07T20:46:26Z

libyul/Utilities.cpp

+bool solidity::yul::validBoolLiteral(solidity::yul::Literal const& _literal)
 {
-	switch (_literal.kind)
+	if (_literal.kind == LiteralKind::Boolean && !_literal.value.unlimited())
 	{
-		case LiteralKind::Number:
-			return valueOfNumberLiteral(_literal);
-		case LiteralKind::Boolean:
-			return valueOfBoolLiteral(_literal);
-		case LiteralKind::String:
-			return valueOfStringLiteral(_literal);
-		default:
-			yulAssert(false, "Unexpected literal kind!");
+		return _literal.value.value() == true || _literal.value.value() == false;
 	}
+	return false;
 }


You're not validating the hint for boolean literals.

I see lots of checks for equality with "true" or "false" in the old code so I'm not sure why a lack of this check is not breaking anything. Are we missing test coverage?

cameel · 2024-06-07T21:02:15Z

libyul/Utilities.cpp

+	else
+		if (_literal.kind == LiteralKind::Boolean)
+		{
+			yulAssert(_literal.value.value() == 0 || _literal.value.value() == 1, "Could not format boolean literal");


formatLiteral() only has this single assert but I think there are more things you should have sanity checks against. For example that an unlimited literal is of the string kind or that the hint for boolean is actually true or false.

Or, if you define a validLiteral() function that executes the right validation based on the kind you could simply assert it at the beginning of the function. I've seen quite a few places where such a combined check would simplify assertions. AsmPrinter::operator()(Literal) is one example.

You might also use it at the end of valueOfLiteral() (or maybe better in each specific function, since they can be called individually). I see that you removed the original assertions from them but they seem still relevant to me. Or are the functions now used on invalid literals as well?

cameel · 2024-06-07T21:37:23Z

libyul/optimiser/SyntacticalEquality.cpp

+	if (_lhs.value.unlimited() || _rhs.value.unlimited())
+	{
+		if (_lhs.value.unlimited() && _rhs.value.unlimited())
+			return _lhs.value.builtinStringLiteralValue() == _rhs.value.builtinStringLiteralValue();
+		else
+		{
+			bool const valid = validStringLiteral(_lhs) && validStringLiteral(_rhs);
+			// the string literals are both valid, ie <32 chars
+			return valid && _lhs.value.value() == _rhs.value.value();
+		}
+	}
+
+	return _lhs.value == _rhs.value;


This compares value() between normal and unlimited strings. For syntactic equality, shouldn't we be comparing the string representations instead? I think that the previous version would have failed for different string representations of the same value and yours won't.

And value() comparison will never be true between a normal and unlimited string (assuming you the value of unlimited is nullopt). In fact an assert will fail if you run it on an unlimited string.

I wonder why none of this resulted in test failures. We're probably missing coverage here and if so, we should add a test that will trigger this code path.

Also, shouldn't invalid strings longer than 32 chars be syntactically equal to unlimited strings with the same representation? They will look identical in the source and have identical AST representation. And you already do consider two identical invalid strings to be equal so it would make sense to do the same thing for invalid compared with valid unlimited.

Can you even get an invalid string here? Maybe this should be an assert instead? Such a situation may only happen if you take an AST where a function call argument is a normal string on a position where an unlimited string is expected or the other way around. The parser after your changes won't ever produce anything like this so this could only happen if someone somehow imported doctored Yul AST with this exact node configuration - and we don't support Yul AST import right now. And if we did, I'd expect such AST to fail validation during import. We should probably just have a validLiteral() assert at the beginning of this function.

cameel · 2024-06-07T22:18:00Z

libyul/AsmAnalysis.cpp

 std::vector<YulString> AsmAnalyzer::operator()(Literal const& _literal)
 {
 	expectValidType(_literal.type, nativeLocationOf(_literal));
-	if (_literal.kind == LiteralKind::String && _literal.value.str().size() > 32)
+	if (_literal.kind == LiteralKind::String && !validStringLiteral(_literal))
 		m_errorReporter.typeError(
 			3069_error,
 			nativeLocationOf(_literal),
-			"String literal too long (" + std::to_string(_literal.value.str().size()) + " > 32)"
+			"String literal too long (" + std::to_string(formatLiteral(_literal).size()) + " > 32)"
 		);
-	else if (_literal.kind == LiteralKind::Number && bigint(_literal.value.str()) > u256(-1))
+	else if (_literal.kind == LiteralKind::Number && !validNumberLiteral(_literal))
 		m_errorReporter.typeError(6708_error, nativeLocationOf(_literal), "Number literal too large (> 256 bits)");
 	else if (_literal.kind == LiteralKind::Boolean)
-		yulAssert(_literal.value == "true"_yulstring || _literal.value == "false"_yulstring, "");
+		yulAssert(validBoolLiteral(_literal));


One small problem here is that if we ever add any extra checks to those validation functions, the messages will no longer match the errors.

We could make the messages more generic but maybe a better way to do this would be to keep the original checks and just assert validLiteral(_literal) after them to make sure that nothing slipped through?

cameel · 2024-06-07T22:53:55Z

libyul/AsmAnalysis.cpp

-					if (std::get<Literal>(arg).value.empty())
+					static u256 const empty {valueOfStringLiteral("").value()};
+					auto const& literalValue = std::get<Literal>(arg).value;
+					if ((literalValue.unlimited() && literalValue.builtinStringLiteralValue().empty()) || (!literalValue.unlimited() && literalValue.value() == empty))
 						m_errorReporter.typeError(
 							1844_error,
 							nativeLocationOf(arg),
 							"The \"verbatim_*\" builtins cannot be used with empty bytecode."
 						);
 				}
-
-				argTypes.emplace_back(expectUnlimitedStringLiteral(std::get<Literal>(arg)));
+				bool const isUnlimitedLiteralArgument = [&]() {
+					if (BuiltinFunction const* f = m_dialect.builtin(_funCall.functionName.name))
+						return f->literalArguments.size() >= i && f->literalArguments.at(i-1).has_value();
+					return false;
+				}();
+				argTypes.emplace_back(expectStringLiteral(std::get<Literal>(arg), isUnlimitedLiteralArgument));
 				continue;


I see that you're handling both normal and unlimited strings here even though here we know we're analyzing a call with unlimited literal arguments. Given that in the parser you made sure to use unlimited LiteralValue for such calls, it would be perfectly fine to just assert here that the argument is such a value and make the code simpler with that assumption.

In fact, if the parsing is wrong in some way and misses some cases, being too liberal here will let them slip through and we won't discover them easily.

BTW, isUnlimitedLiteralArgument is redundant anyway. It will always be true here because we're inside this overcomplicated condition that already checked that this is the case:

if ( auto literalArgumentKind = (literalArguments && i <= literalArguments->size()) ? literalArguments->at(i - 1) : std::nullopt )

nikola-matic previously approved these changes May 17, 2024

View reviewed changes

libyul/AsmJsonImporter.cpp Outdated Show resolved Hide resolved

libyul/AsmJsonImporter.cpp Outdated Show resolved Hide resolved

libyul/Utilities.cpp Outdated Show resolved Hide resolved

clonker force-pushed the rework_yul_literals_2 branch from e1c8cd3 to 3ca7a0a Compare May 21, 2024 06:12

nikola-matic reviewed May 22, 2024

View reviewed changes

cameel reviewed May 27, 2024

View reviewed changes

cameel reviewed May 28, 2024

View reviewed changes

yul literal value is struct of u256 + optional formatting hint or 'un…

f712d53

…limited' - unlimited meaning an argument of a builtin function with corresponding entry in

clonker force-pushed the rework_yul_literals_2 branch from b9373bf to f712d53 Compare May 29, 2024 11:02

cameel reviewed Jun 7, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

yul literal value as struct of u256 + optional formatting hint #15112

yul literal value as struct of u256 + optional formatting hint #15112

clonker commented May 16, 2024

nikola-matic May 22, 2024

clonker May 22, 2024

cameel left a comment

cameel May 27, 2024

cameel May 27, 2024

clonker May 29, 2024

cameel Jun 8, 2024

cameel May 28, 2024

cameel May 28, 2024

clonker May 29, 2024 •

edited

cameel Jun 8, 2024 •

edited

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024 •

edited

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024

cameel Jun 7, 2024 •

edited

cameel Jun 7, 2024

cameel Jun 7, 2024 •

edited

cameel Jun 7, 2024

		LiteralValue(Data const& _data, std::optional<std::string> const& _hint = std::nullopt, bool _unlimited = false):
		m_data(_data), m_hint(_hint ? std::make_shared<std::string>(*_hint) : nullptr), m_unlimited(_unlimited) {}

		LiteralValue::LiteralValue(std::string _builtinStringLiteralValue):
		m_numericValue(0),

yul literal value as struct of u256 + optional formatting hint #15112

Are you sure you want to change the base?

yul literal value as struct of u256 + optional formatting hint #15112

Conversation

clonker commented May 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cameel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clonker May 29, 2024 • edited

Choose a reason for hiding this comment

cameel Jun 8, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cameel Jun 7, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cameel Jun 7, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cameel Jun 7, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clonker May 29, 2024 •

edited

cameel Jun 8, 2024 •

edited

cameel Jun 7, 2024 •

edited

cameel Jun 7, 2024 •

edited

cameel Jun 7, 2024 •

edited