Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability for Maps to cast to another case where the field names are different #5703

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

HawaiianSpork
Copy link

Arrow Maps have field names for the elements of the fields, the field names are allowed to be any value and do not affect the type of the data.

This allows a Map where the field names are key_value, key, value to be mapped to a entries, keys, values.

This can be helpful in merging record batches that may have come from different sources. This also makes maps behave similar to lists which also have a field to distinguish their elements.

Which issue does this PR close?

Closes #5702.

Rationale for this change

Arrow has field names for list elements and map elements. arrow-rs can cast lists with one name for elements (like elements) to another name like `items). But this is not supported for the map type which has field names, one for the elements, one for the keys and one for the values. This can be a limitation for anyone using arrays of Maps from different sources.

What changes are included in this PR?

The cast functions now support casting from one Map type to another Map type where nothing is different other than the root field elements name, the key field name and value field name.

Are there any user-facing changes?

No, I would not consider this a user breaking change unless some client was expecting an error when map root level field names did not match.

…s are different.

Arrow Maps have field names for the elements of the fields, the field names are allowed to be any value and do not affect the type of the data.

This allows a Map where the field names are key_value, key, value to be mapped to a entries, keys, values.

This can be helpful in merging record batches that may have come from different sources.  This also makes maps behave similar to lists which also have a field to distinguish their elements.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 29, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @HawaiianSpork -- this looks pretty close to me

Some high level comments:

  1. I think we should try to avoid adding methods to DataType as they proposed ones are somewhat specialized and DataType is a pub type anyways (so its implementation is not encapsulated)
  2. The PR title says "where field names are different" but this PR not only supports different field names but also casts the key and value types as well, I think
  3. There are a few more tests I think are needed (error cases and Nulls)

All in all I think this PR is quite nice. Thanks again

&entry_offsets,
).unwrap();

let new_type = DataType::Map(Arc::new(Field::new("entries_new", DataType::Struct(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to add an assert here on what the type of array was (to show it wasn't the same as new-type)?

@@ -7100,6 +7116,54 @@ mod tests {
FixedSizeListArray::from(list_data)
}

#[test]
fn test_cast_map_containers() {
let keys = vec!["0", "1", "5", "6", "7"];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add test coverage for None (aka when the keys / values also have null values)?

I think it might also make this test clearer if you used MapBuilder to construct the inputs: https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html

@@ -670,6 +670,42 @@ impl DataType {
pub fn new_fixed_size_list(data_type: DataType, size: i32, nullable: bool) -> Self {
DataType::FixedSizeList(Arc::new(Field::new_list_field(data_type, nullable)), size)
}

/// Gets the key field in a map data type. For all other types returns None.
pub fn key_field(&self) -> Option<FieldRef> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are already matching on DataType::Map above you'll already have entries_field available -- so we could avoid adding this new API to Datatype

arrow-cast/src/cast/map.rs Outdated Show resolved Hide resolved
.collect::<Vec<_>>();
assert_eq!(&values_string, &vec!["test_val_1", "test_val_2", "long_test_val_3", "4", "test_val_5"]);
}

#[test]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also please add a negative tests for:

  1. not being able to cast two MapArrays with different values for ordered ?
  2. Two MapArrays where the values can not be cast (maybe where the values are Intervals and trying to cast to another map array where the values are Durations)

@@ -710,6 +718,14 @@ pub fn cast_with_options(
(_, FixedSizeList(ref to, size)) if *size == 1 => {
cast_values_to_fixed_size_list(array, to, *size, cast_options)
}
(Map(_, ordered1), Map(_, ordered2)) if ordered1 == ordered2 =>
cast_map_values(
array.as_map_opt()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could simply use as_map() given this is in the match clause when data type was a map. As in there is no reason to return this error as it is not expected and panic'ing would be consistent with the rest of the codebase if the data type didn't match the array

@@ -157,6 +159,12 @@ pub fn can_cast_types(from_type: &DataType, to_type: &DataType) -> bool {
(_, LargeList(list_to)) => can_cast_types(from_type, list_to.data_type()),
(_, FixedSizeList(list_to,size)) if *size == 1 => {
can_cast_types(from_type, list_to.data_type())},
(Map(_,ordered_from), Map(_, ordered_to)) if ordered_from == ordered_to =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below -- I think we could avoid key_type/value_type here

HawaiianSpork and others added 4 commits May 1, 2024 22:57
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
- simplify map casting logic to reuse the entries
- Added unit tests for negative cases
- Use MapBuilder to make the intended type clearer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maps cast to other Maps with different Elements, Key and Value Names
2 participants