GH-28866: [Java] Java Dataset API ScanOptions expansion #41646

Open
wants to merge 10 commits into base: main

Contributor

@jinchengchenghh jinchengchenghh commented May 14, 2024

Rationale for this change

What changes are included in this PR?

Support passing an ArrowSchema to specify the C++ CsvFragmentScanOptions.convert_options.column_types.
Use a Map to set the config for CsvFragmentScanOptions: it is serialized in Java and deserialized in C++.

Are these changes tested?

Yes, with newly added unit tests.

Are there any user-facing changes?

No.

@jinchengchenghh jinchengchenghh marked this pull request as draft May 14, 2024 08:15

⚠️ GitHub issue #28866 has been automatically assigned in GitHub to PR creator.

@jinchengchenghh
Contributor Author

Could you help review this PR? If the overall approach is OK, I will add more common config options in this PR. Thanks! @westonpace

*/
public ByteBuffer serialize() {
Map<String, String> options = Stream.concat(Stream.concat(readOptions.entrySet().stream(),
parseOptions.entrySet().stream()),
Contributor Author

I put all the options into a single map because it is easy to implement, and currently no option name is shared between the C++ parse_options and read_options. To extend this further, we may need to serialize more precisely. I'm open to changing this if you think we should serialize each option separately.

Collaborator

I believe serializing each option separately would be better. But I see your point; maybe we could do it in a follow-up PR.

cc @lidavidm @westonpace
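
For illustration, the trade-off discussed in this thread can be sketched in Java. This is a minimal sketch and not the PR's code: the class name `OptionMergeSketch`, the `merge` helper, and the `block_size` key are hypothetical, while `delimiter` and `quoting` are option names that appear elsewhere in this PR. It relies on the two-argument `Collectors.toMap`, which throws `IllegalStateException` on a duplicate key, so a name clash between read and parse options would surface as an error instead of silently overwriting a value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class OptionMergeSketch {
  /** Merge both option maps into one; a key present in both maps raises IllegalStateException. */
  static Map<String, String> merge(Map<String, String> readOptions,
                                   Map<String, String> parseOptions) {
    return Stream.concat(readOptions.entrySet().stream(), parseOptions.entrySet().stream())
        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
  }

  public static void main(String[] args) {
    Map<String, String> read = new HashMap<>();
    read.put("block_size", "1024"); // hypothetical read option, for illustration only
    Map<String, String> parse = new HashMap<>();
    parse.put("delimiter", ";");
    parse.put("quoting", "true");
    System.out.println(merge(read, parse).size()); // three distinct keys
  }
}
```

If the two option groups ever share a name, the flat map stops being sufficient, which is the case the per-option serialization suggested above would cover.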

@github-actions github-actions bot added awaiting committer review and removed awaiting review labels May 16, 2024
@lidavidm
Member

CC @vibhatha

@@ -108,7 +108,7 @@ ARROW_SUBSTRAIT_BUILD_SHA256_CHECKSUM=f989a862f694e7dbb695925ddb7c4ce06aa6c51aca
ARROW_S2N_TLS_BUILD_VERSION=v1.3.35
ARROW_S2N_TLS_BUILD_SHA256_CHECKSUM=9d32b26e6bfcc058d98248bf8fc231537e347395dd89cf62bb432b55c5da990d
ARROW_THRIFT_BUILD_VERSION=0.16.0
- ARROW_THRIFT_BUILD_SHA256_CHECKSUM=f460b5c1ca30d8918ff95ea3eb6291b3951cf518553566088f3f2be8981f6209
+ ARROW_THRIFT_BUILD_SHA256_CHECKSUM=df2931de646a366c2e5962af679018bca2395d586e00ba82d09c0379f14f8e7b
Contributor Author

Accidental change from my local environment; I will remove it.

column_types[field->name()] = field->type();
}
} else {
return Status::Invalid("Not support this config " + it.first);
Collaborator

Maybe:

Suggested change
- return Status::Invalid("Not support this config " + it.first);
+ return Status::Invalid("Config " + it.first + " is not supported.");

}

if (!literal.has_map()) {
return Status::Invalid("Literal does not have map");
Collaborator

nit:

Suggested change
- return Status::Invalid("Literal does not have map");
+ return Status::Invalid("Literal does not have a map");

#endif
default:
std::string error_message =
"illegal file format id: " + std::to_string(file_format_id);
Collaborator

nit:

Suggested change
- "illegal file format id: " + std::to_string(file_format_id);
+ "Illegal file format id: " + std::to_string(file_format_id);

* @param config config map
* @return buffer passed as the JNI call argument; should be a DirectByteBuffer
*/
default ByteBuffer serializeMap(Map<String, String> config) {
Collaborator

Is this function just written to pass a Java Map to C++ via JNI?

Contributor Author

Yes
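
As a rough picture of what handing a Java Map to C++ through a direct buffer involves, here is a minimal sketch. It deliberately swaps in a simple length-prefixed UTF-8 encoding rather than the Substrait protobuf literal this PR uses; `MapBufferSketch`, its two methods, and the wire layout are assumptions made for illustration only. The decoder is written in Java purely to show the round trip; in the PR the reading side lives in C++.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapBufferSketch {
  /** Layout (assumed): int entryCount, then for each key and value: int length + UTF-8 bytes. */
  static ByteBuffer serializeMap(Map<String, String> config) {
    List<byte[]> parts = new ArrayList<>();
    int size = Integer.BYTES; // room for the entry count
    for (Map.Entry<String, String> e : config.entrySet()) {
      byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
      byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
      parts.add(key);
      parts.add(value);
      size += 2 * Integer.BYTES + key.length + value.length;
    }
    // A direct buffer lives outside the Java heap, so native code can read it in place.
    ByteBuffer buf = ByteBuffer.allocateDirect(size); // default byte order is big-endian
    buf.putInt(config.size());
    for (byte[] part : parts) {
      buf.putInt(part.length);
      buf.put(part);
    }
    buf.flip(); // switch from writing to reading
    return buf;
  }

  /** Java-side decoder, used here only to demonstrate the round trip. */
  static Map<String, String> deserializeMap(ByteBuffer buf) {
    ByteBuffer view = buf.duplicate(); // leave the caller's position untouched
    int count = view.getInt();
    Map<String, String> out = new LinkedHashMap<>();
    for (int i = 0; i < count; i++) {
      String key = readString(view);
      String value = readString(view);
      out.put(key, value);
    }
    return out;
  }

  private static String readString(ByteBuffer view) {
    byte[] bytes = new byte[view.getInt()];
    view.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    Map<String, String> config = new LinkedHashMap<>();
    config.put("delimiter", ";");
    ByteBuffer buf = serializeMap(config);
    System.out.println(buf.isDirect() + " " + deserializeMap(buf));
  }
}
```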

} else if (key == "quoting") {
options->parse_options.quoting = parseBool(value);
} else if (key == "column_type") {
int64_t schema_address = std::stol(value);
Collaborator

Should we check for a possible -1?

Contributor Author

I changed the Java side so that it does not add an invalid schema address.
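
A Java-side guard of the kind described in this reply might look like the following minimal sketch. The class `SchemaAddressSketch`, the `buildConfig` helper, and the `NO_SCHEMA` constant are hypothetical; the `column_type` key and the -1 value come from the snippet and question above. Treating -1 as the only invalid address is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

final class SchemaAddressSketch {
  /** Hypothetical sentinel meaning "no schema was supplied". */
  static final long NO_SCHEMA = -1L;

  /** Copy the base config and add the schema address only when one was supplied. */
  static Map<String, String> buildConfig(Map<String, String> base, long schemaAddress) {
    Map<String, String> config = new HashMap<>(base);
    if (schemaAddress != NO_SCHEMA) {
      // The native side parses this string back into an integer address.
      config.put("column_type", Long.toString(schemaAddress));
    }
    return config;
  }

  public static void main(String[] args) {
    Map<String, String> base = new HashMap<>();
    base.put("delimiter", ";");
    System.out.println(buildConfig(base, NO_SCHEMA).containsKey("column_type")); // false
    System.out.println(buildConfig(base, 4096L).get("column_type")); // 4096
  }
}
```

With the key omitted entirely, the C++ branch for `column_type` is never reached for a missing schema, so it does not need its own sentinel check.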


import io.substrait.proto.Expression;

public class StringMapNode implements Serializable {
Collaborator

Looking at the functionality, what we have here is a utility class that converts a particular map config into a particular Substrait protobuf message. Since it could be used in other cases, it could live under a substrait.util package. And perhaps toProtobuf could be named mapToExpressionLiteral()?

Collaborator

I also have doubts about having a separate class for this purpose though.

Collaborator

@vibhatha vibhatha left a comment

I only added a few comments. But I am going to go through the content once more.

@github-actions github-actions bot added awaiting changes, awaiting change review and removed awaiting committer review, awaiting changes labels May 22, 2024
@@ -43,7 +45,8 @@ private JniWrapper() {
* @return the native pointer of the arrow::dataset::FileSystemDatasetFactory instance.
* @see FileFormat
*/
- public native long makeFileSystemDatasetFactory(String uri, int fileFormat);
+ public native long makeFileSystemDatasetFactory(String uri, int fileFormat,
Collaborator

Update Java docs?

@@ -54,7 +57,8 @@ private JniWrapper() {
* @return the native pointer of the arrow::dataset::FileSystemDatasetFactory instance.
* @see FileFormat
*/
- public native long makeFileSystemDatasetFactory(String[] uris, int fileFormat);
+ public native long makeFileSystemDatasetFactoryWithFiles(String[] uris, int fileFormat,
Collaborator

Update Java docs

@@ -80,7 +80,8 @@ private JniWrapper() {
* @return the native pointer of the arrow::dataset::Scanner instance.
*/
public native long createScanner(long datasetId, String[] columns, ByteBuffer substraitProjection,
Collaborator

update Java docs?

ByteBuffer serialize();

/**
* serialize the map.
Collaborator

nit:

Suggested change
- * serialize the map.
+ * Serialize the map.

assertEquals(schema.getFields(), reader.getVectorSchemaRoot().getSchema().getFields());
int rowCount = 0;
while (reader.loadNextBatch()) {
assertEquals("[1, 2, 3]", reader.getVectorSchemaRoot().getVector("Id").toString());
Collaborator

nit: should we check all columns?

Contributor Author

No, the Id column is enough: the test covers the delimiter config and the change of the Id column's type from the default int64 to int32.

Collaborator

@vibhatha vibhatha left a comment

@jinchengchenghh Added a few more comments.

@vibhatha
Collaborator

@jinchengchenghh will re-review this later today.

@jinchengchenghh sorry about the unexpected delay. I have added a few comments today.

@vibhatha vibhatha requested a review from lidavidm May 31, 2024 23:48
zhztheplayer pushed a commit to apache/incubator-gluten that referenced this pull request Jun 3, 2024
Supports basic options for now; more options will be supported after the Arrow patch is merged.

apache/arrow#41646

Before this patch, if the required schema differed from the file schema, the CSV read would fall back.
Also changed to use the column index in the file instead of checking the file column name, to account for case sensitivity.
Add a new common test function when the rule applies to Logical plan.

Compile arrow with version 15.0.0-gluten, upgrade arrow-dataset and arrow-c-data version from 15.0.0 to 15.0.0-gluten.
*
- * @return Substrait Expression
+ * @return substrait Expression Literal
Collaborator

nit:

Suggested change
- * @return substrait Expression Literal
+ * @return Substrait Expression Literal

Contributor Author

I see that the brief descriptions start with an uppercase letter, but the @param and @return text starts with lowercase. Which is the expected style? https://github.com/apache/arrow/blob/main/java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java#L80

Collaborator

Oh sorry, you're right.

Member

"Substrait" is a proper noun, though

@vibhatha
Collaborator

vibhatha commented Jun 3, 2024

@jinchengchenghh shall we rebase and see if we can run the CIs?

@vibhatha
Collaborator

vibhatha commented Jun 3, 2024

@kou there is a C GLib & Ruby CI job failing. I am not sure if this is related, though.
There is also a failure in continuous integration that is due to a conda issue.

@jinchengchenghh
Contributor Author

I suppose not, the newly added commit 1360221 only involves comments.

@kou
Member

kou commented Jun 3, 2024

@vibhatha #41903

In general, you can search for an existing issue when you find a CI failure that may be unrelated to a PR change. If you don't find an existing issue, you can open a new issue for it.

@kou kou changed the title GH-28866: [JAVA] Java Dataset API ScanOptions expansion GH-28866: [Java] Java Dataset API ScanOptions expansion Jun 3, 2024
@vibhatha
Collaborator

vibhatha commented Jun 3, 2024

> @vibhatha #41903
>
> In general, you can search for an existing issue when you find a CI failure that may be unrelated to a PR change. If you don't find an existing issue, you can open a new issue for it.

Thanks @kou will do that.

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Jun 3, 2024
Member

@lidavidm lidavidm left a comment

I know we don't maintain a high standard of docs, but it would be appreciated to have basic docstrings for new classes/methods

@@ -25,6 +25,8 @@
<arrow.cpp.build.dir>../../../cpp/release-build/</arrow.cpp.build.dir>
<parquet.version>1.13.1</parquet.version>
<avro.version>1.11.3</avro.version>
<substrait.version>0.31.0</substrait.version>
<protobuf.version>3.25.3</protobuf.version>
Member

Is this not already defined by dependencyManagement elsewhere?

const std::unordered_map<std::string, std::string>& configs) {
switch (file_format_id) {
#ifdef ARROW_CSV
case 3:
Member

Is there some enum/constant that defines the possible values?
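
One way to avoid a bare `case 3:` is to name the ids in a single place. The sketch below shows the idea in Java; `FormatIdSketch` and its methods are hypothetical, and only the CSV id of 3 is taken from the snippet above. The Javadoc quoted earlier in this PR points at an existing `FileFormat` type (`@see FileFormat`), which may already serve this purpose on the Java side.

```java
import java.util.Optional;

enum FormatIdSketch {
  CSV(3); // the only id confirmed by the snippet under review

  private final int id;

  FormatIdSketch(int id) {
    this.id = id;
  }

  int id() {
    return id;
  }

  /** Look up a format by its numeric id; empty when the id is unknown. */
  static Optional<FormatIdSketch> fromId(int id) {
    for (FormatIdSketch format : values()) {
      if (format.id == id) {
        return Optional.of(format);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(fromId(3).isPresent());  // true
    System.out.println(fromId(99).isPresent()); // false
  }
}
```

Returning an empty `Optional` for an unknown id keeps the "illegal file format id" decision with the caller rather than hiding it inside the lookup.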

/// \param[in] buf a buffer containing the protobuf serialization of a Substrait Literal
/// \param[out] out deserialize to this map.
ARROW_ENGINE_EXPORT Status
DeserializeMap(const Buffer& buf, std::unordered_map<std::string, std::string>& out);
Member

This all feels very specific to the internals of the JNI bindings. I don't think this or the new FromProto should be in the public API or the Arrow libraries, ideally, but in the JNI bindings instead.

@@ -85,6 +85,10 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
struct ARROW_DS_EXPORT CsvFragmentScanOptions : public FragmentScanOptions {
std::string type_name() const override { return kCsvTypeName; }

/// \brief Construct FragmentScanOptions from config map
Member

Ideally we'd document the possible options? But this also feels like something that should be inside the JNI bindings specifically and not the public API.

4 participants