
Partition schema mangling for ORC #21

Open
omalley opened this issue Mar 14, 2018 · 4 comments

Comments

@omalley
Contributor

omalley commented Mar 14, 2018

No description provided.

@rdblue
Contributor

rdblue commented Mar 14, 2018

What do you mean by "schema mangling"?

@omalley
Contributor Author

omalley commented Mar 14, 2018

I wasn't clear on what was required; I just see that the reader path for Parquet and Avro mangles the schemas for partitioned tables:

    if (hasJoinedPartitionColumns) {
      // schema used to read data files
      Schema readSchema = TypeUtil.selectNot(requiredSchema, idColumns);
      Schema partitionSchema = TypeUtil.select(requiredSchema, idColumns);
      Schema joinedSchema = TypeUtil.join(readSchema, partitionSchema);
      PartitionRowConverter convertToRow = new PartitionRowConverter(partitionSchema, spec);
      JoinedRow joined = new JoinedRow();

      InternalRow partition = convertToRow.apply(file.partition());
      joined.withRight(partition);

      // create joined rows and project from the joined schema to the final schema
      Iterator<InternalRow> joinedIter = transform(
          newParquetIterator(location, task, readSchema), joined::withLeft);

      unsafeRowIterator = transform(joinedIter,
          APPLY_PROJECTION.bind(projection(finalSchema, joinedSchema))::invoke);

so I assume I need something similar for ORC. I just didn't dig into the details to understand what was happening in that code.

@rdblue
Contributor

rdblue commented Mar 16, 2018

Makes sense. For identity partitions, where the exact value is stored in the manifest file, we join to those values and then project to get the column order to match the table's order (we don't reorder columns because of a limitation in Spark's Parquet read path that we are reusing).
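[Editor's note] The join-and-project step described above can be sketched in plain Java, without Spark (all class and method names below are hypothetical, not Iceberg's actual API): data files in a partitioned table omit identity-partition columns, so the reader appends the constant partition values recorded in the manifest to every data row, then reorders the columns to match the table schema.

```java
import java.util.Arrays;

public class JoinProjectSketch {
    // The "join" step: append constant partition values to a data row.
    static Object[] join(Object[] dataRow, Object[] partitionValues) {
        Object[] joined = Arrays.copyOf(dataRow, dataRow.length + partitionValues.length);
        System.arraycopy(partitionValues, 0, joined, dataRow.length, partitionValues.length);
        return joined;
    }

    // The "project" step: reorder the joined row into the table's column order.
    static Object[] project(Object[] joinedRow, int[] ordinals) {
        Object[] out = new Object[ordinals.length];
        for (int i = 0; i < ordinals.length; i++) {
            out[i] = joinedRow[ordinals[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical table schema: (id, event_date, value), partitioned by
        // identity(event_date). Data files store only (id, value); the manifest
        // supplies the constant event_date value for this file.
        Object[] dataRow = {1L, 42.0};
        Object[] partition = {"2018-03-14"};
        Object[] joined = join(dataRow, partition);          // (id, value, event_date)
        Object[] row = project(joined, new int[] {0, 2, 1}); // (id, event_date, value)
        System.out.println(Arrays.toString(row));            // [1, 2018-03-14, 42.0]
    }
}
```

Because the partition values are constant for a whole data file, the joined right-hand side is built once per file and reused for every row, which is what the quoted snippet's `joined.withRight(partition)` call accomplishes.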

@rdblue
Contributor

rdblue commented Sep 9, 2018

I think a refactor a while back fixed this. We still need to extend the tests for this in Spark to include ORC.

Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/iceberg that referenced this issue Apr 12, 2019
* Add ManifestFile and migrate Snapshot to return it.
* Optionally write manifest lists to separate files.
    This adds a new table property, write.manifest-lists.enabled, that
    defaults to false. When enabled, new snapshot manifest lists will be
    written into separate files. The file location will be stored in the
    snapshot metadata as "manifest-list".
* Aggregate partition field summaries when writing manifests.
* Add InclusiveManifestEvaluator.
    This expression evaluator determines whether a manifest needs to be
    scanned or whether it cannot contain data files matching a partition
    predicate.
* Add file length to ManifestFile.
* Ensure files in manifest lists have helpful metadata.
    This modifies SnapshotUpdate when writing a snapshot with a manifest
    list file. If files for the manifest list do not have full metadata,
    then this will scan the manifests to add metadata, including snapshot
    ID, added/existing/deleted count, and partition field summaries.
* Add partitions name mapping when reading Snapshot manifest list.
* Update ScanSummary and FileHistory to use ManifestFile metadata.
    This optimizes ScanSummary and FileHistory to ignore manifests that
    cannot have changes in the configured time range.
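[Editor's note] The InclusiveManifestEvaluator idea in the commit above can be sketched roughly as follows (simplified, hypothetical names; the real evaluator walks a full expression tree against per-field bound summaries): a manifest may be skipped only when the predicate definitely cannot match any partition value within the summarized bounds, so the check must err on the side of scanning.

```java
public class ManifestEvaluatorSketch {
    // Per-partition-field summary kept in the manifest list: min/max bounds
    // over all data files in the manifest (longs here for simplicity).
    static class FieldSummary {
        final long lowerBound;
        final long upperBound;
        FieldSummary(long lower, long upper) {
            this.lowerBound = lower;
            this.upperBound = upper;
        }
    }

    // Inclusive check for an equality predicate `field == value`: returns true
    // when the manifest MIGHT contain matching files, false only when it
    // provably cannot.
    static boolean mightContain(FieldSummary summary, long value) {
        return value >= summary.lowerBound && value <= summary.upperBound;
    }

    public static void main(String[] args) {
        // e.g. a date partition encoded as yyyyMMdd
        FieldSummary dates = new FieldSummary(20180101L, 20180331L);
        System.out.println(mightContain(dates, 20180215L)); // true: scan the manifest
        System.out.println(mightContain(dates, 20190101L)); // false: safe to skip
    }
}
```

The asymmetry is deliberate: a false positive only costs an extra manifest scan, while a false negative would silently drop matching data files, so the evaluator is "inclusive" of anything it cannot rule out.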