
Partition schema mangling for ORC #21

Open
omalley opened this issue Mar 14, 2018 · 4 comments

Comments

@omalley
Contributor

omalley commented Mar 14, 2018

No description provided.

@rdblue
Contributor

rdblue commented Mar 14, 2018

What do you mean by "schema mangling"?

@omalley
Contributor Author

omalley commented Mar 14, 2018

I wasn't clear on what was required; I just see that the reader path for Parquet and Avro mangles the schemas for partitioned tables:

    if (hasJoinedPartitionColumns) {
      // schema used to read data files
      Schema readSchema = TypeUtil.selectNot(requiredSchema, idColumns);
      Schema partitionSchema = TypeUtil.select(requiredSchema, idColumns);
      Schema joinedSchema = TypeUtil.join(readSchema, partitionSchema);
      PartitionRowConverter convertToRow = new PartitionRowConverter(partitionSchema, spec);
      JoinedRow joined = new JoinedRow();

      InternalRow partition = convertToRow.apply(file.partition());
      joined.withRight(partition);

      // create joined rows and project from the joined schema to the final schema
      Iterator<InternalRow> joinedIter = transform(
          newParquetIterator(location, task, readSchema), joined::withLeft);

      unsafeRowIterator = transform(joinedIter,
          APPLY_PROJECTION.bind(projection(finalSchema, joinedSchema))::invoke);

so I assume I need something similar for ORC. I just didn't dig into the details to understand what was happening in that code.

@rdblue
Contributor

rdblue commented Mar 16, 2018

Makes sense. For identity partitions, where the exact value is stored in the manifest file, we join to those values and then project to get the column order to match the table's order (we don't reorder columns because of a limitation in Spark's Parquet read path that we are reusing).
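[Editor's note] The join-and-project step described above can be sketched in plain Java, without Spark (all class and method names below are hypothetical, not Iceberg's actual API): data files in a partitioned table omit identity-partition columns, so the reader appends the constant partition values recorded in the manifest to every data row, then reorders the columns to match the table schema.

```java
import java.util.Arrays;

public class JoinProjectSketch {
    // The "join" step: append constant partition values to a data row.
    static Object[] join(Object[] dataRow, Object[] partitionValues) {
        Object[] joined = Arrays.copyOf(dataRow, dataRow.length + partitionValues.length);
        System.arraycopy(partitionValues, 0, joined, dataRow.length, partitionValues.length);
        return joined;
    }

    // The "project" step: reorder the joined row into the table's column order.
    static Object[] project(Object[] joinedRow, int[] ordinals) {
        Object[] out = new Object[ordinals.length];
        for (int i = 0; i < ordinals.length; i++) {
            out[i] = joinedRow[ordinals[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical table schema: (id, event_date, value), partitioned by
        // identity(event_date). Data files store only (id, value); the manifest
        // supplies the constant event_date value for this file.
        Object[] dataRow = {1L, 42.0};
        Object[] partition = {"2018-03-14"};
        Object[] joined = join(dataRow, partition);          // (id, value, event_date)
        Object[] row = project(joined, new int[] {0, 2, 1}); // (id, event_date, value)
        System.out.println(Arrays.toString(row));            // [1, 2018-03-14, 42.0]
    }
}
```

Because the partition values are constant for a whole data file, the joined right-hand side is built once per file and reused for every row, which is what the quoted snippet's `joined.withRight(partition)` call accomplishes.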

@rdblue
Contributor

rdblue commented Sep 9, 2018

I think a refactor a while back fixed this. We still need to extend the tests for this in Spark to include ORC.

Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/iceberg that referenced this issue Apr 12, 2019
* Add ManifestFile and migrate Snapshot to return it.
* Optionally write manifest lists to separate files.
    This adds a new table property, write.manifest-lists.enabled, that
    defaults to false. When enabled, new snapshot manifest lists will be
    written into separate files. The file location will be stored in the
    snapshot metadata as "manifest-list".
* Aggregate partition field summaries when writing manifests.
* Add InclusiveManifestEvaluator.
    This expression evaluator determines whether a manifest needs to be
    scanned or whether it cannot contain data files matching a partition
    predicate.
* Add file length to ManifestFile.
* Ensure files in manifest lists have helpful metadata.
    This modifies SnapshotUpdate when writing a snapshot with a manifest
    list file. If files for the manifest list do not have full metadata,
    then this will scan the manifests to add metadata, including snapshot
    ID, added/existing/deleted count, and partition field summaries.
* Add partitions name mapping when reading Snapshot manifest list.
* Update ScanSummary and FileHistory to use ManifestFile metadata.
    This optimizes ScanSummary and FileHistory to ignore manifests that
    cannot have changes in the configured time range.
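[Editor's note] The InclusiveManifestEvaluator idea in the commit above can be sketched roughly as follows (simplified, hypothetical names; the real evaluator walks a full expression tree against per-field bound summaries): a manifest may be skipped only when the predicate definitely cannot match any partition value within the summarized bounds, so the check must err on the side of scanning.

```java
public class ManifestEvaluatorSketch {
    // Per-partition-field summary kept in the manifest list: min/max bounds
    // over all data files in the manifest (longs here for simplicity).
    static class FieldSummary {
        final long lowerBound;
        final long upperBound;
        FieldSummary(long lower, long upper) {
            this.lowerBound = lower;
            this.upperBound = upper;
        }
    }

    // Inclusive check for an equality predicate `field == value`: returns true
    // when the manifest MIGHT contain matching files, false only when it
    // provably cannot.
    static boolean mightContain(FieldSummary summary, long value) {
        return value >= summary.lowerBound && value <= summary.upperBound;
    }

    public static void main(String[] args) {
        // e.g. a date partition encoded as yyyyMMdd
        FieldSummary dates = new FieldSummary(20180101L, 20180331L);
        System.out.println(mightContain(dates, 20180215L)); // true: scan the manifest
        System.out.println(mightContain(dates, 20190101L)); // false: safe to skip
    }
}
```

The asymmetry is deliberate: a false positive only costs an extra manifest scan, while a false negative would silently drop matching data files, so the evaluator is "inclusive" of anything it cannot rule out.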