Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-6806: Moving code for a HashAgg partition into separate class. #1515

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ilooner
Copy link
Contributor

@ilooner ilooner commented Oct 29, 2018

Goal

The goal of this change is to move code for managing a partition in HashAgg into a seperate class, similar to the HashPartition class in HashJoinBatch. This has several advantages:

  • Ease of testing: We can test basic partition logic like spilling independently of the rest of the HashAgg operator.
  • Reducing generated code: Currently HashAggTemplate contains the bulk of the code for HashAgg and it is a generated class. The larger the HashAggTemplate class is, the more space that the generated code for HashAgg consumes. Additionally generated code is more difficult to debug during testing, so moving code out of HashAggTemplate into a non-generated class simplifies debugging.
  • Memory Calculations: Moving the code for a partition into a separate class enables us to simplify and unit test the memory calculations in HashAgg.

The original design doc describing the motivation behind the change and how it relates to memory calculations is here.

Implementation HashAggPartition

In order to achieve this I introduced some interfaces:

  • HashAggPartition: This is the interface for a partition. It is useful to provide an interface for the partition so that we can test other pieces of operator logic with a mock partition.
  • HashAggMemoryCalculator: This currently encapsulates the logic for reserving and releasing space space in HashAgg's allocator at various points in the operator's execution. This interface was created so that the HashAggPartition could rely on a clear calculator interface that could be swapped out with different implementations for testing.
  • HashAggBatchAllocator: This interface includes the method that is used to allocate space for an outgoing batch. This interface is implemented by HashAggTemplate and is required by the HashAggPartition class. The interface is necessary because HashAggTemplate is generated at runtime, while the HashAggPartition class is not generated. So HashAggPartition cannot use methods implemented by HashAggTemplate unless they are included in an interface.
  • HashAggBatchHolder: The BatchHolder class is an inner class of HashAggTemplate and is generated. Since the code for HashAggPartition is not generated it can only interact with a BatchHolder through an interface. So the HashAggBatchHolder interface was added.

Then the relevant code for HashAggPartition, and HashAggMemoryCalculator calculator were moved out of HashAggTemplate and into implementation classes HashAggPartitionImpl and HashAggMemoryCalculatorImpl.

The rest of the code changes focus on replacing the usages of the following arrays with a single HashAggPartition array in HashAggTemplate:

  • hashTables
  • batchHolders
  • outBatchIndex
  • writers
  • spilledBatchesCount
  • spillFiles

Next Steps

This change is a first step to moving the code for a partition into a separate HashAggPartition class. I did not move all of it in this change in order to make review easier and to minimize the risk for bugs. A follow up change will complete moving all the code relevant to a partition into HashAggPartition. The main body of code that remains to be moved is most of the HashAggTemplate.checkGroupAndAggrValues function.

Additionally some debugging statements which printed out various partition stats were removed in this change. After the code for HashAggTemplate.checkGroupAndAggrValues is moved into HashAggPartition, I will make the stats available to print for debugging on the HashAggPartition class itself.

@ilooner ilooner requested a review from Ben-Zvi October 29, 2018 22:54
@ilooner
Copy link
Contributor Author

ilooner commented Oct 29, 2018

@Ben-Zvi Please review. This also fixes the code generation errors that I introduced in a previous commit.

@ilooner
Copy link
Contributor Author

ilooner commented Nov 2, 2018

@Ben-Zvi Added description of the changes

Copy link
Contributor

@Ben-Zvi Ben-Zvi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared the old-new codes; seem like a nice work - +1
One suggestion: Can take some of the implementation descriptions written for the PR and add them as JavaDoc for the new interface/classes.

}

spilledBatchesCount += getBatchHolderCount(); // update count of spilled batches
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comment: The original code was tracking and logging/tracing the number of rows spilled (and # batches) at this point. Why was that removed ?

@vdiravka
Copy link
Member

@ilooner Could we proceed with this PR?

@ilooner
Copy link
Contributor Author

ilooner commented Nov 17, 2018

@vdiravka still need to address @Ben-Zvi 's comments. I will try to get it done early next week.

@ilooner
Copy link
Contributor Author

ilooner commented Nov 28, 2018

@vdiravka Got overloaded with some deadlines for another project. I won't be able to finish this for this release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants