HIVE-28655: Implement HMS Related Drop Stats Changes #5578

Open
wants to merge 1 commit into
base: master

DanielZhu58
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

Is the change a dependency upgrade?

How was this patch tested?

bool delete_table_column_statistics(1:string db_name, 2:string tbl_name, 3:string col_name, 4:string engine) throws
(1:NoSuchObjectException o1, 2:MetaException o2, 3:InvalidObjectException o3,
4:InvalidInputException o4)
bool delete_table_column_statistics_req(1: DeleteTableColumnStatisticsRequest req) throws
Member

Can we merge the two methods, i.e., into a single bool delete_column_statistics_req(DeleteColumnStatisticsRequest req)?

* catalog name, database name, table name, partition name, column names, and engine name
* @throws TException thrift transport error
*/
public boolean deletePartitionMultiColumnStatistics(DeletePartitionColumnStatisticsRequest req) throws TException;
Member

I would like to merge deletePartitionMultiColumnStatistics and deleteTableMultiColumnStatistics into a single method, e.g., deleteMultiColumnStatistics(DeleteColumnStatisticsRequest req), that drops both table statistics and partition statistics.

Contributor Author

Sometimes the user just wants to drop the stats of a specific partition.
In some cases the stats of a single partition become really huge, so we want to drop the stats for that partition only.
The two methods are designed for different use cases.
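
A minimal sketch of how a single request could still cover both use cases raised in this thread, assuming a hypothetical DeleteColumnStatisticsRequest whose partition name is optional. The class, field, and method names below are illustrative only; they are not the Thrift-generated types from this patch.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical unified request (an assumption for illustration, not generated Thrift code).
final class DeleteColumnStatisticsRequest {
  final String catName;
  final String dbName;
  final String tblName;
  final Optional<String> partName; // empty means table-level statistics
  final List<String> colNames;     // empty means all columns
  final String engine;

  DeleteColumnStatisticsRequest(String catName, String dbName, String tblName,
      String partName, List<String> colNames, String engine) {
    this.catName = catName;
    this.dbName = dbName;
    this.tblName = tblName;
    this.partName = Optional.ofNullable(partName);
    this.colNames = colNames == null ? List.of() : List.copyOf(colNames);
    this.engine = engine;
  }
}

// Hypothetical helper that only describes which statistics a request targets.
final class ColumnStatsDropper {
  static String describeTarget(DeleteColumnStatisticsRequest req) {
    String scope = req.partName
        .map(p -> "partition " + p + " of table " + req.tblName)
        .orElse("table " + req.tblName);
    String cols = req.colNames.isEmpty() ? "all columns" : "columns " + req.colNames;
    return "drop stats for " + cols + " on " + scope;
  }
}
```

With this shape, dropping the stats of one oversized partition remains a single call with partName set, while a table-level drop leaves partName unset.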

* @throws InvalidObjectException error dropping the stats
* @throws InvalidInputException bad input, such as null table or database name.
*/
boolean deletePartitionColumnStatisticsInBatch(String catName, String dbName, String tableName,
Member

What is the difference between deletePartitionMultiColumnStatistics and deletePartitionColumnStatisticsInBatch? Can they be merged?

Contributor Author

Yes, I agree. I think we can merge these 2 methods.

@@ -1395,9 +1395,50 @@ List<ColumnStatistics> getPartitionColumnStatistics(
* @throws InvalidInputException bad input, such as null table or database name.
*/
boolean deletePartitionColumnStatistics(String catName, String dbName, String tableName,
Member

You can make this a default method.

Contributor Author

Acknowledged.
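
One way to read the suggestion above is to keep the single-column overload as a default interface method that delegates to a multi-column variant. The sketch below assumes a trimmed-down, hypothetical StatsClient interface; the real interface and its parameter lists in this patch may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the client interface under review.
interface StatsClient {
  // Multi-column entry point that every implementation must provide.
  boolean deletePartitionColumnStatistics(String catName, String dbName, String tableName,
      String partName, List<String> colNames, String engine);

  // Single-column overload kept as a default method that delegates to the one above.
  default boolean deletePartitionColumnStatistics(String catName, String dbName,
      String tableName, String partName, String colName, String engine) {
    List<String> cols = colName == null ? null : List.of(colName);
    return deletePartitionColumnStatistics(catName, dbName, tableName, partName, cols, engine);
  }
}

// Test double that records the column lists it receives.
final class RecordingStatsClient implements StatsClient {
  final List<List<String>> calls = new ArrayList<>();

  @Override
  public boolean deletePartitionColumnStatistics(String catName, String dbName,
      String tableName, String partName, List<String> colNames, String engine) {
    calls.add(colNames);
    return true;
  }
}
```

Implementations then override only the multi-column method, and existing callers of the single-column signature keep compiling.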

import java.util.Optional;
import java.util.Set;
import java.util.Stack;
import java.util.*;
Member

nit: restore the import

Contributor Author

Acknowledged.


getMS().openTransaction();
try {
List<String> partVals = getPartValsFromName(getMS(), parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName, partName);
Member

partName is defined as optional in Thrift, so should we handle a null partName? To make the interface more generic, can we define partName as a List<String>, so that we can delete the statistics of multiple partitions?

Contributor Author

Currently we don't need to delete the stats for multiple partitions, because HQL has no syntax that lets the user drop stats for multiple partitions at once.
Also, in practice it is not very common for the stats of multiple adjacent partitions to become huge. The user can drop the stats of multiple partitions manually, one by one.
I think partName is not optional here.
In hive_metastore.thrift:
struct DeletePartitionColumnStatisticsRequest {
  1: required string cat_name,
  2: required string db_name,
  3: required string tbl_name,
  4: required string part_name,
  5: optional list<string> col_names,
  6: required string engine
}
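
If the request were ever generalized to a list of partition names, as the reviewer suggests, the server side would need to reject null or empty input before touching the store. The helper below is a hypothetical sketch of that validation; PartitionNames and normalize are illustrative names, not part of this patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical validation helper for a list-valued partition name field.
final class PartitionNames {
  // Rejects null, empty, or blank input and removes duplicates while keeping order.
  static List<String> normalize(List<String> partNames) {
    if (partNames == null || partNames.isEmpty()) {
      throw new IllegalArgumentException("At least one partition name is required");
    }
    LinkedHashSet<String> unique = new LinkedHashSet<>();
    for (String name : partNames) {
      if (name == null || name.trim().isEmpty()) {
        throw new IllegalArgumentException("Partition name must not be null or blank");
      }
      unique.add(name);
    }
    return new ArrayList<>(unique);
  }
}
```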

Table table = getMS().getTable(parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName);
// This API looks unused; if it were used we'd need to update stats state and write ID.
// We cannot just randomly nuke some txn stats.
if (TxnUtils.isTransactionalTable(table)) {
Member

Why can't we drop the statistics of a transactional table? Can we mark the partition stats as inaccurate in the partition parameters via StatsSetupConst.COLUMN_STATS_ACCURATE?

Contributor Author

These comments were not written by me. We can mark this issue as a to-do and discuss it later.

new DeletePartitionColumnStatEvent(parsedDbName[CAT_NAME], parsedDbName[DB_NAME], tableName,
partName, partVals, colName, engine, this));
}
if (!listeners.isEmpty()) {
Member

Should this call be outside of the transaction? Why do we notify the listeners only on success and when colNames is not null? Will some events be missed?
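
One possible pattern behind this question is to collect events while the transaction is open and deliver them to listeners only after a successful commit. The sketch below uses a hypothetical MetastoreTxn interface and plain Consumer listeners as stand-ins; it does not reflect how the handler in this patch or Hive's listener classes are actually wired.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical stand-in for the store's transaction API.
interface MetastoreTxn {
  void open();
  boolean commit();
  void rollback();
}

final class DeferredNotifier {
  // Runs the work inside a transaction and notifies listeners only after commit succeeds.
  static <E> boolean runAndNotify(MetastoreTxn txn, Supplier<List<E>> work,
      List<Consumer<E>> listeners) {
    List<E> pending = List.of();
    boolean committed = false;
    txn.open();
    try {
      pending = work.get();      // perform the deletes and collect the resulting events
      committed = txn.commit();
    } finally {
      if (!committed) {
        txn.rollback();          // nothing is delivered for a rolled-back transaction
      }
    }
    if (committed) {
      for (E event : pending) {
        for (Consumer<E> listener : listeners) {
          listener.accept(event);
        }
      }
    }
    return committed;
  }
}
```

The trade-off is that listeners which must take part in the same transaction cannot be served this way, so this only illustrates the after-commit case.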

throw new NoSuchObjectException("Partition " + partName
+ " for which stats deletion is requested doesn't exist");
}
query = pm.newQuery(MPartitionColumnStatistics.class);
Member

Where is this query used?

query = pm.newQuery(MPartitionColumnStatistics.class);
if (colNames != null) {
for (String colName : colNames){
deletePartitionColumnStatistics(catName, dbName, tableName, partName, partVals, colName, engine);
Member

Can we drop the statistics of all the columns in one round?
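
A rough sketch of what dropping in one round could look like: build one filter that matches every requested column, so the store issues a single delete instead of one call per column. The field names (partitionName, colName) and parameter names (t1, c0, c1, ...) are assumptions for illustration and are not confirmed against the JDO model used in this patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for a filter string and its named parameters.
final class BatchFilter {
  final String filter;
  final Map<String, String> params;

  private BatchFilter(String filter, Map<String, String> params) {
    this.filter = filter;
    this.params = params;
  }

  // Builds one filter covering all requested columns of a partition.
  // A null or empty column list means "all columns of the partition".
  static BatchFilter forColumns(String partName, List<String> colNames) {
    Map<String, String> params = new LinkedHashMap<>();
    params.put("t1", partName);
    StringBuilder sb = new StringBuilder("partitionName == t1");
    if (colNames != null && !colNames.isEmpty()) {
      List<String> terms = new ArrayList<>();
      for (int i = 0; i < colNames.size(); i++) {
        String param = "c" + i;
        params.put(param, colNames.get(i));
        terms.add("colName == " + param);
      }
      sb.append(" && (").append(String.join(" || ", terms)).append(")");
    }
    return new BatchFilter(sb.toString(), params);
  }
}
```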

@Aggarwal-Raghav
Contributor

@DanielZhu58, the Thrift code needs to be regenerated because hive_metastore.thrift has changed. Please use the -Pthriftif profile.

@Aggarwal-Raghav
Contributor

@DanielZhu58, one question: the drop stats command will be something like ALTER TABLE .. drop col/partition stats..., so changes to the ANTLR file (HiveParser.g) are required, right?

@DanielZhu58
Contributor Author

Hi @Aggarwal-Raghav, yes. The method changes in this patch focus mainly on the HMS API; they are only one part of a larger change.
Another patch that includes the HiveParser.g changes, authored by @ramesh0201, is coming soon:
#5176

@Aggarwal-Raghav
Contributor

ok
