tencent cloud

All product documents
Data Lake Compute
Window Functions
Last updated: 2024-08-07 17:33:25
Window Functions
Last updated: 2024-08-07 17:33:25

row_number

Function statement:
row_number()
Supported engines: SparkSQL and Presto
Usage instructions: Assign a unique consecutive number to each row.
Return type: int
Example:

SELECT a, b, row_number() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 1 1
A1 1 2
A1 2 3
A2 3 1

rank

Function statement:
rank()
Supported engines: SparkSQL and Presto
Usage instructions: Compute the rank of a value within a group of values. If there are ties, leaves gaps in the ranking sequence.
Return type: int
Example:
SELECT a, b, rank() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 1
A1 2 3
A2 3 1

dense_rank

Function statement:
dense_rank()
Supported engines: SparkSQL and Presto
Usage instructions: Compute the rank of a value within a group of values. Unlike the rank function, dense_rank does not leave gaps in the ranking sequence in the case of ties.
Return type: int
Example:
SELECT a, b, dense_rank() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 1
A1 2 2
A2 3 1

percent_rank()

Function statement:
percent_rank()

Supported engines: SparkSQL and Presto
Usage instructions: Compute the percentile rank of a value within a group of values. The return value is a decimal between 0 and 1.
Return type: double
Example:
SELECT a, b, percent_rank() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 0.0
A1 1 0.0
A1 2 1.0
A2 3 0.0

cume_dist

Function statement:
cume_dist()
Supported engines: SparkSQL and Presto
Usage instructions: Compute the position of a value relative to all values within a partition.
Return type: double
Example:
SELECT a, b, cume_dist() OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 0.6666666666666666
A1 1 0.6666666666666666
A1 2 1.0
A2 3 1.0

first_value

Function statement:
first_value(col)
Supported engines: SparkSQL and Presto
Usage instructions: Return the value of the first row of a column in a partition.
Return type: data type of the col column
Example:
SELECT a, b, first_value(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 1
A1 2 1
A2 3 3

last_value

Function statement:
last_value(col)
Supported engines: SparkSQL and Presto
Usage instructions: Return the value of the last row of a column in a partition.
Return type: int
Example:
SELECT a, b, last_value(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 1
A1 2 2
A2 3 3

lag

Function statement:
lag(col[, n [, default]])
Supported engines: SparkSQL and Presto
Usage instructions: Return the value of the nth row above the current row in the window. The default value for n is 1, and the default value for default is null. If the value of the nth row is null, null is returned. If such an offset row does not exist (e.g., the first row of the window has no row above it when the offset is 1), default is returned. The first parameter is the column name, the second parameter is the nth row before, and the third parameter is the default value.
Return type: data type of the col column
Example:
SELECT a, b, lag(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 NULL
A1 1 1
A1 2 1
A2 3 NULL

lead

Function statement:
lead(col[, n[, default]])
Supported engines: SparkSQL and Presto
Usage instructions: Return the value of the nth row after the current row in the window. The default value for n is 1, and the default value for default is null. If the value of the nth row is null, null is returned. If such an offset row does not exist (e.g., the last row of the window has no row below it when the offset is 1), default is returned. The first parameter is the column name, the second parameter is the nth row before, and the third parameter is the default value.
Return type: data type of the col column
Example:
SELECT a, b, lead(b) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 2
A1 2 NULL
A2 3 NULL

nth_value

Function statement:
nth_value(col[, n])
Supported engines: SparkSQL and Presto
Usage instructions: Return the value of the nth row from the start of the window. n starts from 1. If ignoreNulls=true, null will be skipped when searching for the nth row. Otherwise, every row is counted in n. If the nth row does not exist (e.g., when n is 10 but the window size is less than 10), null is returned. The first parameter is the column name, and the second parameter is the nth row.
Return type: data type of the col column
Example:
SELECT a, b, nth_value(b, 2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 1
A1 2 1
A2 3 NULL

ntile

Function statement:
ntile(n)
Supported engines: SparkSQL and Presto
Usage instructions: Divide the rows in the window partition into n buckets and return the bucket number of the row, ranging from 1 to n.
Return type: int
Example:
SELECT a, b, ntile(2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)
A1 1 1
A1 1 1
A1 2 2
A2 3 1

CLUSTER_SAMPLE

Function statement:
CLUSTER_SAMPLE(<int> N[, <int> M]) over (PARTITION BY col1 ORDER by col2)
Supported engine: SparkSQL
Usage instructions: Sample within the window according to the specified proportion or quantity.
N: Required, int type. When only N is provided, it indicates sampling N data entries. The sampling result is close to N but not guaranteed to be exactly N.
M: Optional, int type. When M is specified, it indicates sampling the total number of entries in the M/N window. The sampling result is close to total entries of M/N .
Return type: boolean, where true indicates the row is sampled, and false indicates it is not sampled.
Example:
> SELECT a, b, cluster_sample(2) OVER (PARTITION BY a ORDER BY b) FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b);
A1 2 true
A1 1 true
A2 3 true
A1 1 false
> SELECT a, b from (select a, b, cluster_sample(2) OVER (PARTITION BY a ORDER BY b) as c FROM VALUES ('A1', 2), ('A1', 1), ('A2', 3), ('A1', 1) tab(a, b)) where c;
A1 2
A1 1
A2 3


Was this page helpful?
You can also Contact Sales or Submit a Ticket for help.
Yes
No

Feedback

Contact Us

Contact our sales team or business advisors to help your business.

Technical Support

Open a ticket if you're looking for further assistance. Our Ticket is 7x24 available.

7x24 Phone Support