tencent cloud

All product documents
Data Lake Compute
Collection Functions
Last updated: 2024-08-07 17:33:09
Collection Functions
Last updated: 2024-08-07 17:33:09

APPROX_COUNT_DISTINCT

Function statement:
APPROX_COUNT_DISTINCT(<expr> any [, <relativeSD> integer|double|decimal])
Supported engine: SparkSQL
Usage instructions: Return the cardinality estimated by HyperLogLog++. relativeSD defines the allowable maximum relative standard deviation.
Return type: bigint
Example:
SELECT approx_count_distinct(col1) FROM (VALUES (1), (1), (2), (2), (3)) tab(col1);
3

AVG

Function statement:
AVG(<expr> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the average value calculated based on the group's values.
Return type: double
Example:
> SELECT avg(col) FROM (VALUES (1), (2), (3)) AS tab(col);
2.0
> SELECT avg(col) FROM (VALUES (1), (2), (NULL)) AS tab(col);
1.5

CORR

Function statement:
CORR(<expr> integer|double|decimal, <expr> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the Pearson correlation coefficient between a set of numeric pairs.
Return type: double
Example:
> SELECT corr(c1, c2) FROM (VALUES (3, 2), (3, 3), (6, 4)) as tab(c1, c2);
0.8660254037844387

COUNT

Function statement:
COUNT(*)
COUNT([DISTINCT] <col1> ANY, <col2> ANY, ...)
Supported engines: SparkSQL and Presto
Usage instructions: COUNT(*): Return the total number of retrieved rows, including rows with null values. COUNT(<col1> ANY, <col2> ANY, ...): Return the number of rows where all provided expressions are non-null. COUNT([DISTINCT] <col1> ANY, <col2> ANY, ...): Return the number of unique and non-null rows for the provided expressions.
Return type: integer
Example:
> SELECT count(*) FROM (VALUES (NULL), (5), (5), (20)) AS tab(col);
4
> SELECT count(col) FROM (VALUES (NULL), (5), (5), (20)) AS tab(col);
3
> SELECT count(DISTINCT col) (FROM VALUES (NULL), (5), (5), (10)) AS tab(col);
2

COUNT_IF

Function statement:
COUNT_IF(<expr> ANY)
Supported engines: SparkSQL and Presto
Usage instructions: Return the number of rows where the expression is TRUE.
Return type: int
Example:
> SELECT count_if(col % 2 = 0) FROM (VALUES (NULL), (0), (1), (2), (3)) AS tab(col);
2
> SELECT count_if(col IS NULL) FROM (VALUES (NULL), (0), (1), (2), (3)) AS tab(col);
1

COVER_POP

Function statement:
COVAR_POP(<expr1> integer|double|decimal, <expr2> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the population covariance of a set of numeric pairs.
Return type: double
Example:
> SELECT covar_pop(c1, c2) FROM (VALUES (1,1), (2,2), (3,3)) AS tab(c1, c2);
0.6666666666666666

COVER_SAMP

Function statement:
COVAR_SAMP(<expr1> integer|double|decimal, <expr2> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the sample covariance of a set of numeric pairs.
Return type: double
Example:
> SELECT covar_samp(c1, c2) FROM (VALUES (1,1), (2,2), (3,3)) AS tab(c1, c2);
1.0

FIRST_VALUE

Function statement:
FIRST_VALUE(<expr> T[, <isIgnoreNull> boolean])
Supported engines: SparkSQL and Presto
Usage instructions: Return the first value of expr in a set of rows. If isIgnoreNull is true, only non-null values are returned.
Return type: T
Example:
> SELECT first_value(col) FROM (VALUES (10), (5), (20)) AS tab(col);
10
> SELECT first_value(col) FROM (VALUES (NULL), (5), (20)) AS tab(col);
NULL
> SELECT first_value(col, true) FROM (VALUES (NULL), (5), (20)) AS tab(col);
5

FIRST

Function statement:
FIRST(<expr> T[, <isIgnoreNull> boolean])
Supported engine: SparkSQL
Usage instructions: Return the first value of expr in a set of rows. If isIgnoreNull is true, only non-null values are returned.
Return type: T
Example:
> SELECT first(col) FROM (VALUES (10), (5), (20)) AS tab(col);
10
> SELECT first(col) FROM (VALUES (NULL), (5), (20)) AS tab(col);
NULL
> SELECT first(col, true) FROM (VALUES (NULL), (5), (20)) AS tab(col);
5

KURTOSIS

Function statement:
KURTOSIS(<expr> integer|double|decimal)
Supported engine: SparkSQL
Usage instructions: Return the kurtosis value calculated based on the group's values.
Return type: double
Example:
> SELECT kurtosis(col) FROM (VALUES (-10), (-20), (100), (1000)) AS tab(col);
-0.7014368047529627
> SELECT kurtosis(col) FROM (VALUES (1), (10), (100), (10), (1)) as tab(col);
0.19432323191699075s

LAST_VALUE

Function statement:
LAST_VALUE(<expr> T[, <isIgnoreNull> boolean])
Supported engines: SparkSQL and Presto
Usage instructions: Return the last value of expr in a set of rows. If isIgnoreNull is true, only non-null values are returned.
Return type: T
Example:
> SELECT last_value(col) FROM (VALUES (10), (5), (20)) AS tab(col);
20
> SELECT last_value(col) FROM (VALUES (10), (5), (NULL)) AS tab(col);
NULL
> SELECT last_value(col, true) FROM (VALUES (10), (5), (NULL)) AS tab(col);
5

LAST

Function statement:
LAST(<expr> T[, <isIgnoreNull> boolean])
Supported engine: SparkSQL
Usage instructions: Return the last value of expr in a set of rows. If isIgnoreNull is true, only non-null values are returned.
Return type: T
Example:
> SELECT last(col) FROM (VALUES (10), (5), (20)) AS tab(col);
20
> SELECT last(col) FROM (VALUES (10), (5), (NULL)) AS tab(col);
NULL
> SELECT last(col, true) FROM (VALUES (10), (5), (NULL)) AS tab(col);
5

MEAN

Function statement:
MEAN(<expr> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the average value calculated based on the group's values.
Return type: double
Example:
> SELECT mean(col) FROM (VALUES (1), (2), (3)) AS tab(col);
2.0
> SELECT mean(col) FROM (VALUES (1), (2), (NULL)) AS tab(col);
1.5

PERCENTILE

Function statement:
PERCENTILE(<col> ANY, <percentage> integer|double|decimal|array<double> [, <frequency> integer])
Supported engines: SparkSQL and Presto
Usage instructions: Return the exact percentile value of the numeric column col at the given percentage. The percentage value must be between 0.0 and 1.0. frequency should be a positive integer.
Return type: double
Example:
> SELECT percentile(col, 0.3) FROM (VALUES (0), (10)) AS tab(col);
3.0
> SELECT percentile(col, array(0.25, 0.75)) FROM (VALUES (0), (10)) AS tab(col);
[2.5,7.5]

SKEWNESS

Function statement:
SKEWNESS(<col> integer|double|decimal)
Supported engine: SparkSQL
Usage instructions: Return the skewness value calculated based on the group's values.
Return type: double
Example:
> SELECT skewness(col) FROM (VALUES (-10), (-20), (100), (1000)) AS tab(col);
1.1135657469022011
> SELECT skewness(col) FROM (VALUES (-1000), (-100), (10), (20)) AS tab(col);
-1.1135657469022011

PERCENTILE_APPROX

Function statement:
PERCENTILE_APPROX(<col> integer|double|decimal, <percentage> double|array<double>[, <accuracy> integer])
Supported engines: SparkSQL and Presto
Usage instructions: Return the approximate percentile of the numeric column col, which is the minimum value in the ordered col values (sorted from minimum to maximum) such that no more than the given percentage of col values are less than or equal to this value. The percentage value must be between 0.0 and 1.0. The accuracy parameter (default value: 10,000) is a positive numeric literal that controls the approximation accuracy at the cost of memory. The higher the accuracy value, the better the precision. 1.0/accuracy is the approximate relative error. When percentage is an array, each value in the percentage array must be between 0.0 and 1.0. In this case, an array of approximate percentiles for the column col corresponding to the given percentage array is returned.
Return type: integer |array<integer>
Example:
> SELECT percentile_approx(col, array(0.5, 0.4, 0.1), 100) FROM (VALUES (0), (1), (2), (10)) AS tab(col);
[1,1,0]
> SELECT percentile_approx(col, 0.5, 100) FROM (VALUES (0), (6), (7), (9), (10)) AS tab(col);
7

APPROX_PERCENTILE

Function statement:
APPROX_PERCENTILE(<col> integer|double|decimal, <percentage> double|array<double>[, <accuracy> integer])
Supported engines: SparkSQL and Presto
Usage instructions: Return the approximate percentile of the numeric column col, which is the minimum value in the ordered col values (sorted from minimum to maximum) such that no more than the given percentage of col values are less than or equal to this value. The percentage value must be between 0.0 and 1.0. The accuracy parameter (default value: 10,000) is a positive numeric literal that controls the approximation accuracy at the cost of memory. The higher the accuracy value, the better the precision. 1.0/accuracy is the approximate relative error. When percentage is an array, each value in the percentage array must be between 0.0 and 1.0. In this case, an array of approximate percentiles for the column col corresponding to the given percentage array is returned.
Return type: integer |array<integer>
Example:
> SELECT APPROX_PERCENTILE(col, array(0.5, 0.4, 0.1), 100) FROM (VALUES (0), (1), (2), (10)) AS tab(col);
[1,1,0]
> SELECT APPROX_PERCENTILE(col, 0.5, 100) FROM (VALUES (0), (6), (7), (9), (10)) AS tab(col);
7

MAX

Function statement:
MAX(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the maximum value of col.
Return type: same as col
Example:
> SELECT max(col) FROM (VALUES (10), (50), (20)) AS tab(col);
50

MAX_BY

Function statement:
MAX_BY(<x> T, <y> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the x value associated with the maximum y value.
Return type: T
Example:
> SELECT max_by(x, y) FROM (VALUES (('a', 10)), (('b', 50)), (('c', 20))) AS tab(x, y);
b

MIN

Function statement:
MIN(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the minimum value of col.
Return type: same as col
Example:
> SELECT min(col) FROM (VALUES (10), (50), (20)) AS tab(col);
10

MIN_BY

Function statement:
MIN_BY(<x> T, <y> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the x value associated with the minimum y value.
Return type: T
Example:
> SELECT min_by(x, y) FROM (VALUES (('a', 10)), (('b', 50)), (('c', 20))) AS tab(x, y);
a

STD

Function statement:
STD(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto

Usage Instructions

SparkSQL: Return the sample standard deviation calculated based on the group's values.
Presto: Return the population standard deviation calculated based on the group's values.
Return type: double
Example:
> SELECT std(col) FROM (VALUES (1), (2), (3)) AS tab(col);
1.0

STDDEV

Function statement:
STDDEV(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the sample standard deviation calculated based on the group's values.
Return type: double
Example:
> SELECT stddev(col) FROM (VALUES (1), (2), (3)) AS tab(col);
1.0

STDDEV_POP

Function statement:
STDDEV_POP(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the population standard deviation calculated based on the group's values.
Return type: double
Example:
> SELECT stddev_pop(col) FROM (VALUES (1), (2), (3)) AS tab(col);
0.816496580927726

STDDEV_SAMP

Function statement:
STDDEV_SAMP(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the sample standard deviation calculated based on the group's values.
Return type: double
Example:
> SELECT stddev_samp(col) FROM (VALUES (1), (2), (3)) AS tab(col);
1.0

SUM

Function statement:
SUM(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the sum calculated based on the group's values.
Return type: same as col
Example:
> SELECT sum(col) FROM (VALUES (5), (10), (15)) AS tab(col);
30
> SELECT sum(col) FROM (VALUES (NULL), (10), (15)) AS tab(col);
25
> SELECT sum(col) FROM (VALUES (NULL), (NULL)) AS tab(col);
NULL

VARIANCE

Function statement:
VARIANCE(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the sample variance calculated based on the group's values.
Return type: double
Example:
> SELECT VARIANCE(col) FROM (VALUES (1), (2), (3)) AS tab(col);
1.0

VAR_POP

Function statement:
VAR_POP(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the population variance calculated based on the group's values.
Return type: double
Example:
> SELECT var_pop(col) FROM (VALUES (1), (2), (3)) AS tab(col);
0.6666666666666666

VAR_SAMP

Function statement:
VAR_SAMP(<col> integer|double|decimal)
Supported engines: SparkSQL and Presto
Usage instructions: Return the sample variance calculated based on the group's values.
Return type: double
Example:
> SELECT var_samp(col) FROM (VALUES (1), (2), (3)) AS tab(col);
1.0

HISTOGRAM_NUMERIC

Function statement:
HISTOGRAM_NUMERIC(<col> integer, <nb> integer)
Supported engine: Presto
Usage instructions: Compute a histogram of a numeric column in the group using nb bins with non-uniform intervals. The output is an array of nb (x, y) coordinates, representing the center and height of the bins.
Return type: array<struct {'x','y'}>
Example:
> SELECT histogram_numeric(col, 5) FROM (VALUES (0), (1), (2), (10)) AS tab(col);
[{"x":0,"y":1.0},{"x":1,"y":1.0},{"x":2,"y":1.0},{"x":10,"y":1.0}]

COLLECT_LIST

Function statement:
COLLECT_LIST(<col> T)
Supported engines: SparkSQL and Presto
Usage instructions: Collect and return a list of non-unique elements.
Return type: array<T>
Example:
> SELECT collect_list(col) FROM (VALUES (1), (2), (1)) AS tab(col);
[1,2,1]

COLLECT_SET

Function statement:
COLLECT_SET(<col> T)
Supported engines: SparkSQL and Presto
Usage instructions: Collect and return a set of unique elements.
Return type: array<T>
Example:
> SELECT collect_set(col) FROM (VALUES (1), (2), (1)) AS tab(col);
[1,2]

COUNT_MIN_SKETCH

Function statement:
COUNT_MIN_SKETCH(<col> T, <eps> double, <confidence> double, <seed> integer)
Supported engine: SparkSQL
Usage instructions: Return the Count-Min Sketch of a column with the given esp, confidence, and seed. The result is binary and can be deserialized into CountMinSketch before use.
Return type: binary
Example:
> SELECT hex(count_min_sketch(col, 0.5d, 0.5d, 1)) FROM (VALUES (1), (2), (1)) AS tab(col);
0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000

EVERY

Function statement:
EVERY(<col> boolean)
Supported engines: SparkSQL and Presto
Usage instructions: Return true if all values of col are true.
Return type: boolean
Example:
> SELECT every(col) FROM (VALUES (true), (true), (true)) AS tab(col);
true
> SELECT every(col) FROM (VALUES (NULL), (true), (true)) AS tab(col);
true
> SELECT every(col) FROM (VALUES (true), (false), (true)) AS tab(col);
false

BOOL_AND

Function statement:
BOOL_AND(<col> boolean)
Supported engines: SparkSQL and Presto
Usage instructions: Return true if all values of col are true.
Return type: boolean
Example:
> SELECT bool_and(col) FROM (VALUES (true), (true), (true)) AS tab(col);
true
> SELECT bool_and(col) FROM (VALUES (NULL), (true), (true)) AS tab(col);
true
> SELECT bool_and(col) FROM (VALUES (true), (false), (true)) AS tab(col);
false

AND

Function statement:
<expr1> AND <expr2>
Supported engines: SparkSQL and Presto

Usage Instructions

Logical AND
Return type: boolean
Example:
> SELECT true and true;
true
> SELECT true and false;
false
> SELECT true and NULL;
NULL
> SELECT false and NULL;
false

OR

Function statement:
<expr1> OR <expr2>
Supported engines: SparkSQL and Presto
Usage instructions: Logical OR
Return type: boolean
Example:
> SELECT true or false;
true
> SELECT false or false;
false
> SELECT true or NULL;
true
> SELECT false or NULL;
NULL

ANY

Function statement:
ANY(<col> boolean)
Supported engines: SparkSQL and Presto
Usage instructions: Return true if at least one value of col is true.
Return type: boolean
Example:
> SELECT any(col) FROM (VALUES (true), (false), (false)) AS tab(col);
true
> SELECT any(col) FROM (VALUES (NULL), (true), (false)) AS tab(col);
true
> SELECT any(col) FROM (VALUES (false), (false), (NULL)) AS tab(col);
false

SOME

Function statement:
SOME(<col> boolean)
Supported engines: SparkSQL and Presto
Usage instructions: Return true if at least one value of col is true.
Return type: boolean
Example:
> SELECT some(col) FROM (VALUES (true), (false), (false)) AS tab(col);
true
> SELECT some(col) FROM (VALUES (NULL), (true), (false)) AS tab(col);
true
> SELECT some(col) FROM (VALUES (false), (false), (NULL)) AS tab(col);
false

BOOL_OR

Function statement:
BOOL_OR(<col> boolean)
Supported engines: SparkSQL and Presto
Usage instructions: Return true if at least one value of col is true.
Return type: boolean
Example:
> SELECT BOOL_OR(col) FROM (VALUES (true), (false), (false)) AS tab(col);
true
> SELECT BOOL_OR(col) FROM (VALUES (NULL), (true), (false)) AS tab(col);
true
> SELECT BOOL_OR(col) FROM (VALUES (false), (false), (NULL)) AS tab(col);
false

BIT_AND

Function statement:
BIT_AND(<col> integer|bigint)
Supported engines: SparkSQL and Presto
Usage instructions: Return the bitwise AND of all non-null input values, or null if there are none.
Return type: same as col
Example:
> SELECT bit_and(col) FROM (VALUES (3), (5)) AS tab(col);
1

BIT_OR

Function statement:
BIT_OR(<col> integer|bigint)
Supported engines: SparkSQL and Presto
Usage instructions: Return the bitwise OR of all non-null input values, or null if there are none.
Return type: same as col
Example:
> SELECT bit_or(col) FROM (VALUES (3), (5)) AS tab(col);
7

BIT_XOR

Function statement:
BIT_XOR(<col> integer|bigint)
Supported engines: SparkSQL and Presto
Usage instructions: Return the bitwise XOR of all non-null input values, or null if there are none.
Return type: same as col
Example:
> SELECT bit_xor(col) FROM (VALUES (3), (5)) AS tab(col);
6

ARG_MIN

Function statement:
ARG_MIN(<col1>, <col2> | expr(col2))
Supported engine: SparkSQ
Usage instructions: Return the row with the minimum value of the specified column col1 based on the specified calculation expression.
col1: For sortable column names, if a constant is provided, any value may be returned. If there are multiple minimum values, one of them is returned randomly.
Return type: same as col2 or expr(col2) type
Example:
> SELECT arg_min(dt, uid) from values (1, 'm1'), (2, 't2'), (3, 'z3') as tab(dt, uid);
m1
> SELECT arg_min(dt, upper(uid)) from values (1, 'm1', 1), (2, 't2', 1), (3, 'z3', 2) as tab(dt, uid, gid) group by gid;
M1
Z3

ARG_MAX

Function statement:
ARG_MAX(<col1>, <col2> | expr(col2))
Supported engine: SparkSQL
Usage instructions: Return the row with the maximum value of the specified column col1 based on the specified calculation expression.
col1: For sortable column names, if a constant is provided, any value may be returned. If there are multiple maximum values, one of them is returned randomly.
Return type: same as col2 or expr(col2) type
Example:
> SELECT arg_max(dt, uid) from values (1, 'm1'), (2, 't2'), (3, 'z3') as tab(dt, uid);
z3
> SELECT arg_max(dt, upper(uid)) from values (1, 'm1', 1), (2, 't2', 1), (3, 'z3', 2) as tab(dt, uid, gid) group by gid;
T2
Z3

MAP_UNION_SUM

Function statement:
MAP_UNION_SUM(map<k, v> input)
Supported engine: SparkSQL
Usage instructions: Specify the column and sum all the values in the map by key.
Note 1: Aggregated data is allowed to be null. If all are null, the output is null.
Note 2: The value of the key is allowed to be null, in which case it will be converted to the number 0.
Return type: same as input type
Example:
> SELECT map_union_sum(ids) from values (map(1, 1)), (map(1,2)), (map(2,1)) as tab(ids);
{1:3,2:1}
> SELECT idx, map_union_sum(ids) from values (map(1, 1), 1), (map(1,2), 1), (map(2,1), 1) as tab(ids, idx) group by idx;
1 {1:3,2:1}

Was this page helpful?
You can also Contact Sales or Submit a Ticket for help.
Yes
No

Feedback

Contact Us

Contact our sales team or business advisors to help your business.

Technical Support

Open a ticket if you're looking for further assistance. Our Ticket is 7x24 available.

7x24 Phone Support
Hong Kong, China
+852 800 906 020 (Toll Free)
United States
+1 844 606 0804 (Toll Free)
United Kingdom
+44 808 196 4551 (Toll Free)
Canada
+1 888 605 7930 (Toll Free)
Australia
+61 1300 986 386 (Toll Free)
EdgeOne hotline
+852 300 80699
More local hotlines coming soon