SELECT
statement: Retrieve rows from multiple tables.
Statement
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expression [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition ]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
[ LIMIT [ count | ALL ] ]
Parameter
[ WITH with_query [, ....] ]
You can use WITH to flatten nested queries or simplify subqueries. The with_query
statement is as follows:
subquery_table_name [ ( column_name [, ...] ) ] AS (subquery)
subquery_table_name indicates the unique name of a temporary table used to define the results of WITH subqueries. Each subquery must contain a table name that can be referenced in the FROM clause.
column_name [ , ...] indicates an optional list of output column names. The number of column names must be equal to or less than that of columns defined in the subquery.
subquery indicates a query statement.
[ ALL | DISTINCT ] select_expr
The ALL and DISTINCT options specify whether duplicate rows are returned. If none of these options are given, the default value ALL is used, which indicates that all matching rows are returned. DISTINCT specifies that duplicate rows are removed from the result set.
FROM from_item [, ...]
from_item can indicate a view, table, or subquery. If a multi-table join statement is used, supported JOIN statements are as follows:
[ INNER ] JOIN
LEFT [ OUTER ] JOIN
RIGHT [ OUTER ] JOIN
FULL [ OUTER ] JOIN
CROSS JOIN
ON join_condition
: If join_condition
is used, you can specify the column name for the join key of multiple tables. If join_column
is used, the join column must exist in both tables.
[ WHERE condition ]
Results are filtered according to the condition you have specified, and the set of results meeting the condition is returned.
[ GROUP BY [ ALL | DISTINCT ] grouping_expressions [, ...] ]
The GROUP BY clause divides the output into groups based on the specified column names.
[HAVING condition ]
This clause is used in conjunction with aggregate functions and the GROUP BY clause. It controls which groups are selected so as to eliminate groups that do not satisfy the given conditions. It filters groups after groups and aggregates are computed.
[{UNION | INTERSECT | EXCEPT} [ALL | DISTINCT] union_query]
UNION
, INTERSECT
, and EXCEPT
combine the results of more than one statement. UNION
combines all the rows in the result set from the first query with those in the result set from the second query. To eliminate duplication, UNION builds a hash table, which consumes memory resources. UNION ALL is recommended for better performance.
INTERSECT
returns only the rows in the result sets of both the first and the second queries.
EXCEPT
returns the rows in the result set of the first query, but not the second.
[ORDER BY expression [ ASC | DESC ] [ NULLS FIRST | NULLS LAST] [, ...] ]
This statement sorts a result set by one or more output expressions. If the clause contains multiple expressions, sorting is performed using the first expression
. Then, the second expression
is used to sort rows matching the first expression, and so on.
Example
WITH Clause
The WITH clause defines named relations for use within a query. It allows flattening nested queries or simplifying subqueries. For example, the following queries are equivalent:
WITH x AS (SELECT a, MAX(b) AS b FROM t GROUP BY a)
SELECT a, b FROM x;
This also works with multiple subqueries:
WITH
t1 AS (SELECT a, MAX(b) AS b FROM x GROUP BY a),
t2 AS (SELECT a, AVG(d) AS d FROM y GROUP BY a)
SELECT t1.*, t2.*
FROM t1
JOIN t2 ON t1.a = t2.a;
GROUP BY Clause
The GROUP BY clause divides the output of a SELECT statement into groups of rows containing matching values. A simple GROUP BY clause may contain any expression composed of input columns, or it may be an ordinal number selecting an output column by position.
SELECT count(*), nationkey FROM customer GROUP BY 2;
SELECT count(*), nationkey FROM customer GROUP BY nationkey;
SELECT count(*) FROM customer GROUP BY mktsegment;
GROUPING SETS
Grouping sets allow users to specify multiple lists of columns to group on. The columns not part of a given sublist of grouping columns are set to NULL.
SELECT origin_state, origin_zip, destination_state, sum(package_weight)
FROM shipping
GROUP BY GROUPING SETS (
(origin_state),
(origin_state, origin_zip),
(destination_state));
SELECT origin_state, NULL, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state
UNION ALL
SELECT origin_state, origin_zip, NULL, sum(package_weight)
FROM shipping GROUP BY origin_state, origin_zip
UNION ALL
SELECT NULL, NULL, destination_state, sum(package_weight)
FROM shipping GROUP BY destination_state;
HAVING Clause
The HAVING clause is used in conjunction with aggregate functions and the GROUP BY clause to control which groups are selected. A HAVING clause eliminates groups that do not satisfy the given conditions.
SELECT count(*), mktsegment, nationkey,
CAST(sum(acctbal) AS bigint) AS totalbal
FROM customer
GROUP BY mktsegment, nationkey
HAVING sum(acctbal) > 5700000
ORDER BY totalbal DESC;
IN
The IN operator allows users to specify multiple values in the WHERE clause.
SELECT name
FROM nation
WHERE regionkey IN (SELECT regionkey FROM region)
EXISTS
The EXISTS operator determines if a subquery returns any rows. If one or more rows are returned, this operator returns True. Otherwise, it returns False.
SELECT column_name(s)
FROM table_name
WHERE EXISTS
(SELECT column_name FROM table_name WHERE condition)
USING
The USING keyword is used to simplify queries.
The query must be an equal join.
Columns in an equal join must have the same name and data type.
SELECT *
FROM table_1
JOIN table_2
USING (key_A, key_B)
SELECT *
FROM (
VALUES
(1, 3, 10),
(2, 4, 20)
) AS table_1 (key_A, key_B, y1)
LEFT JOIN (
VALUES
(1, 3, 100),
(2, 4, 200)
) AS table_2 (key_A, key_B, y2)
USING (key_A, key_B);
CROSS JOIN
CROSS JOIN returns the Cartesian product (all combinations) of two relations.
SELECT *
FROM nation
CROSS JOIN region
LIMIT Clause
The LIMIT clause restricts the number of rows in the result set.
SELECT orderdate FROM orders LIMIT 5
ORDER BY Clause
The ORDER BY clause sorts a result set by one or more output expressions
Statement: ORDER BY expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...]
SELECT name, age FROM person ORDER BY age
SELECT * FROM student
ORDER BY student_id
SELECT * FROM student
ORDER BY student_id,student_name
EXCEPT
The EXCEPT clause/operator combines two SELECT statements and returns the rows that are in the result set of the first SELECT statement, but not the second SELECT statement. This means that EXCEPT only returns rows and does not apply to the second SELECT statement.
As with the UNION operator, the EXCEPT operator is better when the same rules are applied.
SELECT * FROM (VALUES 13, 42)
EXCEPT
SELECT 13
INTERSECT
INTERSET returns rows that exist in all result sets returned by two or more SELECT statements.
SELECT * FROM (VALUES 13, 42)
INTERSECT
SELECT 1
UNION
UNION combines the result sets of two or more SELECT statements into one result set. To retain duplicate rows in the result sets, use the UNION ALL
operator.
SELECT id FROM a
UNION ALL
SELECT id FROM b;
TABLESAMPLE
BERNOULLI
: Each row is selected to be in the table sample with a probability of the sample percentage. When a table is sampled using the Bernoulli method, all physical blocks of the table are scanned and certain rows are skipped (based on a comparison between the sample percentage and a random value calculated at runtime). The probability of a row being included in the result is independent from any other row. This does not reduce the time required to read the sampled table from disk. It may have an impact on the total query time if the sampled output is processed further.
SYSTEM
: This sampling method divides the table into logical segments of data and samples the table at this granularity. This sampling method either selects all rows from a specific data segment or skips it, based on a comparison between the sampling percentage and a randomly computed value at runtime. The row selected in system sampling depends on the connector used. For example, when Hive is used, it depends on how the data is laid out on HDFS. This method does not guarantee independent sampling probabilities.
SELECT *
FROM users TABLESAMPLE BERNOULLI (50);
SELECT *
FROM users TABLESAMPLE SYSTEM (75);
PIVOT Clause
Returns aggregate values based on specific columns.
SELECT * FROM person
PIVOT (
SUM(age) AS a, AVG(class) AS c
FOR name IN ('John' AS john, 'Mike' AS mike)
);
Lateral View Clause
LATERAL VIEW [ OUTER ] generator_function ( expression [ , ... ] ) [ table_alias ] AS column_alias [ , ... ]
Escaping
To escape a single quote, precede it with another single quote, as shown in the following example:
Specify escape characters or escape sequences when creating a table, for example, using the following method to create:
CREATE EXTERNAL TABLE IF NOT EXISTS `csv_test_2222` (
`_c0` STRING,
`_c1` INTEGER,
`_c2` INTEGER,
`_c3` INTEGER
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'separatorChar' = '''',
'quoteChar' = ''''
)
STORED AS `textfile`
LOCATION 'cosn://dlc-nj-1258469122/csv/100M/
Was this page helpful?