This document demonstrates how to use Hive on EMR to create databases and tables, import data, and perform basic queries.
Development Preparation
Make sure you have activated Tencent Cloud and created an EMR cluster. For more details, see Creating a Cluster. During the creation of an EMR cluster, select the Hive component in the software configuration interface.
In the example, there are optional contents that require access to Tencent Cloud Object Storage (COS). You can see Creating a Bucket to create a bucket in COS, and enable COS authorization on the Instance Information page of the EMR console. Note:
You can see Logging in to a Linux Instance for the method to log in to an EMR node. On the cluster details page, select Cluster Resources > Resource Management, click the corresponding node resource ID to enter the CVM list, and click Login on the right to use WebShell to log in to the instance. The default username for logging in to a Linux instance is root, and the password is the one set by the user during the creation of the EMR. After entering the correct information, you will be taken to the command line interface.
All operations in this document are performed as the hadoop user. Switch the user identity after logging in to the command line interface.
Preparing Sample Data
Log in to the Master node and use the following command in the EMR command line to switch to the hadoop user and navigate to the Hive folder:
su hadoop
cd /usr/local/service/hive
Create a bash script file named gen_data.sh and add the following code to it:
#!/bin/bash
MAXROW=1000000
for((i = 0; i < $MAXROW; i++))
do
echo $RANDOM, \\"$RANDOM\\"
done
Grant permissions and execute the script as follows. This script will generate 1,000,000 pairs of random numbers and save them to the file hive_test.data:
chmod +x script name
./gen_data.sh > hive_test.data
Use the following command to upload the generated test data to HDFS, where ${hdfspath} is the path on HDFS where you want to store the file:
hdfs dfs -put ./hive_test.data /${hdfspath}
You can also use data stored in COS. Upload the data to COS. If the data is stored locally, you can use the COS console to upload it. If the data is in the EMR cluster, use the following command to upload the data, where ${bucketname} is the name of the COS bucket you created:
hdfs dfs -put ./hive_test.data cosn://${bucketname}/
Hive Basic Operations
Log in to the Master node of the EMR cluster, switch to the hadoop user, and enter the Hive command line using the Hive client:
Creating Databases and Tables
Use the SHOW syntax to display all current databases:
hive> show databases;
OK
default
Time taken: 0.26 seconds, Fetched: 1 row(s)
Use the CREATE DATABASE syntax to create a database named test:
hive> create database if not exists test;
OK
Time taken: 0.176 seconds
Use the USE syntax to switch to the created test database:
hive> use test;
OK
Time taken: 0.176 seconds
Use the CREATE TABLE syntax to create an internal table named hive_test under the test database:
hive> create table hive_test (a int, b string)
hive> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
OK
Time taken: 0.204 seconds
Finally, use the SHOW TABLES syntax to check if the table has been created successfully:
hive> show tables;
OK
hive_test
Time taken: 0.176 seconds, Fetched: 1 row(s)
Importing Data
For data stored in HDFS, use the following command to import it into the table:
hive> load data inpath "/${hdfspath}/hive_test.data" into table hive_test;
For data stored in COS, use the following command to import it into the table:
hive> load data inpath "cosn://${bucketname}/hive_test.data" into table hive_test;
Data stored locally in the EMR cluster can also be imported into Hive using the following command:
hive>load data local inpath "/${localpath}/hive_test.data" into table hive_test;
Note:
${hdfspath} is the path where your file is stored on HDFS, ${bucketname} is your COS bucket name, and ${localpath} is the path where your data is stored locally on the EMR cluster. After the import is completed, the source data will be deleted.
Executing Queries
Query the first 10 elements in the table:
hive> select * from hive_test limit 10;
OK
30847 "31583"
14887 "32053"
19741 "16590"
8104 "20321"
29030 "32724"
27274 "5231"
10028 "22594"
924 "32569"
10603 "27927"
4018 "30518"
Time taken:2.133 seconds, Fetched:10 row(s)
Count the total number of rows in the table:
hive> select count(*) from hive_test;
OK
1000000
Time taken:18.504 seconds, Fetched:1 row(s)
Deleting Databases and Tables
Use the DROP TABLE syntax to delete a Hive table:
hive> drop table if exists hive_test;
Moved: 'hdfs://HDFS/usr/hive/warehouse/hive_test' to trash at: hdfs:
OK
Time taken: 2.327 seconds
Use the DROP DATABASE syntax to delete a Hive database:
hive> drop database if exists test;
OK
Time taken: 0.531 seconds
Was this page helpful?