Hive is a data warehouse architecture built on the Hadoop Distributed File System (HDFS). It provides many data warehouse management features, including data ETL (extraction, transformation, and loading), data storage management, and query and analysis of large data sets. It also defines Hive-SQL, an SQL-like language that lets users perform operations much as they would in SQL. Hive-SQL maps structured data files to database tables and provides simple SQL query capabilities, converting SQL statements into MapReduce jobs so that developers do not need to write Mapper and Reducer code by hand. This makes it a powerful complement to the MapReduce framework: the learning cost is low, and MapReduce statistics can be collected quickly and easily through SQL-like statements, eliminating the need to develop dedicated MapReduce applications. All of these advantages make Hive well suited to statistical analysis of data warehouses.
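For instance, a single SQL-like statement such as the following aggregate query (over a hypothetical table named logs, used here purely for illustration) is compiled by Hive into a MapReduce job automatically:
hive> select host, count(*) from logs group by host;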
Hive uses Hadoop's HDFS as its file storage system, making it easy to expand storage capacity and increase computing power. It achieves the same horizontal scalability as Hadoop, so a cluster of thousands of servers can be built conveniently. In general, Hive is designed for mining massive amounts of data, but its real-time performance is relatively poor.
Hive supports both internal and external tables. An internal table is stored under /usr/hive/warehouse/$tablename in HDFS, where $tablename is the name of the table you created. Simply load files matching the table definition into this directory, and they can be queried with Hive-SQL. This document demonstrates how to create tables in an EMR cluster and query them through Hive.
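As a quick illustration of the warehouse layout, once the hive_test table created later in this document exists, you can list its directory directly (assuming default settings):
[hadoop@172 hive]$ hdfs dfs -ls /usr/hive/warehouse/hive_test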
Before starting, you need a SecretId and SecretKey, which can be viewed on the API Key Management page. If there is no key yet, click Create Key to create one. Hive is installed in the /usr/local/service/ directory of the EMR cluster.
Log in to any node (preferably a master one) in the EMR cluster first. For more information on how to log in to EMR, please see Logging in to Linux Instance Using Standard Login Method. You can choose to log in with WebShell: click "Log in" on the right of the desired CVM instance to enter the login page. The default username is root, and the password is the one you set when creating the EMR cluster. Once the correct credentials are entered, you will be taken to the command line interface.
Run the following command on the EMR command line interface to switch to the Hadoop user, then go to the Hive folder:
[root@172 ~]# su hadoop
[hadoop@172 ~]$ cd /usr/local/service/hive
Create a bash script file named gen_data.sh and add the following code to it:
#!/bin/bash
MAXROW=1000000  # Specify the number of data rows to be generated
for ((i = 0; i < MAXROW; i++))
do
    echo $RANDOM, \"$RANDOM\"
done
Then make the script executable and run it as follows:
[hadoop@172 hive]$ chmod +x gen_data.sh
[hadoop@172 hive]$ ./gen_data.sh > hive_test.data
This script file will generate 1,000,000 random number pairs and save them to the hive_test.data file.
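You can spot-check the file before loading it; since the values come from $RANDOM, your numbers will differ from the illustrative output below:
[hadoop@172 hive]$ head -2 hive_test.data
18391, "2756"
9083, "31405"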
To put the file in HDFS, run the following command, where $hdfspath is the path of your file in HDFS:
[hadoop@172 hive]$ hdfs dfs -put ./hive_test.data /$hdfspath
To put the file in COS instead, run the following command, where $bucketname is the name of the COS bucket you created:
[hadoop@172 hive]$ hdfs dfs -put ./hive_test.data cosn://$bucketname/
Log in to a master node of the EMR cluster, switch to the Hadoop user, go to the Hive directory, and connect to Hive:
[root@172 ~]# su hadoop
[hadoop@172 hive]$ cd /usr/local/service/hive/bin
[hadoop@172 bin]$ hive
You can use the -h parameter to get basic information on Hive commands. You can also connect to the database in Beeline mode. To do so, you likewise need to log in to a master node in EMR, switch to the Hadoop user, and go to the Hive directory. In the conf/hive-site.xml configuration file, look up the connection port $port and host address $host of HiveServer2:
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>$host</value>
</property>
<property>
    <name>hive.server2.thrift.port</name>
    <value>$port</value>
</property>
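As a convenience (not part of the original steps), you can print both values from the configuration file with a one-liner:
[hadoop@172 hive]$ grep -A 1 'hive.server2.thrift' conf/hive-site.xml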
In the bin directory, run the following statement to connect to Hive:
[hadoop@172 hive]$ cd bin
[hadoop@172 bin]$ ./beeline -u "jdbc:hive2://$host:$port" -n hadoop -p hadoop
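When you are done with a Beeline session, exit it with Beeline's built-in command:
0: jdbc:hive2://$host:$port> !quit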
You can run the same Hive-SQL statements in both Hive mode and Beeline mode; the following examples use Hive mode. Run the following command in Hive mode to view the databases:
hive> show databases;
OK
default
Time taken: 0.26 seconds, Fetched: 1 row(s)
Run the create command to create a database:
hive> create database test;  -- Create a database named test
OK
Time taken: 0.176 seconds
Run the use command to switch to the test database you just created:
hive> use test;
OK
Time taken: 0.176 seconds
Run the create command to create an internal table named hive_test in the test database:
hive> create table hive_test (a int, b string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Create a data table named hive_test and specify the column separator as ','
OK
Time taken: 0.204 seconds
These two lines form a single command: as long as you have not entered the terminating semicolon ";", a Hive-SQL command can span multiple lines of input. Finally, you can run the following command to see whether the table has been created successfully:
hive> show tables;
OK
hive_test
Time taken: 0.176 seconds, Fetched: 1 row(s)
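For comparison with the internal table just created, an external table keeps its data at a location you specify, and dropping it does not delete the underlying files. A minimal sketch, assuming an HDFS directory /$hdfspath/ext that you have prepared:
hive> create external table hive_test_ext (a int, b string)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/$hdfspath/ext';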
For data stored in HDFS, run the following command to import it into the table:
hive> load data inpath "/$hdfspath/hive_test.data" into table hive_test;
Here, $hdfspath is the path of your file in HDFS. Note that the import moves the file into the table's directory: once it completes, the source data file no longer exists at its original HDFS path.
For data stored in COS, run the following command to import it into the table:
hive> load data inpath "cosn://$bucketname/hive_test.data" into table hive_test;
Here, $bucketname is your bucket name plus the path of the data within the bucket. Likewise, after the import is completed, the source data file is removed from the import path in COS. You can also import data stored locally in the EMR cluster into Hive by running the following command:
hive> load data local inpath "/$localpath/hive_test.data" into table hive_test;
Here, $localpath is the path of the data stored locally in the EMR cluster. Unlike a load from HDFS or COS, a local load copies the file into Hive, so the local source file is retained after the import.
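Note that repeating any of the load commands above appends another copy of the data to the table. To replace the existing contents instead, use the standard overwrite form, e.g.:
hive> load data local inpath "/$localpath/hive_test.data" overwrite into table hive_test;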
Run the select command to perform a query and count the number of data rows in the table:
hive> select count(*) from hive_test;
Query ID = hadoop_20170316142922_967b5f0e-1f89-4464-bfa3-b6ed53273fc2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489458311206_9869, Tracking URL =
http://10.0.1.125:5004/proxy/application_1489458311206_9869/
Kill Command = /usr/local/service/hadoop/bin/hadoop job -kill job_1489458311206_9869
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-03-16 14:29:29,023 Stage-1 map = 0%, reduce = 0%
2017-03-16 14:29:34,208 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.87 sec
2017-03-16 14:29:40,404 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.79 sec
MapReduce Total cumulative CPU time: 5 seconds 790 msec
Ended Job = job_1489458311206_9869
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.79 sec
HDFS Read: 40974623 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 790 msec
OK
1000000
Time taken: 18.504 seconds, Fetched: 1 row(s)
The final output is 1000000.
Run the select command to query the first 10 rows in the table:
hive> select * from hive_test limit 10;
OK
30847 "31583"
14887 "32053"
19741 "16590"
8104 "20321"
29030 "32724"
27274 "5231"
10028 "22594"
924 "32569"
10603 "27927"
4018 "30518"
Time taken: 2.133 seconds, Fetched: 10 row(s)
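Any standard filter works the same way; for example, a sketch of a query restricted by a where clause on the int column a:
hive> select * from hive_test where a > 30000 limit 10;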
Run the drop command to delete a Hive table:
hive> drop table hive_test;
Moved: 'hdfs://HDFS/usr/hive/warehouse/hive_test' to trash at: hdfs://HDFS/user/hadoop/.Trash/Current
OK
Time taken: 2.327 seconds
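Finally, if you also want to remove the test database created earlier (including any remaining table definitions, such as the optional external table sketched above), you can drop it with cascade:
hive> drop database test cascade;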
For more information on Hive operations, please see the official documentation.