This document introduces four ways to connect to Hive in EMR: using the Hive client, Beeline, Java, and Python.
Development Preparation
Make sure you have activated Tencent Cloud and created an EMR cluster; for details, see Create Cluster. When creating the EMR cluster, select the Hive component on the software configuration page.
Note:
See Logging in to Linux Instance for how to log in to an EMR node. On the cluster details page, select Cluster Resources > Resource Management, click the ID of the corresponding node resource to open the CVM list, and click Log In on the right to access the instance via WebShell. The default username for a Linux instance is root, and the password is the one you set when creating the EMR cluster. After you enter the correct credentials, the command line interface appears.
All operations in this document are performed as the hadoop user. After logging in to the command line interface, switch users by running the su hadoop command.
Connecting to Hive
The Hive service is deployed on the Master node by default. You can also see Scaling Out a Cluster to deploy HiveServer2 on a Router node. This document uses the Master node as an example for connecting to the Hive service.
Method I: Using the Hive Client
Log in to the EMR cluster's Master node, switch to the hadoop user, and run the following command to enter the Hive command line:
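hive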
You can also use the -h parameter to get basic information about Hive commands.
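For example, the -e parameter lets you run a single HiveQL statement without entering the interactive shell, which is handy for quick checks:
hive -e "show databases;"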
Method II: Connecting to HiveServer2 via Beeline
Log in to the EMR cluster’s Master node and connect to Hive using the beeline command:
beeline -u "jdbc:hive2://${hs2_ip}:${hs2_port}" -n hadoop
Note:
1. ${hs2_ip} is the private IP address of the node where the HiveServer2 service is deployed. You can view it on the cluster details page under Cluster Services > Hive > Role Management.
2. ${hs2_port} is the port number of the HiveServer2 in the cluster, with a default value of 7001. You can view it on the cluster details page under Cluster Services > Hive > Configuration Management by checking the hive.server2.thrift.port setting in the hive-site.xml configuration file.
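Once connected, you can run HiveQL statements interactively at the Beeline prompt. You can also pass a single statement with the -e parameter, for example:
beeline -u "jdbc:hive2://${hs2_ip}:${hs2_port}" -n hadoop -e "show databases;"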
Method III: Connecting to Hive via Java
This document uses Maven to manage the example project. Maven is a project management tool that makes it easy to manage dependencies: it retrieves JAR files based on the configuration in the pom.xml file, so you don't have to add them manually.
First, download and install Maven locally and configure the Maven environment variables. If you use an IDE, configure Maven within the IDE. In your local shell, navigate to the directory where you want to create the project, for example /tmp/mavenWorkplace, and run the following command to create a Maven project:
mvn archetype:generate -DgroupId=${yourgroupID} -DartifactId=${yourartifactID} -DarchetypeArtifactId=maven-archetype-quickstart
Note:
1. ${yourgroupID} represents your package name, and ${yourartifactID} represents your project name.
2. maven-archetype-quickstart indicates that you are creating a Maven Java project. During the project creation process, some files need to be downloaded, so ensure a stable internet connection.
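For example, with the package and project names used later in this document (org.example and hive-test), the command would be:
mvn archetype:generate -DgroupId=org.example -DartifactId=hive-test -DarchetypeArtifactId=maven-archetype-quickstart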
Among the generated files, we primarily care about the pom.xml file and the java folder under main. The pom.xml file is used for dependency and packaging configuration, while the java folder contains your source code. First, add the Hive and Hadoop dependencies to the pom.xml file:
<dependencies>
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>${hive_version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop_version}</version>
    </dependency>
</dependencies>
Note:
${hive_version} is the version of Hive in your cluster, and ${hadoop_version} is the version of Hadoop in your cluster.
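If you are not sure which versions your cluster uses, you can check them on any cluster node:
hive --version
hadoop version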
Next, add the packaging and compilation plugins to the pom.xml file:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>utf-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Right-click under src > main > java to create a Java class, enter the class name (App.java in this example), and paste the sample code into the class:
package org.example;

import java.sql.*;

public class App {
    private static final String DRIVER_NAME = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        // Load the Hive JDBC driver.
        try {
            Class.forName(DRIVER_NAME);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }
        // Replace ${hs2_ip} and ${hs2_port} with your HiveServer2 address and port.
        Connection conn = DriverManager.getConnection("jdbc:hive2://${hs2_ip}:${hs2_port}/default", "hadoop", "");
        Statement stmt = conn.createStatement();
        // Recreate the test table.
        String tableName = "hive_test";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (key int, value string)");
        System.out.println("Create table success!");
        // Show the new table.
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }
        // Describe the table.
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }
        // Insert two rows and read them back.
        sql = "insert into " + tableName + " values (42,\"hello\"),(48,\"world\")";
        stmt.execute(sql);
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getInt(1) + "\t" + res.getString(2));
        }
        // Count the rows.
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        // Release resources.
        stmt.close();
        conn.close();
    }
}
The program first connects to the HiveServer2 service and creates a table named hive_test in the default database. It then inserts two rows into the table, prints the table's full contents, and outputs the row count.
If your Maven configuration is correct and the dependencies are imported successfully, the project can be compiled directly. In your local shell, navigate to the project directory and run the following command to package the project:
mvn clean package -DskipTests
Some files may be downloaded during the process; BUILD SUCCESS in the output indicates successful packaging. You can then find the packaged JAR file in the target folder under the project directory.
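Because this project uses the maven-assembly-plugin, the fat JAR carries the jar-with-dependencies suffix; you can list it like this (the exact name depends on your artifactId and version):
ls target/*-jar-with-dependencies.jar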
Uploading and Running the Program
First, upload the packaged JAR file to the EMR cluster using an scp or sftp tool. In your local shell, run the following command (enter yes and then the password when prompted):
scp ${localfile} root@${master_public_ip}:/home/hadoop
Note:
1. ${localfile} is the path and name of your local JAR file, and root is the CVM username.
2. ${master_public_ip} is the public IP address of your cluster's Master node. You can find it in the node information in the EMR console or in the CVM console.
Upload the packaged JAR file to the /home/hadoop/ directory in the EMR cluster. After the upload completes, you can list that directory on the EMR command line to verify that the file is present. Make sure to upload the JAR package with dependencies.
Log in to the EMR cluster, switch to the hadoop user, and navigate to the /home/hadoop/ directory. Run the program:
yarn jar ./hive-test-1.0-SNAPSHOT-jar-with-dependencies.jar org.example.App
Note:
hive-test-1.0-SNAPSHOT-jar-with-dependencies.jar is the path and name of your JAR file, and org.example.App is the fully qualified name (package name plus class name) of the Java class.
The results are as follows:
Create table success!
Running: show tables 'hive_test'
hive_test
Running: describe hive_test
key int
value string
Running: select * from hive_test
42 hello
48 world
Running: select count(1) from hive_test
2
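If you want to double-check the result outside the program, you can query the test table again with any of the connection methods above, for example with the Hive client from Method I:
hive -e "select * from hive_test;"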
Method IV: Connecting to Hive via Python
This section uses the PyHive project to connect to Hive through Python 3. First, log in to the EMR cluster's Master node, switch to the root user, and navigate to the /usr/local/service/hive/ directory. Run the following commands to install the required tools and dependencies:
pip3 install sasl
pip3 install thrift
pip3 install thrift-sasl
pip3 install pyhive
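To quickly verify that the dependencies were installed, you can try importing PyHive (the command exits silently on success and raises ImportError otherwise):
python3 -c "from pyhive import hive"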
After the installation completes, switch back to the hadoop user. Then create a Python file named hivetest.py in the /usr/local/service/hive/ directory and add the following code:
from pyhive import hive

# Replace ${hs2_host} and ${hs2_port} with your HiveServer2 host and port.
conn = hive.connect(host='${hs2_host}',
                    port='${hs2_port}',
                    username='hadoop',
                    password='hadoop',
                    database='default',
                    auth='CUSTOM')
tablename = 'HiveByPython'
cur = conn.cursor()

# List the existing tables in the default database.
print("\n")
print('show the tables in default: ')
cur.execute('show tables')
for i in cur.fetchall():
    print(i)

# Recreate the test table.
cur.execute('drop table if exists ' + tablename)
cur.execute('create table ' + tablename + ' (key int,value string)')
print("\n")
print('show the new table: ')
cur.execute("show tables '" + tablename + "'")
for i in cur.fetchall():
    print(i)

# Insert two rows and read them back.
print("\n")
print("contents from " + tablename + ":")
cur.execute('insert into ' + tablename + ' values (42,"hello"),(48,"world")')
cur.execute('select * from ' + tablename)
for i in cur.fetchall():
    print(i)
After connecting to HiveServer2, the program first lists the tables in the default database. It then creates a table named HiveByPython (Hive stores the name in lowercase, as hivebypython), inserts two rows into it, and prints the results.
Note:
1. ${hs2_host} is the host address of the HiveServer2 in the cluster. You can find it in the hive.server2.thrift.bind.host setting in the hive-site.xml configuration file under Cluster Services > Hive > Configuration Management on the cluster details page.
2. ${hs2_port} is the port number of the HiveServer2 in the cluster, with a default value of 7001. You can view it on the cluster details page under Cluster Services > Hive > Configuration Management by checking the hive.server2.thrift.port setting in the hive-site.xml configuration file.
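You can also check both values directly on the node; for example, assuming the configuration file sits in the default EMR Hive installation path:
grep -A 1 "hive.server2.thrift" /usr/local/service/hive/conf/hive-site.xml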
After the file is saved, run the program directly:
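python3 hivetest.py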
You will see the following output in the command line:
show the tables in default:
show the new table:
('hivebypython',)
contents from HiveByPython:
(42, 'hello')
(48, 'world')