January 13, 2021

Big Data Query Engine -Impala

M_CCM_CC

DURATION

15min

Tags

Topcoder Thrive

Impala is a new query system led by Cloudera. It provides SQL semantics and can query PB-level big data stored in Hadoop’s HDFS and HBase. Based on hive and using memory for calculation, taking into account the data warehouse, it has the advantages of real-time, batch processing, and multiple concurrency.

Impala consists of three processes:

Picture from official website: http://impala.apache.org

Architecture
(Picture from official website: http://impala.apache.org)

Impalad

Impalad is the main work computing process of Impala. It is responsible for receiving client requests, becoming a coordinator, and then parsing query requests, splitting them into different tasks and distributing them to other impalad node processes. After each impalad worker node process receives the request, it starts to perform a local query (such as querying the data-node of HDFS or the region server of H Base), and returns the query results to the central coordinator. The coordinator collects the requested data and returns it to the client.

The impalad service consists of three modules: Query Planner, Query Coordinator and Query Executor.

Query planner - receives query requests from SQL APP, ODBC, etc., and then converts the query into many sub-queries (execution plan), which is equivalent to an agent;
Query coordinator - distributes these sub-queries to each node;
Query executor - it is responsible for the execution of the sub-query, and then returns the results of the sub-query. These intermediate results are finally returned to the user after being aggregated.

State-stored

This process is responsible for collecting the health status of impalad process nodes in the cluster. It maintains connections with each node and continuously forwards the results of the health status to all impalad process nodes. An Impala cluster only needs one state-stored process node. When a node is unavailable, the process is responsible for passing this information to all impalad process nodes. When there is a new query, the request will not be sent to the unavailable impalad node.

State-stored is also allowed to hang up and will not affect the operation of the cluster because the impalad nodes will also maintain communication. But when the state-stored and a certain part of the impalad are down, there will be a problem. Because without state-stored, the impalad nodes cannot identify whether some impalads are down. Thus, they will still communicate with the impalad that is down.

Catalog

This is the meta-data change synchronization process. Since each impalad can act as a coordinator, when a node receives a data change, such as an alter command, how do other nodes know about it? It can be synchronized through the catalog. The change of each node is notified to the catalog process, and it is synchronized to other nodes. Each node maintains a copy of the latest meta-data information, so that you don’t worry about inconsistent data.
The specific process of the query:

1. Client submits tasks
The client sends a SQL query request to any impalad node, and a query Id is returned for subsequent client operations.

2. Generate query plan (single machine plan, distributed execution plan)
After the SQL is submitted to the impalad node, the analyzer performs SQL lexical analysis, syntax analysis, and semantic analysis in turn. The analyzer obtains meta-data from the MySQL meta-data database, and obtains the data address from the name node of HDFS to obtain all data nodes which store the data related to this query.

-Stand-alone execution plan: According to the analysis of the SQL statement in the previous step, the planner first generates a stand-alone execution plan. The execution plan is a tree composed of plan-nodes. During this process, some SQL will also be executed, such as join order change, predicate push down, etc.

-Distributed parallel physical plan: Convert a single-machine execution plan into a distributed parallel physical execution plan. The physical execution plan is composed of many fragments. There are data dependencies between the fragments. The original execution plan must be adding some exchange node and data stream sink information during processing.

Fragment: a sub task of the distributed execution plan generated by SQL;

Data stream sink: Transmit the current fragment output data to different nodes;

3. Task scheduling and distribution

The coordinator distributes the fragment (sub-task) to different impalad nodes for execution based on the data partition information. When the impalad node receives the fragment execution request, it will be executed by the executor.

4. Data dependency between fragments
The output of each fragment is sent to the next fragment through data stream sink. During the operation of the fragment, it continuously reports the current operation status to the coordinator node.

5. Summary of results
The query SQL usually needs a separate fragment for the summary of the results, which only runs on the coordinator node, summarizes the final execution results of multiple nodes, and converts them into result-set information.

6. Get results
The client calls the interface to obtain the result-set and reads the query result.

Installation and deployment of Impala

1. Installation environment preparation
You need to install Hadoop, hive in advance. Because Impala needs to reference hive’s dependent packages, the Hadoop framework needs to support C program access interface.

2. Download the installation package and dependent packages
Download to all rpm packages and make them as local yum sources for installation.

3. Install Impala
Master node installation

1
2
3
4
5
yum install impala - y
yum install impala - server - y
yum install impala - state - store - y
yum install impala - catalog - y
yum install impala - shell - y

Slave node installation
yum install impala-server -y

4. Modify the configuration of Hadoop and hive

Modify hive configuration
Configure on the master node machine.
vim /export/servers/hive/conf/hive-site.xml
Copy the hive installation package to the other two machines.

1
2
3
4
cd /
  export / servers /
  scp - r hive - 1.1 .0 - cdh5 .14 .0 / node02: $PWD
scp - r hive - 1.1 .0 - cdh5 .14 .0 / node01: $PWD

Modify Hadoop configuration
All nodes create the following folders.
mkdir -p /var/run/hdfs-sockets
Modify the hdfs-site.xml of all nodes, add the following configuration, and restart the HDFS cluster after the modification to take effect.
vim /export/servers/hadoop-2.6.0-cdh5.14.0/etc/hadoop/hdfs-site.xml
Create a connection between Hadoop and hive configuration file
The configuration directory of impala is ‘/etc/impala/con’. Under this path, you need to copy core-site.xml, hdfs-site.xml and hive-site.xml here, and all nodes execute the following command to create a link to the Impala configuration directory.

1
2
3
4
5
6
cp - r /
  export / servers / hadoop - 2.7 .5 / etc / hadoop / core - site.xml / etc / impala / conf / core - site.xml
cp - r /
  export / servers / hadoop - 2.7 .5 / etc / hadoop / hdfs - site.xml / etc / impala / conf / hdfs - site.xml
cp - r /
  export / servers / hive / conf / hive - site.xml / etc / impala / conf / hive - site.xml

5. Modify the configuration file of Impala
All nodes change the Impala default configuration file.

1
2
3
4
vim / etc /
  default / impala
IMPALA_CATALOG_SERVICE_HOST = node03
IMPALA_STATE_STORE_HOST = node03

All nodes create soft connection of MySQL driver package.

1
2
ln - s /
  export / servers / hive - 1.1 .0 - cdh5 .14 .0 / lib / mysql - connector - java - 5.1 .38.jar / usr / share / java / mysql - connector - java.jar

All nodes modify the Java path of bigtop.

1
2
3
vim / etc /
  default / bigtop - utils
export JAVA_HOME = /export/servers / jdk1 .8 .0_141

6. Start the Impala service
The master node starts the following three service processes.

1
2
3
service impala - state - store start
service impala - catalog start
service impala - server start

The slave node starts impala-server.
service impala-server start

Check whether the impala process exists.
ps -ef | grep impala

Enter the shell window, input impala-shell on the master node, and when you see the host name and port number, the installation and deployment is complete!

Impala for interactive SQL queries

The speed of querying data with Impala is very fast (the first time the map reduce calculation is performed, it may be a bit slow because the system needs to load the resources of the table. When the calculation is performed again later, the resources are already in the memory, and the calculation speed is faster.)

Through Hue’s impala query tool, you can quickly query and analyze the data in the data platform, as follows:

Reference:

http://impala.apache.org

Chat on Discord