January 23, 2021

The Relationship Between Impala and Hive and its Application in Business Intelligence Analysis

Impala and Hive are both data query tools built on Hadoop, each with different focus on adaptability. From the perspective of client use, Impala and Hive have a lot in common, such as data table metadata, ODBC/JDBC driver, SQL syntax, flexible file format, storage resource pool, and more. Hive is suitable for long-term batch query and analysis, and Impala is suitable for real-time interactive SQL query. Impala provides data analysts with big data analysis tools for quick experiments and verification of ideas.

You can use Hive for data conversion first, and then use Impala to perform fast data analysis on the resulting data set processed by Hive.

Impala’s optimization technology compared to Hive’s
image

  1. MapReduce is not used for parallel computing. Although MapReduce is a very good parallel computing framework, it is more batch-oriented rather than an interactive SQL execution. Compared with MapReduce Impala divides the entire query into an execution plan tree instead of a series of MapReduce tasks. After distributing the execution plan, Impala uses the pull method to obtain the results, and the result data is composed according to the execution tree.
    Convergence step by step reduces the overhead of writing intermediate results to disk and then reading data from disk. Impala uses services to avoid the startup overhead that needs to be executed every time a query is executed, that is, compared to Hive, there is no MapReduce startup.

  2. Use LLVM to generate running code, generate specific code for specific queries, and use Inline to reduce the overhead of function calls and speed up execution efficiency.

  3. Make full use of available hardware instructions (SSE4.2).

  4. Better IO scheduling. Impala knows the disk location of the data block to better utilize the advantages of multiple disks. At the same time, Impala supports direct data block reading and local code calculation checksum.

  5. The best performance can be obtained by selecting the appropriate data storage format (Impala supports multiple storage formats).

  6. For the maximum use of memory, the intermediate results are not written to disk, and are transmitted in a stream through the network in time.

Similarities and differences between Impala and Hive

Data storage: The same storage data pool supports storing data in HDFS and HBase.
Metadata: Both use the same metadata.
SQL interpretation processing: It is similar to generating execution plans through lexical analysis.

(1) Implementation plan:

Hive: Relying on the MapReduce execution framework, the execution plan is divided into a model of map->shuffle->reduce->map->shuffle->reduce. If a query will be compiled into multiple rounds of MapReduce, there will be more writing intermediate results

Due to the characteristics of the MapReduce execution framework itself, too many intermediate processes will increase the execution time of the entire query.

Impala: The execution plan is represented as a complete execution plan tree, which can more naturally distribute the execution plan to each Impalad to execute the query, instead of combining it into a pipeline map->reduce mode like Hive.

It proves that Impala has better concurrency and avoids unnecessary intermediate sort and shuffle.

(2) Data flow

Hive: Using the push method, each computing node actively pushes the data to the subsequent nodes after the calculation is completed.

Impala: Using the pull method, subsequent nodes actively ask for data from the previous node through getNext. In this way, the data can be streamed back to the client, and as long as one piece of data is processed, it can be displayed immediately, instead of waiting until all processing is completed, which is more in line with the use of SQL interactive query.

(3) Memory usage

Hive: If all data cannot be stored in the execution process, external storage will be used to ensure that the query can be executed sequentially. After each round of MapReduce is over, the intermediate results will be written to HDFS, also due to MapReduce execution.
Due to the characteristics of the architecture, the shuffle process also involves writing to local disks.

Impala: When the data cannot be stored in the internal storage, the current version 1.0.1 returns an error directly instead of using external storage. Future versions should be improved. This uses Impala to currently process query will be subject to certain restrictions, it is best still used in conjunction with Hive. Impala uses the network to transfer data between multiple stages, and there will be no disk write operations (except for insert) during the execution process.

(4) Scheduling

Hive: Task scheduling depends on Hadoop’s scheduling strategy.

Impala: Scheduling is done by you. At present, there is only one kind of scheduler, simple-schedule, which will try to meet the locality of the data, and the process of scanning data is as close as possible to the physical machine where the data itself is located. The scheduler is still relatively simple, as can be seen in SimpleScheduler::GetBackend, the load, network IO status and other factors are not considered for scheduling. However, Impala already has performance statistics analysis of the execution process, and it should be used in future versions.

(5) Fault tolerance

Hive: Rely on the fault tolerance of Hadoop.

Impala: During the query process, there is no fault-tolerant logic. If a failure occurs during execution, an error will be returned directly. This is related to the design of Impala, because Impala is positioned for real-time query, and if a query fails, it will be checked again.

The cost of checking again is very low. But on the whole, Impala is very fault-tolerant. All Impalads have a peer-to-peer structure. Users can submit queries to any Impalad. If an Impalad fails, all queries running on the above will fail, but the user can resubmit the query to be executed by other Impalads, without affecting the service. There is currently only one State Store, but when the State Store fails, it will not affect the service.

Each Impalad caches the State Store information, but the cluster state can no longer be updated, and execution tasks may be assigned to the failed Impalad for execution, causing this query to fail.

(6) Applicable surface

Hive: Complex batch query tasks and data conversion tasks.

Impala: Real-time data analysis. Because UDF is not supported, the problem domain that can be handled is limited. It can be used in conjunction with Hive to perform real-time analysis of the Hive result data set.

Application in Business Intelligence (BI)
Interactions with Impala and Hive mainly include Impala/Hive shell and Hue web UI. Access Hive and Impala through Hue, each of which has editors.

image

Both Hive and Impala are data query tools built on Hadoop, so how do they load and store data in actual applications? Take Impala as an example to talk about how to process data in BI. The whole process is as follows.

image

Data processing:

Extract data
First, extract the data from the data source to Hadoop through the Sqoop script. At the same time, a corresponding table is built in the Hadoop DB layer to store the extracted data. Refresh the Impala table.

Code:

1
2
3
4
5
6
7
8
#!/bin/sh

source / etc / profile
source~/.bash_profile
sqoop
import --connect jdbc: oracle: thin: @10 .8 .17 .23: 1521 / FINDM--username test--password = test123--target - dir / FINBI / DB / EBS_CUX_PA_CONTRACT_RPT_D_ALL--query "select SUBCON_TRACK_ID,ORG_ID,SEGMENT1,CONTRACT_NUMBER,PERIOD_NAME,ACCOUNT_CODE,ACCOUNT_NAME,ACCOUNT_AMOUNT,LAST_UPDATE_DATE,LAST_UPDATED_BY,CREATION_DATE,CREATED_BY,LAST_UPDATE_LOGIN,REQUEST_ID
from CUX_PA_CONTRACT_RPT_D_ALL @FINDM_EBS where\ $CONDITIONS "    --delete-target-dir  --fields-terminated-by '\001' --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims --lines-terminated-by '\n'  -m 1 [[$ ? -eq 0]] || exit
impala - shell - q 'refresh DLA_DB.EBS_CUX_PA_CONTRACT_RPT_D_ALL;'

The result is:
image

Create partition
Data partitioning will greatly improve the efficiency of a data query. The ODS layer in Hadoop uses ‘PARTITIONED BY’ to create partitioned tables.
Code:

1
2
INSERT OVERWRITE TABLE DLA_ODS.ODS_EBS_PA_CONTRACT_RPT_D_ALL PARTITION(DT) SELECT * , NOW(), NOW(), '2020-07-31'
DT FROM DLA_DB.EBS_CUX_PA_CONTRACT_RPT_D_ALL

The result is:
image

Logical calculation
Some aggregation and logical calculations are usually performed at the DWD layer. The aggregated data will be stored in the newly created table.

Code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
CREATE TABLE FINBI_DWD.DWD_TMP_CUX_PA_CONTRACT_ALL(
  SUBCON_TRACK_ID DOUBLE,
  PERIOD_NAME STRING,
  ORG_ID DOUBLE,
  ORG_NAME STRING,
  CONTRACT_HEADER_ID,
  CONTRACT_NUMBER STRING,
  CUSTOMER_ID DOUBLE,
  CUSTOMER_NUMBER STRING,
  CUSTOMER_NAME STRING,
  COMMENTS STRING,
  CONT_AMT DOUBLE,
  UNTAX_CONT_AMT DOUBLE,
  CON_GROSS_MARGIN DOUBLE,
  LAST_UPDATE_DATE TIMESTAMP,
  LAST_UPDATED_BY DOUBLE,
  LAST_UPDATE_LOGIN DOUBLE,
  CREATION_DATE TIMESTAMP,
  CREATED_BY DOUBLE,
  ORIG_CONTRACT_NUMBER STRING,
  CONTRACT_END_DATE TIMESTAMP,
  SOURCE_NAME STRING,
  CONT_AMOUNT DOUBLE W_INSERT_DATE TIMESTAMP,
  W_UPDATE_DATE TIMESTAMP,
)

PARTITIONED BY(dt STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\n'
WITH SERDEPROPERTIES('field.delim' = '\u0001', 'line.delim' = '\n', 'serialization.format' = '\u0001') STORED AS PARQUET LOCATION 'hdfs://nameservice1/FINBI/DWD/DWD_TMP_CUX_PA_CONTRACT_ALL';

The result is:
image

The DM layer in Hadoop generally stores the latest partitioned data. The way to build a table is similar to DWD.

image

Transfer data to Oracle or other BI system database
Transfer the processed data from the DM layer of Hadoop to Oracle or other BI system database through the Sqoop script. The BI system used in this article is Oracle BIEE.

Code:

1
2
3
4
5
6
7
#!/bin/bash

source / etc / profile
source~/.bash_profile
java - cp / opt / obiee / warehouseOracle - 0.0 .1 - SNAPSHOT - jar - with - dependencies.jar com.TruncateTable 10.8 .17 .23 1521 FINDM test test123 DM_F_COST_SYNERGISM1[[$ ? -eq 0]] || exit
sqoop
export --connect jdbc: oracle: thin: @ //10.8.17.23:1521/FINDM --username test  --password test123  --table DM_F_COST_SYNERGISM1 --columns "MONTH_WID,SCENES_DATA,PERIOD_DATA,ASSESS1_CENTER_WID,ASSESS2_CENTER_WID,CURRENCY_TYPE_WID,DATA_SIZE_WID,MARKET_CLASSIFY,INDUSTRY,DISTRIBUTION_TYPE_WID,FOCUS_FLAG,EXPENSE_ACCOUNT_WID,INCOMES_BUSINESS,GROSS_AMOUNT,EMPLOYEE_AMOUNT,OVERSEAS_COST,CHINA_COST,GROUP_COST,INSERT_DATE,UPDATE_DATE " --fields-terminated-by '\001'  --input-null-string '\\N' --input-null-non-string '\\N'   --hcatalog-database FINBI_DM --hcatalog-table DM_F_COST_SYNERGISM    --num-mappers 20

Show on the dashboard
Finally, display the data on the BI system interface.
image

Reference:
http://impala.apache.org/

Group 9
Group 9