

Database Operations Overview

DDL Operations

The Hive DDL operations are documented in Hive Data Definition Language.

An example of creating a database:

hive> create database if not exists user_db;

Viewing the database definition:
The describe command shows the database definition, including the database name, the database's directory on HDFS, and the HDFS user name.

hive> describe database user_db;
OK
user_db hdfs://bigdata-51cdh.chybinmy.com:8020/user/hive/wa

Listing the databases:

hive> show databases;   
OK
user_db
default

Creating a regular table

hive> create table if not exists userinfo  
> (
> userid int,
> username string,
> cityid int,
> createtime date
> )
> row format delimited fields terminated by '\t'
> stored as textfile;
OK
Time taken: 2.133 seconds

The example above shows one way to create a table: the table userinfo is created if it does not already exist. row format delimited fields terminated by '\t' specifies the delimiter between columns, and stored as textfile specifies textfile as the file storage format.
There are generally several ways to create a table:

  • create table: the approach used in the example above.

  • create table as select: creates a table automatically from the result of a query and inserts the query results into the new table.

  • create table like tablename1: clones a table, copying only the structure of tablename1 (see the sketch below). Copying and cloning tables are covered in more detail in the Hive data management section below.
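
As a quick sketch of the last two variants (the derived table names userinfo_2016 and userinfo_copy are hypothetical):

create table userinfo_2016 as
select userid, username, cityid, createtime
from userinfo
where createtime >= '2016-01-01';

create table userinfo_copy like userinfo;  -- copies only the schema of userinfo, no data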

DML Operations

The Hive DML operations are documented in Hive Data Manipulation Language.
Loading data into Hive
Loading into a regular table

The contents of a local text file can be bulk-loaded into a Hive table. The text file must match the table definition: the number of fields, the field order, and the column delimiter all have to agree.

The user_info table here is defined with \t as its column delimiter, so once the data is ready, copy the text file to the Hive client machine and run the load command.

load data local inpath '/home/hadoop/userinfodata.txt' overwrite into table user_info;

The local keyword means the source data file is on the local filesystem. The source file can also be on HDFS; in that case drop local, and the path after inpath is an HDFS file path of the form 'hdfs://namenode:9000/user/datapath'.

The overwrite keyword means that any existing data in the Hive table is overwritten. If overwrite is omitted, the default is to append the data.

Once the load finishes, the loaded data file can be seen on HDFS.
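
For example, a sketch of the HDFS variant described above, appending instead of overwriting (the namenode address is the placeholder from the previous paragraph):

load data inpath 'hdfs://namenode:9000/user/datapath/userinfodata.txt' into table user_info;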

Exporting data
Exporting data writes the data in a Hive table out to local files.

insert overwrite local directory '/home/hadoop/user_info.bak2016-08-22'
select * from user_info;

Drop the local keyword to export to HDFS instead.
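
A sketch of that HDFS variant (the target directory here is a hypothetical example):

insert overwrite directory '/tmp/user_info.bak2016-08-22'
select * from user_info;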

Inserting data
The insert ... select statement

The previous section on loading data into bucketed tables imported data from the user_leads_tmp table into the user_leads table using insert.

insert overwrite table user_leads select * from user_leads_tmp;

This inserts the results of a query into a table. The overwrite keyword replaces the existing data in the target table; if it is omitted, the data is appended instead.
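
A sketch of the append form, written with insert into instead of insert overwrite:

insert into table user_leads select * from user_leads_tmp;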

If the table being inserted into is a partitioned table, the statement looks like this:

insert overwrite table user_leads PARTITION (dt='2017-05-26') 
select * from user_leads_tmp;

HQL Syntax

SQL Operations
The Hive query operations are documented in Select.

Example Queries
Some example queries are shown below. They are available in build/dist/examples/queries.
More are available in the Hive sources at ql/src/test/queries/positive.

SELECTS and FILTERS

hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

selects column ‘foo’ from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table, local directory or HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory.
NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.

Partitioned tables must always have a partition selected in the WHERE clause of the statement.
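
For instance, a sketch against the same invites table that names the partition column ds explicitly in the projection:

hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';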

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

selects all rows from the pokes table into a local directory.

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

selects the sum of a column. The avg, min, or max can also be used. Note that for versions of Hive which don’t include HIVE-287, you’ll need to use COUNT(1) in place of COUNT(*).
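
As a sketch of the same pattern with those other aggregates, reusing the pc1 table from the last query above:

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/stats' SELECT AVG(a.pc), MIN(a.pc), MAX(a.pc) FROM pc1 a;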

GROUP BY

hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don’t include HIVE-287, you’ll need to use COUNT(1) in place of COUNT(*).

JOIN

hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
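
Hive also supports outer joins. As a sketch, the LEFT OUTER JOIN variant of the same query keeps pokes rows that have no match in invites:

hive> FROM pokes t1 LEFT OUTER JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;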

MULTITABLE INSERT

FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

The official simple, complete example

Simple Example Use Cases

MovieLens User Ratings

First, create a table with tab-delimited text file format:

CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and index of unzipped files):

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

or:

curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note: If the link to GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the user@hive.apache.org mailing list.

Unzip the data files:

unzip ml-100k.zip

And load u.data into the table that was just created:

LOAD DATA LOCAL INPATH '<path>/u.data'
OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for older versions of Hive which don’t include HIVE-287, you’ll need to use COUNT(1) in place of COUNT(*).
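
With such a version, the same count is written as:

SELECT COUNT(1) FROM u_data;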

Now we can do some complex data analysis on the table u_data:

Create weekday_mapper.py:

# weekday_mapper.py: maps each tab-separated (userid, movieid, rating, unixtime)
# row on stdin to (userid, movieid, rating, weekday) on stdout.
# Note: Python 2 print statement, as in the official Hive example.
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([userid, movieid, rating, str(weekday)])

Use the mapper script:

CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).


