Data Cleaning: Region Dimension

Published: 2022-07-03. Source site: 脚本宝典

1. Data import

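The task is to import the AA_GXJSQYDC2019 data from the sample table file into the Hive data warehouse, and to import the region dimension table into the warehouse as well.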

(1) Rename the files, save them with UTF-8 encoding, and upload them to the local file system of the Hive server.
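
The re-encoding part of this step can be scripted. The following is a minimal sketch in Python, assuming the source file is encoded as GBK (a common default for spreadsheets exported on Chinese-locale Windows, but only an assumption here). The function name is illustrative.

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Copy a text file to dst_path as UTF-8 with LF line endings.

    src_encoding is an assumption; change it to match the real file.
    """
    # newline=None (the default) turns \r\n and \r into \n while reading.
    with open(src_path, "r", encoding=src_encoding, newline=None) as fin, \
         open(dst_path, "w", encoding="utf-8", newline="\n") as fout:
        for line in fin:
            fout.write(line)
```

For example, `reencode_to_utf8("AA_GXJSQYDC2019.csv", "aa_2019.csv")` would produce the UTF-8 file that is loaded later; both file names here are placeholders for whatever the exported file is actually called.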

(2) Create the table aa_2019 in Hive:

create table aa_2019(
ID String,
QA04 String,
QA05 String,
QA07 String,
QA15 String,
QA19 String,
Hangye String,
QB03 String,
QB03ONE String,
QB03TWO String,
QB03_1 String,
QB06 String,
QB16 String,
QB16V String,
Gaoxin String,
QB16_1 String,
QB16_1V String,
QC02 String,
QC05_0 String,
QC24 String,
QC40 String,
QD01 String,
QD28 String,
QJ09 String,
QJ20 String,
QJ55 String,
QJ74 String,
Diyu String,
SYEAR String
)ROW format delimited fields terminated by ',' STORED AS TEXTFILE;

Load the local file into Hive:

load data local inpath '/kkb/install/apache-hive-3.1.2-bin/testdate/aa_2019.csv' into table aa_2019;

Check that the data loaded correctly:

select * from aa_2019 limit 10;

(3) Create the table diyu in Hive:

create table diyu(
dm String,
dmms String
)ROW format delimited fields terminated by ',' STORED AS TEXTFILE;

Load the local file into Hive:

load data local inpath '/kkb/install/apache-hive-3.1.2-bin/testdate/diyu.csv' into table diyu;

Check that the data loaded correctly:

select * from diyu limit 10;

2. Data cleaning

Clean the region dimension field against the standard dimension table.

(1) Skip the first row (the header line) of the diyu table when reading:

ALTER TABLE diyu set TBLPROPERTIES ('skip.header.line.count'='1');
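
This property does not delete anything from the data file; Hive ignores the first line of each file when it reads the table. The effect can be pictured with a small Python sketch. The function name and the sample rows are illustrative only and are not part of Hive.

```python
def read_delimited(lines, skip_header_lines=0, delimiter=","):
    """Yield the fields of each line, ignoring the first skip_header_lines lines.

    Mimics a comma-delimited text table: a plain split, no quote handling.
    """
    for index, line in enumerate(lines):
        if index < skip_header_lines:
            continue  # header line: skipped at read time, still present in the input
        yield line.rstrip("\r\n").split(delimiter)


# Made-up sample standing in for diyu.csv.
sample = ["dm,dmms\n", "001,RegionA\n", "002,RegionB\n"]
rows = list(read_delimited(sample, skip_header_lines=1))
# rows == [["001", "RegionA"], ["002", "RegionB"]]
```

The aa_2019 table is created without this property, so a header line in its file, if there is one, is read as an ordinary row.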

 

(2) Create the table aa_19 to hold the data after the region dimension has been cleaned:

create table aa_19(
ID String, QA04 String, QA05 String,
QA07 String, QA15 String, QA19 String,
Hangye String, QB03 String, QB03ONE String,
QB03TWO String, QB03_1 String, QB06 String,
QB16 String, QB16V String, Gaoxin String,
QB16_1 String, QB16_1V String, QC02 String,
QC05_0 String, QC24 String, QC40 String,
QD01 String, QD28 String, QJ09 String,
QJ20 String, QJ55 String, QJ74 String,
Diyu String, SYEAR String
)ROW format delimited fields terminated by ',' STORED AS TEXTFILE;

(3) Clean the data by joining aa_2019 to diyu on the region code and writing the code followed by its description into Diyu:

insert into table aa_19
select a.ID, a.QA04, a.QA05, a.QA07, a.QA15, a.QA19,
a.Hangye, a.QB03, a.QB03ONE, a.QB03TWO, a.QB03_1, a.QB06,
a.QB16, a.QB16V, a.Gaoxin, a.QB16_1, a.QB16_1V, a.QC02,
a.QC05_0, a.QC24, a.QC40, a.QD01, a.QD28, a.QJ09,
a.QJ20, a.QJ55, a.QJ74,
concat(a.QA19, d.dmms) as Diyu,
a.SYEAR
from aa_2019 a join diyu d on (a.QA19 = d.dm);
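
The statement is an inner join, so a row of aa_2019 whose QA19 value has no matching dm in diyu does not reach aa_19. The sketch below shows the same logic in Python on made-up rows. It assumes each dm code appears only once in diyu; the function name and sample values are illustrative.

```python
def clean_region(fact_rows, region_rows):
    """Inner-join fact rows to region rows on QA19 == dm and build the Diyu value."""
    # Assumption: dm is unique, so a dict lookup behaves like the join.
    code_to_name = {region["dm"]: region["dmms"] for region in region_rows}
    cleaned = []
    for row in fact_rows:
        name = code_to_name.get(row["QA19"])
        if name is None:
            continue  # no matching code: the inner join drops this row
        out = dict(row)
        out["Diyu"] = row["QA19"] + name  # same as concat(QA19, dmms)
        cleaned.append(out)
    return cleaned


# Made-up sample rows, reduced to the columns that matter here.
facts = [{"ID": "1", "QA19": "001"}, {"ID": "2", "QA19": "999"}]
regions = [{"dm": "001", "dmms": "RegionA"}]
result = clean_region(facts, regions)
# result == [{"ID": "1", "QA19": "001", "Diyu": "001RegionA"}]
```

If the unmatched rows should be kept for inspection, a left join in the Hive statement would keep them with a NULL Diyu value, since concat returns NULL when an argument is NULL.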

(4) Check the cleaning result:

select * from aa_19 limit 10;

3. Data export

(1) Create the table in MySQL:

create table aa_19(
ID varchar(255),
QA04 varchar(255),
QA05 varchar(255),
QA07 varchar(255),
QA15 varchar(255),
QA19 varchar(255),
Hangye varchar(255),
QB03 varchar(255),
QB03ONE varchar(255),
QB03TWO varchar(255),
QB03_1 varchar(255),
QB06 varchar(255),
QB16 varchar(255),
QB16V varchar(255),
Gaoxin varchar(255),
QB16_1 varchar(255),
QB16_1V varchar(255),
QC02 varchar(255),
QC05_0 varchar(255),
QC24 varchar(255),
QC40 varchar(255),
QD01 varchar(255),
QD28 varchar(255),
QJ09 varchar(255),
QJ20 varchar(255),
QJ55 varchar(255),
QJ74 varchar(255),
Diyu varchar(255),
SYEAR varchar(255)
);

(2) Export the table to MySQL with Sqoop:

bin/sqoop export \
--connect "jdbc:mysql://node01:3306/hive2?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password wyhhxx \
--table aa_19 \
--num-mappers 1 \
--export-dir /user/hive/warehouse/aa_19 \
--input-fields-terminated-by ","
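
With the settings used here, each line is split on plain commas, so a field value that itself contains a comma shifts every later column. A minimal pre-check, sketched in Python, flags lines whose field count differs from the 29 columns of aa_19. The function name is illustrative, and how the data files are read from the export directory is left open.

```python
def find_bad_lines(lines, expected_fields=29, delimiter=","):
    """Return (line_number, field_count) for every line with an unexpected field count."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        field_count = len(line.rstrip("\r\n").split(delimiter))
        if field_count != expected_fields:
            bad.append((line_no, field_count))
    return bad


# Made-up three-column sample: line 2 is short, line 3 has an extra comma.
sample_lines = ["a,b,c\n", "a,b\n", "a,b,c,d\n"]
problems = find_bad_lines(sample_lines, expected_fields=3)
# problems == [(2, 2), (3, 4)]
```

An empty result means every line splits into the expected number of fields; it does not prove that the values themselves are correct.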

 

(3) Export result: query the aa_19 table in MySQL to confirm that the rows arrived.

4. Data visualization
