curl 获取json数据导入hive外部双分区表解决科学计数法和特殊字符问题 ——首发

jzy3711

13746人浏览 · 2018-11-19 15:08:50

jzy3711 · 2018-11-19 15:08:50 发布

shell 脚本

hadoop dfs -mkdir /jzytest/hivePublicOpinionList/

#创建hive外部双分区表
hive -e 'create external table if not exists jzyPublicOpinionListjson
(
channel string,
publishTime string,
source string,
title string,
url string
)
partitioned by (topic_id string,dt string)
row format serde "org.apache.hive.hcatalog.data.JsonSerDe"
LOCATION "hdfs:///jzytest/hivePublicOpinionList/";'
channel_array=("21" "104" "303")
dt=$(date "+%Y-%m-%d")
for topic_id in $(cat jzyhive.out)
do
   hadoop dfs -mkdir -p /jzytest/hivePublicOpinionList/topic_id=${topic_id}/dt=${dt}
   for channel in ${channel_array[*]}
   do
   curl -X POST -d '{
       "version":"1.0",
       "caller":"pxxx",
       "componentName":"pxxx",
       "password":"pxxx",
       "callee":"tencent_poa",
       "eventId":293194931,
       "seqId":"1481860426.8432320581878515392",
       "spanId":"pxxx",
       "timestamp":1481860426,
       "interface":{
           "interfaceName":"qcloud.supervision.getPublicOpinionList",
           "para":{
               "topic_id": "'$topic_id'",
               "channel": "'$channel'",
               "page": 1,
               "page_size": 30
               }
           }
       }' http://2xx.1xx.xx2.xx4:9090/interface | awk -v head=":[" -v tail="]}" '{print substr($0, index($0,head)+length(head),index($0,tail)-index($0,head)-length(head))}' > jzytopic_id${topic_id}channel${channel}.out
   sed -i "s/},/}\n/g" jzytopic_id${topic_id}channel${channel}.out
   sed -i "s/《/<</g" jzytopic_id${topic_id}channel${channel}.out
   sed -i "s/》/>>/g" jzytopic_id${topic_id}channel${channel}.out
   hadoop dfs -put jzytopic_id${topic_id}channel${channel}.out /jzytest/hivePublicOpinionList/topic_id=${topic_id}/dt=${dt}/
   rm -rf jzytopic_id${topic_id}channel${channel}.out
   done
   hive -e "alter table jzyPublicOpinionListjson add partition (topic_id='$topic_id',dt='$dt');"
done
第一次标红，是数据格式是{*“data”[{*}]}获取{*} *的内容

第一次sed（未红）是数据的分隔符是“，”号放在hive中会出问题，而hive中默认可以按行读取所以应该换成“\n”，建表事就可以不用考虑分隔符问题

第二次标红，是对书名号进行特殊处理否则会出现以下错误：

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: Start token not found where expected

第三次标红，是防止加载外部分区表变成科学计数法。

总结：

1.获取的json数据要用awk过滤成需要的数据

2.数据中的特殊字符需要出，比如书名号

3.数据中的字段，需要与hive表字段对应，对特殊字段需要处理，如local,如对应不上数据会显示NULL

4.分区字符串如果是纯数字，需要在alter加载分区的时候加“''”单引号，不然会出现科学计数法，例如topic_id=14538744803109468122

最后附上成功后图片，有图有真相

GitCode 开源社区

新一代开源开发者平台 GitCode，通过集成代码托管服务、代码仓库以及可信赖的开源组件库，让开发者可以在云端进行代码托管和开发。旨在为数千万中国开发者提供一个无缝且高效的云端环境，以支持学习、使用和贡献开源项目。

更多推荐

GitCode 9 月：小程序新增三大模型专属频道；百大开源项目结果公布；GitCodeAI 社区战略升级全景发布会圆满召开

GitCode 开源社区

混元世界模型 1.1 在 GitCode 开源！秒级生成 3D 世界，开发者速来体验！

GitCode 开源社区

华为盘古 Ultra-MoE-718B-V1.1 正式开放下载！

👉Int8量化版本也同步开源：https://gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1-Int8还记得9月底刷屏的「华为盘古718B」吗？当时凭借的训练哲学，在 SuperCLUE 榜单中一举冲至开源模型第三，成为业界焦点。openPangu-Ultra-MoE-718B-V1.1 是基于昇腾 NPU 训练的大规模混合专家（Mo