hive之Json解析(普通Json和Json数组)

json

适用于现代 C++ 的 JSON。

项目地址：https://gitcode.com/gh_mirrors/js/json

免费下载资源

从零开始的数据猿

5518人浏览 · 2022-05-04 08:24:41

从零开始的数据猿 · 2022-05-04 08:24:41 发布

一、数据准备

现准备原始json数据(test.json)如下：

{“movie”:“1193”,“rate”:“5”,“timeStamp”:“978300760”,“uid”:“1”}
{“movie”:“661”,“rate”:“3”,“timeStamp”:“978302109”,“uid”:“1”}
{“movie”:“914”,“rate”:“3”,“timeStamp”:“978301968”,“uid”:“1”}
{“movie”:“3408”,“rate”:“4”,“timeStamp”:“978300275”,“uid”:“1”}
{“movie”:“2355”,“rate”:“5”,“timeStamp”:“978824291”,“uid”:“1”}
{“movie”:“1197”,“rate”:“3”,“timeStamp”:“978302268”,“uid”:“1”}
{“movie”:“1287”,“rate”:“5”,“timeStamp”:“978302039”,“uid”:“1”}
{“movie”:“2804”,“rate”:“5”,“timeStamp”:“978300719”,“uid”:“1”}
{“movie”:“594”,“rate”:“4”,“timeStamp”:“978302268”,“uid”:“1”}

现在将数据导入到hive中，并且最终想要得到这么一个结果：

可以使用：内置函数（get_json_object）或者自定义函数完成

二、get_json_object(string json_string, string path)

返回值：String

说明：解析json的字符串json_string，返回path指定的内容。如果输入的json字符串无效，那么返回NUll，这个函数每次只能返回一个数据项。

 0: jdbc:hive2://hadoop3:10000> select get_json_object('{"movie":"594","rate":"4","timeStamp":"978302268","uid":"1"}','$.movie');

1、创建json表并将数据导入

 0: jdbc:hive2://master:10000> create table json(data string);
No rows affected (0.572 seconds)
0: jdbc:hive2://master:10000> **load data local inpath '/home/hadoop/json.txt' into table json;** 
No rows affected (1.046 seconds)

 0: jdbc:hive2://master:10000> select get_json_object(data,'$.movie') as movie from json;

三、json_tuple(jsonStr, k1, k2, …)

参数为一组键k1，k2，。。。。。和json字符串，返回值的元组。该方法比get_json_object高效，因此可以在一次调用中输入多次键

 0: jdbc:hive2://master:10000> select b.b_movie,b.b_rate,b.b_timeStamp,b.b_uid from json a lateral view 
json_tuple(a.data,'movie','rate','timeStamp','uid') b as b_movie,b_rate,b_timeStamp,b_uid;

注意点：

json_tuple相当于get_json_object的优势就是一次可以解析多个Json字段。但是如果我们有个Json数组，这两个函数都无法处理

四、Json数组解析

1、使用Hive自带的函数解析Json数组

Hive的内置的explode函数，explode()函数接收一个 array或者map 类型的数据作为输入，然后将 array 或 map 里面的元素按照每行的形式输出。其可以配合 LATERAL VIEW 一起使用。


hive> select explode(array('A','B','C'));
OK
A
B
C
Time taken: 4.879 seconds, Fetched: 3 row(s)
hive> select explode(map('A',10,'B',20,'C',30));
OK
A       10
B       20
C       30
Time taken: 0.261 seconds, Fetched: 3 row(s)

这个explode函数和我们解析json数据是有关系的，我们可以使用explode函数将json数组里面的元素按照一行一行的形式输出：

hive> SELECT explode(split(regexp_replace(regexp_replace('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]', '\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;'));
OK
{"website":"www.baidu.com","name":"百度"}
{"website":"google.com","name":"谷歌"}
Time taken: 0.14 seconds, Fetched: 2 row(s)

说明:


SELECT explode(split(
   regexp_replace(
       regexp_replace(
           '[
               {"website":"www.baidu.com","name":"百度"},
               {"website":"google.com","name":"谷歌"}
           ]', 
           '\\[|\\]',''),  --将 Json 数组两边的中括号去掉

                '\\}\\,\\{'    --将 Json 数组元素之间的逗号换成分号
               ,'\\}\\;\\{'),

                '\\;'));    --以分号作为分隔符

结合 get_json_object 或 json_tuple 来解析里面的字段：

hive> select json_tuple(json, 'website', 'name') from (SELECT explode(split(regexp_replace(regexp_replace('[{"website":"www.baidu.com","name":"百},{"website":"google.com","name":"谷歌"}]', '\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) as json) test;
OK
www.baidu.com   百度
google.com      谷歌
Time taken: 0.283 seconds, Fetched: 2 row(s)

2、自定义函数解析JSON数组

虽然可以使用Hive自带的函数类解析Json数组，但是使用起来有些麻烦。Hive提供了强大的自定义函数（UDF）的接口，我们可以使用这个功能来编写解析JSON数组的UDF。具体测试过程如下：

 <dependencies>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONException;
import java.util.ArrayList;

@Description(name = "json_array",
        value = "_FUNC_(array_string) - Convert a string of a JSON-encoded array to a Hive array of strings.")
public class JsonArray extends UDF{
        public ArrayList<String> evaluate(String jsonString) {
            if (jsonString == null) {
                return null;
            }
            try {
                JSONArray extractObject = new JSONArray(jsonString);
                ArrayList<String> result = new ArrayList<String>();
                for (int ii = 0; ii < extractObject.length(); ++ii) {
                    result.add(extractObject.get(ii).toString());
                }
                return result;
                } catch (JSONException e) {
                return null;
                } catch (NumberFormatException e) {
                return null;
            }
        }

}

将上面的代码进行编译打包,jar包名为：HiveJsonTest-1.0-SNAPSHOT.jar

 hive> add jar /mnt/HiveJsonTest-1.0-SNAPSHOT.jar;
Added [/mnt/HiveJsonTest-1.0-SNAPSHOT.jar] to class path
Added resources: [/mnt/HiveJsonTest-1.0-SNAPSHOT.jar]

 hive> create temporary function json_array as 'JsonArray';
OK
Time taken: 0.111 seconds

 

hive> select explode(json_array('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com"name":"谷歌"}]'));
OK
{"website":"www.baidu.com","name":"百度"}
{"website":"google.com","name":"谷歌"}
Time taken: 10.427 seconds, Fetched: 2 row(s)
 

 

hive> select json_tuple(json, 'website', 'name') from (SELECT explode(json_array('[{"website":"www.baidu.com","name":"百度"},{"website":"google.com","name":"谷歌"}]')) as json) test;
OK
www.baidu.com   百度
google.com      谷歌
Time taken: 0.265 seconds, Fetched: 2 row(s)

3、自定义函数解析json对象

package com.laotou;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;
import org.json.JSONTokener;

/**
 *
 *  add jar jar/bdp_udf_demo-1.0.0.jar;
 * create temporary function getJsonObject as 'com.laotou.JsonObjectParsing';
 * Json对象解析UDF
 * @Author: 
 * @Date: 2019/8/9
 */
public class JsonObjectParsing extends UDF {
    public static String evaluate(String jsonStr, String keyName) throws JSONException {
        if(StringUtils.isBlank(jsonStr) || StringUtils.isBlank(keyName)){
            return null;
        }
        JSONObject jsonObject = new JSONObject(new JSONTokener(jsonStr));
        Object objValue = jsonObject.get(keyName);
        if(objValue==null){
            return null;
        }
        return objValue.toString();
    }
}

3、1准备数据

3、2测试

GitHub 加速计划 / js / json

下载

适用于现代 C++ 的 JSON。

最近提交(Master分支：2 个月前 )

960b763e 5 个月前

8c391e04 8 个月前

GitCode 开源社区

旨在为数千万中国开发者提供一个无缝且高效的云端环境，以支持学习、使用和贡献开源项目。

更多推荐

[转载]在Windows环境下安装GNU Radio

转自：在Windows环境下安装GNURadio_恐弱智_新浪博客GNU Radio是用Python开发的，大部分开源的工程能够在Linux环境下运行良好，而Windows下却运行的很勉强，而且安装配置都很复杂。GNU Radio算是个例外了，不光提供了Windows的二进制安装，还有比较详细的说明。我是Python小白，所以折腾了好久才弄好，特意记录下来，免得以后再装还折腾。GNU Radio的

GitCode 开源社区

centOS 8 使用dnf安装Docker

DNF是什么？CentOS 8使用YUM软件包管理器版本v4.0.4。现在，该版本使用DNF(已删除YUM)。DNF是软件包管理器。它会在Linux发行版上安装，执行更新并删除软件包。使用DNF安装Docker跳过具有损坏依赖性的程序包一个有效的解决方案是使您的CentOS 8系统使用以下--nobest命令安装最符合条件的版本：sudo dnf install docker...

GitCode 开源社区

定时同步数据库表(mysql+linux+crontab)

sync.sh里面的参数需要改变，ip/username/password/database/tablesync.sh#!/bin/sh# Please change the IP and password of the data source db.# Then change the table name.filename=/home/nington/db/$(date +%Y-%m