Presto: 是谁偷走了我的一天

举报
孙中明 发表于 2022/08/08 23:32:33 2022/08/08
【摘要】 Presto (including PrestoDB and PrestoSQL later re-branded to Trino) is a distributed query engine...

Presto (including PrestoDB and PrestoSQL later re-branded to Trino) is a distributed query engine for big data using the SQL query language. Its architecture allows users to query data sources such as Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB and Teradata. One can even query data from multiple data sources within a single query. Presto is community driven open-source software released under the Apache License.

起因

最近在数据迁移的过程中遇到一个好玩的bug, 当我们要计算两个日期的间隔的是时候需要调用一些数据分析组件内置的函数, 如下所示: 代码表示的含义很简单 就是计算8月9日和8月8日之间相隔天数, 但是这样的计算的结果竟然不相同: 竟然出现同样的时间段里hive presto多一天这种情况. 即在hive分析中得到是相隔一天, 但是在presto分析中得到数据只有0天 .

(组件版本号: Hive:2.3.7 Presto:0.240.1)

---hive: 1
select datediff(
    '2022-08-09',
    '2022-08-08 08:00:01.0' 
   
  ) as b
---presto: 0
select
  date_diff(
    'day',
   cast('2022-08-08 08:00:01.0' as timestamp),
   cast('2022-08-09'  as timestamp)
  ) as b


  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14

探究

遇事不決 量子力学; 问题是小问题, 但是假如运用到生活的例子是你发工资的时候计算的时长, 这不就少计算一天. 为了打工人, 一定要找到是谁偷走了presto的一天.
首先, 看了找了一下https://github.com/apache/hivedatediff()

/**
 * UDFDateDiff.
 *
 * Calculate the difference in the number of days. The time part of the string
 * will be ignored. If dateString1 is earlier than dateString2, then the
 * result can be negative.
 *
 */
@Description(name = "datediff",
    value = "_FUNC_(date1, date2) - Returns the number of days between date1 and date2",
    extended = "date1 and date2 are strings in the format "
        + "'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'. The time parts are ignored."
        + "If date1 is earlier than date2, the result is negative.\n"
        + "Example:\n "
        + "  > SELECT _FUNC_('2009-07-30', '2009-07-31') FROM src LIMIT 1;\n"
        + "  1")


  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17

官方文档说的很清楚, yyyy-MM-dd HH:mm:ss or yyyy-MM-dd. The time parts are ignored. 在计算的过程中, 会自动截取时间, 只用日期进行计算. 那么以此类推, 是不是 presto 没有进行截取呢? 可能吗

select cast('2022-08-09'  as timestamp) as b

-- 2022-08-09 00:00:00.000
-- 

  
 
  • 1
  • 2
  • 3
  • 4

果然, 2022-08-09 转成timestamp日期格式就是 2022-08-09 00:00:00.000 时间直接用午夜零点补齐了. 如果不转timestamp 直接用日期可以吗?



select
  date_diff(
    'day',
   cast('2022-08-08 08:00:01.0' as timestamp),
   '2022-08-09'
  ) as b


  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9

得报错了,貌似是格式不对,看来又得去官方源码找了, 从报错信息来看两者的日期类型需要对齐保持一样.

Unexpected parameters (varchar(3), timestamp, varchar(10)) for function date_diff. Expected:
date_diff(varchar(x), timestamp, timestamp) ,
date_diff(varchar(x), date, date) ,
date_diff(varchar(x), time, time) ,
date_diff(varchar(x), time with time zone, time with time zone) ,
date_diff(varchar(x), timestamp with time zone, timestamp with time zone)

https://github.com/prestodb/presto


   ......

    @Description("difference of the given times in the given unit")
    @ScalarFunction("date_diff")
    @LiteralParameters("x")
    @SqlType(StandardTypes.BIGINT)
    public static long diffTimestamp(
            SqlFunctionProperties properties,
            @SqlType("varchar(x)") Slice unit,
            @SqlType(StandardTypes.TIMESTAMP) long timestamp1,
            @SqlType(StandardTypes.TIMESTAMP) long timestamp2)
    {
        if (properties.isLegacyTimestamp()) {
            return getTimestampField(getChronology(properties.getTimeZoneKey()), unit).getDifferenceAsLong(timestamp2, timestamp1);
        }

        return getTimestampField(UTC_CHRONOLOGY, unit).getDifferenceAsLong(timestamp2, timestamp1);
    }
.....


  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21

从中定位到了date_diff 函数, 这个其中当是timestamp 的时候, 会调用这个 getDifferenceAsLong



    /**
     * Computes the difference between two instants, as measured in the units
     * of this field. Any fractional units are dropped from the result. Calling
     * getDifference reverses the effect of calling add. In the following code:
     *
     * <pre>
     * long instant = ...
     * long v = ...
     * long age = getDifferenceAsLong(add(instant, v), instant);
     * </pre>
     *
     * The value 'age' is the same as the value 'v'.
     *
     * @param minuendInstant the milliseconds from 1970-01-01T00:00:00Z to
     * subtract from
     * @param subtrahendInstant the milliseconds from 1970-01-01T00:00:00Z to
     * subtract off the minuend
     * @return the difference in the units of this field
     */
    public abstract long getDifferenceAsLong(long minuendInstant, long subtrahendInstant);


  
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23

以上, 可以得出, 当不同date 类型 转成 timestamp的时候 时间部分直接用 00:00:00.000 替代, 然后调用的getDifferenceAsLong 是通过和 1970-01-01T00:00:00Z 进行比较 然后返回数值, 当其数值不大于24小时的部门, 自然不会计算当做一天. 所以当我们在计算时间间隔的时候 , 有时候精确度不需要那么苛刻的时候 , 建议 date_diff(varchar(x), date, date) 参数进行计算.

由于时间仓促、水平有限,其中有许多不足之处在所难免,敬请各位大佬批评指正,给予多多留言指导。

[ 1 ] https://github.com/prestodb/presto
[ 2 ] https://github.com/apache/hive
[ 3 ] https://prestodb.io/

文章来源: hiszm.blog.csdn.net,作者:孙中明,版权归原作者所有,如需转载,请联系作者。

原文链接:hiszm.blog.csdn.net/article/details/126193242

【版权声明】本文为华为云社区用户转载文章,如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。