详解GaussDB for DWS数据库查询关联的广播和重分布

举报
carl@1234 发表于 2020/06/11 16:36:42 2020/06/11
【摘要】 当两张表关联的时候,如果一张表关联的不是分布列,那么就会发生表的广播或者重分布,将数据移到一个节点上进行关联,从而获得数据。这里详细的介绍什么时候广播,什么时候重分布。 分布式的关联有两种:(1) 单DN关联。关联键与分布列一致,只要在单个库关联后得到结果即可。(2)跨DN关联。关联键与分布列不一致,需要数据重分布,转换成单库关联查询,从而实现表的关联。 不管是采用广播还是重分布首先必须保证...

当两张表关联的时候,如果一张表关联的不是分布列,那么就会发生表的广播或者重分布,将数据移到一个节点上进行关联,从而获得数据。这里详细的介绍什么时候广播,什么时候重分布。

 分布式的关联有两种:

(1) DN关联。关联键与分布列一致,只要在单个库关联后得到结果即可。

2)跨DN关联。关联键与分布列不一致,需要数据重分布,转换成单库关联查询,从而实现表的关联。

 不管是采用广播还是重分布首先必须保证结果的正确性,其次才考虑性能。下面将从内连接、左连接、全连接三中连接情况进行分析:

                测试表ab的定义

表名                字段              分布列       记录数

 a       seqnum bigint ,rd1 int       seqnum        M

 b       seqnum bigint, rd1 int       seqnum        N

 create table a(seqnum int, rd1 int);

create table b(seqnum int, rd1 int); 

内连接

1.1 情况1关联列为分布列

    select * from  a,b where a.seqnum=b.seqnum;

    carl=# explain select * from  a,b where a.seqnum=b.seqnum;

                                QUERY PLAN                                

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=117.74..737.59 rows=10000 width=24)

   Node/s: All datanodes

   ->  Hash Join  (cost=113.74..399.18 rows=10000 width=24)

         Hash Cond: (b.seqnum = a.seqnum)

         ->  Seq Scan on b  (cost=0.00..236.17 rows=131000 width=12)

         ->  Hash  (cost=103.33..103.33 rows=10000 width=12)

              ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(7 rows)

1.2 情况2:表a的关联键是分布列,表b的关联键不是分布列

   select * from  a,b where a.seqnum=b.rd1;

通过两种方式进行关联:

(1)b按照rd1字段将数据重分布到各个节点上,然后再与表a进行关联。重分布的数据量是N

carl=# explain select * from  a,b where a.seqnum=b.rd1;

                                      QUERY PLAN                                      

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=117.74..1551.39 rows=10007 width=24)

   Node/s: All datanodes

   ->  Hash Join  (cost=113.74..1212.58 rows=10007 width=24)

         Hash Cond: (b.rd1 = a.seqnum)

         ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..1053.00 rows=120000 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on b  (cost=0.00..228.00 rows=120000 width=12)

         ->  Hash  (cost=103.33..103.33 rows=10000 width=12)

               ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(9 rows)

 

 

(2)将表a广播,每一个节点一份全量数据,然后再与表b进行关联。广播的数据量是M*DN个数

carl=# explain select * from  a,b where a.rd1=b.rd1;

                                    QUERY PLAN                                    

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=376.63..1635.37 rows=10007 width=24)

   Node/s: All datanodes

   ->  Hash Join  (cost=372.63..1296.56 rows=10007 width=24)

         Hash Cond: (a.rd1 = b.rd1)

         ->  Streaming(type: BROADCAST)  (cost=0.00..878.09 rows=120000 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

         ->  Hash  (cost=236.17..236.17 rows=131000 width=12)

               ->  Seq Scan on b  (cost=0.00..236.17 rows=131000 width=12)

(9 rows)

 a表的数据量(M)大于b表的数据量(N)时,则永远是使用表b重分布方式。

 a表的数据量(M)小于b表的数据量(N)时,当(N-M> M*DN个数时,将表a广播到每一个节点上,再与表b进行关联;当(N-M<M*DN个数时,则将表b重分布到各个节点上,再与表a进行关联。

1.3  情况3:表a的关联键不是分布列,表b的关联键不是分布列

  MAXM,N> MINM,N*DN个数时,将表aMINM,N)行数少的表)广播到每一个节点上,再与表b(MAX(M,N)行数较多的表)进行关联,广播的的代价是MINM,N)*DN个数

 carl=# explain select * from  a,b where a.rd1=b.rd1;

                                    QUERY PLAN                                    

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=374.74..1633.48 rows=10011 width=24)

   Node/s: All datanodes

   ->  Hash Join  (cost=370.74..1294.67 rows=10011 width=24)

         Hash Cond: (a.rd1 = b.rd1)

         ->  Streaming(type: BROADCAST)  (cost=0.00..878.09 rows=120000 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

         ->  Hash  (cost=235.33..235.33 rows=130000 width=12)

               ->  Seq Scan on b  (cost=0.00..235.33 rows=130000 width=12)

(9 rows)

  MAXM,N< MINM,N*DN个数时,则将表a,b分别重分布到各个节点上,重分布的代价是M+N

carl=# explain select * from  a,b where a.rd1=b.rd1;

                                        QUERY PLAN                                        

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=186.46..1543.71 rows=10026 width=24)

   Node/s: All datanodes

   ->  Hash Join  (cost=182.46..1204.08 rows=10026 width=24)

         Hash Cond: (b.rd1 = a.rd1)

         ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..978.74 rows=110500 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on b  (cost=0.00..219.08 rows=110500 width=12)

         ->  Hash  (cost=172.05..172.05 rows=10000 width=12)

               ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..172.05 rows=10000 width=12)

                     Spawn on: All datanodes

                     ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(11 rows)

左连接

2.1 情况1关联列为分布列

select * from  a left join b  on a.seqnum=b.seqnum;

carl=# explain select * from  a left join b  on a.seqnum=b.seqnum;

                                QUERY PLAN                                

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=117.74..880.05 rows=14439 width=24)

   Node/s: All datanodes

   ->  Hash Right Join  (cost=113.74..391.33 rows=14439 width=24)

         Hash Cond: (b.seqnum = a.seqnum)

         ->  Seq Scan on b  (cost=0.00..228.04 rows=120050 width=12)

         ->  Hash  (cost=103.33..103.33 rows=10000 width=12)

               ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(7 rows)

2.2   情况2:表a的关联键是分布列,表b的关联键不是分布列

  不管表b有多大,表b的数据总是被重分布

 

carl=# explain select * from  a left join b  on a.seqnum=b.rd1;

                                      QUERY PLAN                                      

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=130.87..1411.37 rows=1001 width=24)

   Node/s: All datanodes

   ->  Hash Right Join  (cost=126.87..1377.65 rows=1001 width=24)

         Hash Cond: (b.rd1 = a.seqnum)

         ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..1206.20 rows=140000 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on b  (cost=0.00..243.67 rows=140000 width=12)

         ->  Hash  (cost=125.83..125.83 rows=1000 width=12)

               ->  Seq Scan on a  (cost=0.00..125.83 rows=1000 width=12)

 

 

carl=# explain select * from  a left join b  on a.seqnum=b.rd1;

                                       QUERY PLAN                                        

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=172.72..4468.55 rows=120000 width=24)

   Node/s: All datanodes

   ->  Hash Left Join  (cost=168.72..406.05 rows=120000 width=24)

         Hash Cond: (a.seqnum = b.rd1)

         ->  Seq Scan on a  (cost=0.00..199.00 rows=120000 width=12)

         ->  Hash  (cost=167.68..167.68 rows=1000 width=12)

               ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..167.68 rows=1000 width=12)

                     Spawn on: All datanodes

                     ->  Seq Scan on b  (cost=0.00..160.83 rows=1000 width=12)

(9 rows)

2.3 情况3:表a的关联键不是分布列,表b的关联键不是分布列

 a和表b则会分别按字段rd1重分布,再关联

carl=# explain select * from  a left join b  on a.rd1=b.rd1;

                                        QUERY PLAN                                        

----------------------------------------------------------------- Streaming (type: GATHER)  (cost=186.46..1703.34 rows=10009 width=24)

   Node/s: All datanodes

   ->  Hash Right Join  (cost=182.46..1364.52 rows=10009 width=24)

         Hash Cond: (b.rd1 = a.rd1)

         ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..1132.94 rows=130500 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on b  (cost=0.00..235.75 rows=130500 width=12)

         ->  Hash  (cost=172.05..172.05 rows=10000 width=12)

               ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..172.05 rows=10000 width=12)

                     Spawn on: All datanodes

                     ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(11 rows)

全连接

3.1 情况1:关联列为分布列

   

carl=# explain select * from  a full  join b  on a.seqnum=b.seqnum;

                                QUERY PLAN                                

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=117.74..4830.17 rows=130500 width=24)

   Node/s: All datanodes

   ->  Hash Full Join  (cost=113.74..412.20 rows=130500 width=24)

         Hash Cond: (b.seqnum = a.seqnum)

         ->  Seq Scan on b  (cost=0.00..235.75 rows=130500 width=12)

         ->  Hash  (cost=103.33..103.33 rows=10000 width=12)

               ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(7 rows)

 

3.2 情况2:表a关联键是分布列,表b关联键不是分布列

       b数据被重分布到各个节点上,再与表a关联

carl=# explain select * from  a full  join b  on a.seqnum=b.rd1;

                                      QUERY PLAN                                      

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=117.74..5727.36 rows=130500 width=24)

   Node/s: All datanodes

   ->  Hash Full Join  (cost=113.74..1309.39 rows=130500 width=24)

         Hash Cond: (b.rd1 = a.seqnum)

         ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..1132.94 rows=130500 width=12)

               Spawn on: All datanodes

               ->  Seq Scan on b  (cost=0.00..235.75 rows=130500 width=12)

         ->  Hash  (cost=103.33..103.33 rows=10000 width=12)

               ->  Seq Scan on a  (cost=0.00..103.33 rows=10000 width=12)

(9 rows)

3.3 情况3 a的关联键不是分布列,表b的关联键不是分布列

a和表b则会分别按字段rd1重分布,再关联

carl=# explain select * from  a full  join b  on a.rd1=b.rd1;

                                        QUERY PLAN                                        

-----------------------------------------------------------------

 Streaming (type: GATHER)  (cost=213.47..6179.47 rows=140000 width=24)

   Node/s: All datanodes

   ->  Hash Full Join  (cost=209.47..1439.75 rows=140000 width=24)

         Hash Cond: (a.rd1 = b.rd1)

         ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..1178.20 rows=140000 width=12)

    Spawn on: All datanodes

               ->  Seq Scan on a  (cost=0.00..215.67 rows=140000 width=12)

         ->  Hash  (cost=199.05..199.05 rows=10000 width=12)

               ->  Streaming(type: REDISTRIBUTE)  (cost=0.00..199.05 rows=10000 width=12)

                     Spawn on: All datanodes

                     ->  Seq Scan on b  (cost=0.00..130.33 rows=10000 width=12)

(11 rows)

 


【版权声明】本文为华为云社区用户原创内容,未经允许不得转载,如需转载请自行联系原作者进行授权。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。