客快物流大数据项目(八十八):ClickHouse快速入门
ClickHouse快速入门
一、安装ClickHouse(单机)
1、安装yum-utils工具包
yum install yum-utils
2、添加ClickHouse的yum源
yum-config-manager --add-repo https://repo.yandex.ru/clickhouse/rpm/stable/x86_64
3、安装ClickHouse的服务端和客户端
yum install -y clickhouse-server clickhouse-client
如果安装时出现warning: rpmts_HdrFromFdno: Header V4 RSA/SHA1 Signature, key ID e0c56bd4: NOKEY错误导致无法安装,需要在安装命令中添加—nogpgcheck来解决。
yum install -y clickhouse-server clickhouse-client --nogpgcheck
4、关于安装的说明
默认的配置文件路径是:/etc/clickhouse-server/
默认的日志文件路径是:/var/log/clickhouse-server/
5、查看ClickHouse的版本信息
clickhouse-client -m --host node2 --user root --password 123456
select version();
二、在命令行中操作ClickHouse
ClickHouse安装包中提供了clickhouse-client工具,这个客户端在运行shell环境中,使用TCP方式连接clickhouse-server服务。要运行该客户端工具可以选择使用交互式与非交互式(批量)两种模式:使用非交互式查询时需要指定--query参数;在交互模式下则需要注意是否使用—mutiline参数来开启多行模式。
clickhouse-client提供了很多参数可供使用,常用的参数如下表:
参数 |
介绍 |
---|---|
--host,-h |
服务端的 host 名称, 默认是 'localhost'。 您可以选择使用 host 名称或者 IPv4 或 IPv6 地址。 |
--port |
连接服务端的端口,默认值9000 |
--user,-u |
访问的用户名,默认default |
--password |
访问用户的密码,默认空字符串 |
--query,-q |
非交互模式下的查询语句 |
--database,-d |
连接的数据库,默认是default |
--multiline,-m |
使用多行模式,在多行模式下,回车键仅表示换行。默认不使用多行模式。 |
--multiquery,-n |
使用”,”分割的多个查询,仅在非交互模式下有效 |
--format, -f |
使用指定格式化输出结果 |
--vertical, -E |
使用垂直格式输出,即每个值使用一行显示 |
--time, -t |
打印查询时间到stderr中 |
--stacktrace |
如果出现异常,会打印堆栈跟踪信息 |
--config-file |
使用指定配置文件 |
--use_client_time_zone |
使用服务端时区 |
quit,exit |
表示退出客户端 |
Ctrl+D,Ctrl+C |
表示退出客户端 |
登录
clickhouse-client -m --host node2 --user root --password 123456
三、样例数据导入
1、编写下载航班数据脚本
创建名为 clickhouse-example-data-download.sh 的脚本文件
vim clickhouse-example-data-download.sh
for s in `seq 2017 2022`
do
for m in `seq 1 12`
do
wget https://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_${s}_${m}.zip
done
done
2、下载航班数据
chmod +x clickhouse-example-data-download.sh
./clickhouse-example-data-download.sh
3、创建ontime表
CREATE TABLE `ontime` (
`Year` UInt16,
`Quarter` UInt8,
`Month` UInt8,
`DayofMonth` UInt8,
`DayOfWeek` UInt8,
`FlightDate` Date,
`UniqueCarrier` FixedString(7),
`AirlineID` Int32,
`Carrier` FixedString(2),
`TailNum` String,
`FlightNum` String,
`OriginAirportID` Int32,
`OriginAirportSeqID` Int32,
`OriginCityMarketID` Int32,
`Origin` FixedString(5),
`OriginCityName` String,
`OriginState` FixedString(2),
`OriginStateFips` String,
`OriginStateName` String,
`OriginWac` Int32,
`DestAirportID` Int32,
`DestAirportSeqID` Int32,
`DestCityMarketID` Int32,
`Dest` FixedString(5),
`DestCityName` String,
`DestState` FixedString(2),
`DestStateFips` String,
`DestStateName` String,
`DestWac` Int32,
`CRSDepTime` Int32,
`DepTime` Int32,
`DepDelay` Int32,
`DepDelayMinutes` Int32,
`DepDel15` Int32,
`DepartureDelayGroups` String,
`DepTimeBlk` String,
`TaxiOut` Int32,
`WheelsOff` Int32,
`WheelsOn` Int32,
`TaxiIn` Int32,
`CRSArrTime` Int32,
`ArrTime` Int32,
`ArrDelay` Int32,
`ArrDelayMinutes` Int32,
`ArrDel15` Int32,
`ArrivalDelayGroups` Int32,
`ArrTimeBlk` String,
`Cancelled` UInt8,
`CancellationCode` FixedString(1),
`Diverted` UInt8,
`CRSElapsedTime` Int32,
`ActualElapsedTime` Int32,
`AirTime` Int32,
`Flights` Int32,
`Distance` Int32,
`DistanceGroup` UInt8,
`CarrierDelay` Int32,
`WeatherDelay` Int32,
`NASDelay` Int32,
`SecurityDelay` Int32,
`LateAircraftDelay` Int32,
`FirstDepTime` String,
`TotalAddGTime` String,
`LongestAddGTime` String,
`DivAirportLandings` String,
`DivReachedDest` String,
`DivActualElapsedTime` String,
`DivArrDelay` String,
`DivDistance` String,
`Div1Airport` String,
`Div1AirportID` Int32,
`Div1AirportSeqID` Int32,
`Div1WheelsOn` String,
`Div1TotalGTime` String,
`Div1LongestGTime` String,
`Div1WheelsOff` String,
`Div1TailNum` String,
`Div2Airport` String,
`Div2AirportID` Int32,
`Div2AirportSeqID` Int32,
`Div2WheelsOn` String,
`Div2TotalGTime` String,
`Div2LongestGTime` String,
`Div2WheelsOff` String,
`Div2TailNum` String,
`Div3Airport` String,
`Div3AirportID` Int32,
`Div3AirportSeqID` Int32,
`Div3WheelsOn` String,
`Div3TotalGTime` String,
`Div3LongestGTime` String,
`Div3WheelsOff` String,
`Div3TailNum` String,
`Div4Airport` String,
`Div4AirportID` Int32,
`Div4AirportSeqID` Int32,
`Div4WheelsOn` String,
`Div4TotalGTime` String,
`Div4LongestGTime` String,
`Div4WheelsOff` String,
`Div4TailNum` String,
`Div5Airport` String,
`Div5AirportID` Int32,
`Div5AirportSeqID` Int32,
`Div5WheelsOn` String,
`Div5TotalGTime` String,
`Div5LongestGTime` String,
`Div5WheelsOff` String,
`Div5TailNum` String
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192)
4、导入数据
创建名为 import.sh 的脚本文件
vim import.sh
for i in *.zip; do echo $i; unzip -cq $i '*.csv' | sed 's/\.00//g' | clickhouse-client --host=node2 --query="INSERT INTO default.ontime FORMAT CSVWithNames"; done
chmod +x import.sh
./import.sh
四、简单查询
- 查询总条数
select count(1) from ontime;
- 查询从2017年到2022年每天的航班数
SELECT DayOfWeek, count(1) AS c FROM ontime WHERE Year>=2017 AND Year<=2022 GROUP BY DayOfWeek ORDER BY c DESC;
- 查询从2017年到2022年每周延误超过10分钟的航班数。
SELECT DayOfWeek, count(1) AS c FROM ontime WHERE DepDelay>10 AND Year>=2017 AND Year<=2022 GROUP BY DayOfWeek ORDER BY c DESC;
- 查询2017年到2022年每个机场延误超过10分钟以上的次数
SELECT Origin, count(1) AS c FROM ontime WHERE DepDelay>10 AND Year>=2017 AND Year<=2022 GROUP BY Origin ORDER BY c DESC LIMIT 10;
- 查询2022年各航空公司延误超过10分钟以上的次数
SELECT Carrier, count(1) FROM ontime WHERE DepDelay>10 AND Year=2022 GROUP BY Carrier ORDER BY count(1) DESC;
- 查询2022年各航空公司延误超过10分钟以上的百分比
SELECT Carrier, avg(DepDelay>10)*100 AS c3 FROM ontime WHERE Year=2022 GROUP BY Carrier ORDER BY Carrier;
- 查询2017年到2022年期间各航空公司延误超过10分钟以上的百分比
SELECT Carrier, avg(DepDelay>10)*100 AS c3 FROM ontime WHERE Year>=2017 AND Year<=2022 GROUP BY Carrier ORDER BY Carrier;
- 每年航班延误超过10分钟的百分比
SELECT Year, avg(DepDelay>10) FROM ontime GROUP BY Year ORDER BY Year;
- 每年更受人们喜爱的目的地
SELECT DestCityName, uniqExact(OriginCityName) AS u FROM ontime WHERE Year >= 2017 and Year < 2022 GROUP BY DestCityName ORDER BY u DESC LIMIT 10;
uniqExact:可以对指定的列进行去重操作,去重以后计算数量
- 每一年的航班总次数
SELECT Year, count(1) AS c1 FROM ontime GROUP BY Year;
- 点赞
- 收藏
- 关注作者
评论(0)