- 微信
- 微博
  
  分享文章到微博
- 复制链接
  
  复制链接到剪贴板

俄罗斯世界出口探索性数据分析

livingbody 发表于 2022/11/14 02:02:15 2022/11/14

4.3k+ 0 0

【摘要】 try : import circlify import pandas_profiling import ploty.expressexcept : !pip install circlify -i https://pypi.tuna.tsinghua.edu.cn/simple --user !pip install pandas_profiling -i ...

try :
    import circlify
    import pandas_profiling
    import ploty.express
except :
    !pip install circlify -i https://pypi.tuna.tsinghua.edu.cn/simple  --user
    !pip install pandas_profiling -i https://pypi.tuna.tsinghua.edu.cn/simple  --user
    !pip install plotly.express -i https://pypi.tuna.tsinghua.edu.cn/simple  --user
    !pip install -U pandas -i https://pypi.tuna.tsinghua.edu.cn/simple  --user

import warnings
warnings.filterwarnings("ignore")

🚢 俄罗斯世界出口探索性数据分析 📈

数据集介绍 📋

此数据为干净

数据包括俄罗斯对世界的出口
i. 225 合作国家
ii. ~20 年度
iii. ~3000 独特的商品 SITC Code description

Link to dataset

!python --version

Python 3.7.12

import pandas as pd
import numpy as np
from plotly import express as px
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

convert_dtypes() pandas

pandas 默认读取文本数据时，会自动读成object。
通过convert_dtypes()可以自动读成最新支持的类型，string

dataframe = pd.read_csv('/home/mw/input/Soviet_70yrs4904/RUStoWorldTrade.csv',
usecols=['Year','Aggregate Level','Reporter ISO','Partner','Partner ISO','Commodity Code','Commodity','Qty Unit','Qty','Netweight (kg)','Trade Value (US$)'])
dataframe = dataframe.convert_dtypes()

数据预处理 📋

数据集庞大，包含130万个数据点，因此它具有 650mb的高内存使用率, 我们将使用作者提供的 Starter Code 代码来优化性能。它将把数据类型转换为更小的数据类型，并帮助我们从五个聚合级别中选择一个 and 并将内存使用率降低95%

Aggregate Level	Level Name	Code Format	Number of Items
1	Section	0	10 Items
2	Division	01	67 Items
`3`	`Group`	`012`	`261 Items`
4	Subgroup	012.1	1033 Items
5	Item	012.13	3121 Items

info

Dataframe对象调用info()会显示DataFrame的内存是哟航情况。
memory_usage='deep’将启用更准确的内存使用报告，将更费内存。

通过memory_usage，查看各列内存使用量，然后就可以针对性进行内存优化了。

dataframe.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1384603 entries, 0 to 1384602
Data columns (total 11 columns):
 #   Column             Non-Null Count    Dtype 
---  ------             --------------    ----- 
 0   Year               1384603 non-null  Int64 
 1   Aggregate Level    1384603 non-null  Int64 
 2   Reporter ISO       1384603 non-null  string
 3   Partner            1384603 non-null  string
 4   Partner ISO        1375914 non-null  string
 5   Commodity Code     1384603 non-null  object
 6   Commodity          1384603 non-null  string
 7   Qty Unit           1384603 non-null  string
 8   Qty                1283894 non-null  Int64 
 9   Netweight (kg)     1323567 non-null  Int64 
 10  Trade Value (US$)  1384603 non-null  Int64 
dtypes: Int64(5), object(1), string(5)
memory usage: 638.8 MB

for col in dataframe.columns:
    if dataframe[col].dtype == 'Float64':
        dataframe[col] = dataframe[col].astype('float16')
    try : 
        if dataframe[col].dtype == 'Int64':
            dataframe[col] = dataframe[col].astype('int16')
    except :
        dataframe[col] = dataframe[col].astype('float16')
dataframe.drop(dataframe[dataframe['Commodity Code'] == 'TOTAL'].index, inplace=True)
dataframe['Commodity Code'] = dataframe['Commodity Code'].astype('float16')

dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1384426 entries, 0 to 1384602
Data columns (total 11 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Year               1384426 non-null  int16  
 1   Aggregate Level    1384426 non-null  int16  
 2   Reporter ISO       1384426 non-null  string 
 3   Partner            1384426 non-null  string 
 4   Partner ISO        1375737 non-null  string 
 5   Commodity Code     1384426 non-null  float16
 6   Commodity          1384426 non-null  string 
 7   Qty Unit           1384426 non-null  string 
 8   Qty                1283826 non-null  float16
 9   Netweight (kg)     1323499 non-null  float16
 10  Trade Value (US$)  1384426 non-null  int16  
dtypes: float16(3), int16(3), string(5)
memory usage: 79.2 MB

df = dataframe[dataframe['Aggregate Level']==2]
iso = pd.read_csv('/home/mw/input/Soviet_70yrs4904/iso3.csv')
iso.drop(axis=1, columns=['FIPS','ISO (2)','ISO (No)','Internet','Note','Capital'], inplace=True)

continents = ['Asia', 'Europe', 'Africa', 'Oceania', 'Americas']
for x in continents:        
    y = iso[iso['Continent'] == x]
    m = df['Partner ISO'].isin(y['ISO (3)'])
    df.loc[m, 'Continent'] = x

Region = ['South Asia', 'South East Europe', 'Northern Africa', 'Pacific',
       'South West Europe', 'Southern Africa', 'West Indies',
       'South America', 'South West Asia', 'Central Europe',
       'Eastern Europe', 'Western Europe', 'Central America',
       'Western Africa', 'South East Asia', 'Central Africa',
       'North America', 'East Asia', 'Indian Ocean', 'Northern Europe',
       'Eastern Africa', 'Southern Europe', 'Central Asia',
       'Northern Asia']

for x in Region:        
    y = iso[iso['Region'] == x]
    m = df['Partner ISO'].isin(y['ISO (3)'])
    df.loc[m, 'Region'] = x

此外，我们将使用提供的iso3数据集将Continents和SubContinents 添加到数据集中

在我们的操作之后，内存使用量已经减少了95%，降至40mb

df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 79030 entries, 11 to 1384534
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               79030 non-null  int16  
 1   Aggregate Level    79030 non-null  int16  
 2   Reporter ISO       79030 non-null  string 
 3   Partner            79030 non-null  string 
 4   Partner ISO        78318 non-null  string 
 5   Commodity Code     79030 non-null  float16
 6   Commodity          79030 non-null  string 
 7   Qty Unit           79030 non-null  string 
 8   Qty                31621 non-null  float16
 9   Netweight (kg)     31556 non-null  float16
 10  Trade Value (US$)  79030 non-null  int16  
 11  Continent          74838 non-null  object 
 12  Region             74838 non-null  object 
dtypes: float16(3), int16(3), object(2), string(5)
memory usage: 38.5 MB

Pandas Profiling 📋

了解数据，使用交互式元素进行浏览 🔬

pandas df.describe() 函数很强大，但对于严肃的探索性数据分析来说，这是一点基础。 pandas_profiling 用扩展pandad数据帧。profile_report()于快速数据分析。

year_list = [2010,2012,2014,2016,2018,2020]
top_countries_by_year = df[df['Year'].isin(year_list)]
top_countries_by_year = top_countries_by_year[top_countries_by_year['Trade Value (US$)'] >= 0]
top_countries_by_year = top_countries_by_year.groupby(['Year','Partner ISO']).agg({'Trade Value (US$)':'sum'}) ; top_countries_by_year.reset_index(inplace=True)
top_countries_by_year.sort_values(by=['Year','Trade Value (US$)'],ascending=[True,False], inplace=True)
top_countries_by_year = top_countries_by_year.groupby('Year').head(1)

EDA | 探索性数据分析 Exploratory Data Analysis 📊

I. 历年俄罗斯出口贸易价值 📅

俄罗斯总体上能够保持稳定的贸易价值，贸易额在 5-800万美元的范围内。拥有稳定的贸易确保了金融安全并帮助各国公民从各种选择中进行选择。

# 历年交易量

ts = df.groupby(['Year']).agg({'Trade Value (US$)':'sum'}) ; ts.reset_index(inplace=True)
fig = px.area(ts, x='Year', y='Trade Value (US$)', template='plotly_white')
fig.update_layout(font_family="verdana",font_size=12, paper_bgcolor="#e9f1f5",plot_bgcolor="#e9f1f5",title="Volume of Trade over Years<br><sup>Exports by Russia have been in a 'range' for the past years</sup>")
fig.update_yaxes(tickprefix="$", showgrid=False, range = [0,12500000])

fig.add_annotation(text='<b>Housing Crisis', x='2008', y=4200000, arrowhead=1, showarrow=True)
fig.add_shape(
    type='line', line_color='salmon', line_width=1, opacity=1, line_dash='dot',
    x0=2010, y0=0, x1=2010, y1=6053000, xref='x', yref='y') ; fig.add_annotation(text='<b>Germany', x='2010', y=6700000, showarrow=False)
fig.add_shape(
    type='line', line_color='salmon', line_width=1, opacity=1, line_dash='dot',
    x0=2012, y0=0, x1=2012, y1=6354000, xref='x', yref='y') ; fig.add_annotation(text='<b>Lithuania', x='2012', y=7000000, showarrow=False)
fig.add_shape(
    type='line', line_color='salmon', line_width=1, opacity=1, line_dash='dot',
    x0=2014, y0=0, x1=2014, y1=7722400, xref='x', yref='y') ; fig.add_annotation(text='<b>Mongolia', x='2014', y=8200000, showarrow=False)
fig.add_shape(
    type='line', line_color='salmon', line_width=1, opacity=1, line_dash='dot',
    x0=2016, y0=0, x1=2016, y1=8079900, xref='x', yref='y') ; fig.add_annotation(text='<b>Kyrgyzstan', x='2016', y=8500000, showarrow=False)
fig.add_shape(
    type='line', line_color='salmon', line_width=1, opacity=1, line_dash='dot',
    x0=2018, y0=0, x1=2018, y1=6054000, xref='x', yref='y') ; fig.add_annotation(text='<b>China', x='2018', y=6600000, showarrow=False)

fig.add_shape(
    type='line', line_color='black', line_width=1, opacity=1,
    x0=2010, y0=10000000, x1=2018, y1=10000000, xref='x', yref='y') ; fig.add_annotation(text='<b>Top Importers of Goods with respect to Year', x='2014', y=11000000, showarrow=False)
fig.add_shape(type='line', line_color='black', line_width=1, opacity=1,x0=2010, y0=9000000, x1=2010, y1=10000000, xref='x', yref='y')
fig.add_shape(type='line', line_color='black', line_width=1, opacity=1,x0=2018, y0=9000000, x1=2018, y1=10000000, xref='x', yref='y')
fig.add_shape(type='line', line_color='black', line_width=1, opacity=1,x0=2014, y0=10000000, x1=2014, y1=10500000, xref='x', yref='y')

fig.show()

cmap_chem = ['rgb(214,96,77)','#cde3ef','rgb(67,147,195)']
cmap_ec = ['rgb(178,24,43)','#cde3ef','rgb(33,102,172)']
circle_color = ['#67001f','#d6604d', '#9fcce2', '#4393c3','#053061']
# re-import
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

II. 各大洲贸易出口细目 🗺️

我们可以得出结论，大多数出口都是与邻国或陆地范围内的国家进行的（运输方式不需要海运/空运方式）。世界银行的一项研究表明，全球90%的货物是通过海运运输的。

.: Scale :.
High to Low
🟦 > 🟥

import math
millnames = ['',' K',' M',' B',' Tr']
import circlify
def millify(n):
    n = float(n)
    millidx = max(0,min(len(millnames)-1,
                        int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))
    return '{:.0f}{}'.format(n / 10**(3 * millidx), millnames[millidx])

fig, ax = plt.subplots(figsize=(14,14))
ax.set_facecolor('#e9f1f5')
cc = df.copy()
cc = cc.groupby('Continent').agg({'Trade Value (US$)':'sum'}) ; cc.reset_index(inplace=True)
cc = cc[cc['Trade Value (US$)'] >= 0]
cc.sort_values(by='Trade Value (US$)', inplace=True)

labels = cc['Continent']
circles = circlify.circlify(
    cc['Trade Value (US$)'].tolist(), 
    show_enclosure=False, 
    target_enclosure=circlify.Circle(x=0, y=0, r=1))

lim = max(
    max(
        abs(circle.x) + circle.r,
        abs(circle.y) + circle.r,)
    for circle in circles)
plt.xlim(-lim-0.15, lim+0.15)
plt.ylim(-lim-0.15, lim+0.15)
ax.set(xticklabels=[],yticklabels=[])
# print circles
for circle, label, color, val in zip(circles, labels, circle_color,cc['Trade Value (US$)']) :
    x, y, r = circle
    ax.add_patch(plt.Circle((x, y),  r,alpha=0.9, linewidth=1, facecolor=color, edgecolor='black',))
    plt.annotate(label, (x,y ) ,va='center', ha='center',color='white', fontsize=12)# bbox=dict(facecolor='white', edgecolor='black', boxstyle='round', pad=.5))
    plt.annotate(millify(val), (x,y-0.04 ) ,va='center', ha='center',color='white', fontsize=10)
    plt.annotate('Minimum Imports', xy=(0.7,0.55), xytext=(0.85, 0.8),arrowprops=dict(arrowstyle="-[", connectionstyle="angle3,angleA=90,angleB=0"))
    plt.annotate('Maximum Imports', xy=(0.55,-0.54), xytext=(0.75, -0.75),arrowprops=dict(arrowstyle="-[",connectionstyle="angle3,angleA=180,angleB=90"))
    plt.annotate('Average Imports', xy=(-0.40,0.65), xytext=(-0.70, 0.85),arrowprops=dict(arrowstyle="-[",connectionstyle="angle3,angleA=90,angleB=180"))

total = df.copy()
total = total.groupby(['Continent','Partner ISO']).agg({'Trade Value (US$)':'sum'}) ; total.reset_index(inplace=True)
total = total.sort_values(by='Trade Value (US$)', ascending=False) ; total = total[total['Trade Value (US$)'] >= 0]
top5 = total.head(6)
top5 = top5['Partner ISO']

III. 按大陆和国家划分的贸易细分 🌎

下面的地图显示了迄今为止每个国家、每个大陆的总出口量。
.: Scale :.
High to Low
🟦 > 🟥

# Tree map
fig = px.treemap(total, path=[px.Constant('World'), 'Continent', 'Partner ISO'], values='Trade Value (US$)', color='Trade Value (US$)', hover_data=['Partner ISO'],
                  color_continuous_scale='RdBu',color_continuous_midpoint=np.percentile(total['Trade Value (US$)'],85), template='plotly_white')
fig.update_layout(font_family="verdana", font_size=12, paper_bgcolor="#e9f1f5",plot_bgcolor="#e9f1f5",title='<b>Top importers of Russian goods</b> <br><sup><i>Hover over blocks for more detail</i></sup>',)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

V. 所有商品的贸易密度|商品代码的核心密度w/ref贸易价值

在100种二级聚合商品中，最常交易的商品介于 40 到 80之间

**化学品-医药产品-钢铁-机械-道路车辆 **商品的主要贡献

# 所有商品的KDE

tt = df.copy()
tt = tt.groupby(['Commodity Code',]).agg({'Trade Value (US$)':'sum'}) ; tt.reset_index(inplace=True)
sns.jointplot(data=tt, x='Commodity Code', y='Trade Value (US$)', kind='kde',  cmap='Blues', shade=True, thresh=0, height=10, fill=True)

<seaborn.axisgrid.JointGrid at 0x7f6c50f7ff90>

VI. Top-5 商品贸易密度

国家	最高
印度	鞋类
美国	饮料
中国	染色、鞣制和着色材料
日本	专业机械
沙特阿拉伯	专业机械
德国	专业机械

top_com = df.copy() ; tc = top_com
top_com = top_com.groupby(['Commodity Code']).agg({'Trade Value (US$)':'sum'}) ; top_com.reset_index(inplace=True)
top_com.sort_values(by='Trade Value (US$)', inplace=True, ascending=False)
top_com = top_com['Commodity Code'].head(5)

fig, ax = plt.subplots(3,2,figsize=(18,17))
countries = ['IND','USA','CHN','JPN','SAU','DEU']
f6_pallet = ['#67001f','#d6604d', '#9fcce2', '#4393c3','#053061']
fig.suptitle('Density of Top-5 Commodities amongst different Countries')
labels=['Footwear','Beverages','Dyeing, Tanning and Coloring meterials','Specialized Machinery','Telco,Sound recording/reproducing Equipment']
fig.patch.set_facecolor('#e9f1f5')
count = 0
for i in range(3):
    for j in range(2):
        f6 = df[(df['Partner ISO'] == countries[count]) & (df['Commodity Code'].isin(top_com))]
        f6 = f6.groupby(['Year','Commodity Code']).agg({'Trade Value (US$)':'sum'}) ;f6.reset_index(inplace=True)
        a = sns.kdeplot(data=f6, ax=ax[i,j], x='Trade Value (US$)', hue='Commodity Code', fill=True, alpha=0.4, palette = f6_pallet, legend=False,)
        a.set(xticklabels=[],yticklabels=[])
        ax[i,j].set_title(countries[count]) ; count = count + 1
fig.legend([ax], labels=labels, loc='upper left', ncol=5, mode='expand',borderaxespad=4,)

<matplotlib.legend.Legend at 0x7f6c4f2f99d0>

世界范围分布 🗺️

世界地图上俄罗斯流行商品的分布

衣服 👚 👓
能源 ⛽ 🛢️
饮品 🍺 ☕ 🍾

# 83	TRAVEL GOODS, HANDBAGS AND SIMILAR CONTAINERS
# 84	ARTICLES OF APPAREL AND CLOTHING ACCESSORIES
# 85	FOOTWEAR
one = df[df['Commodity Code'].isin([83,84,85])]
one = one.groupby(['Partner ISO']).agg({'Trade Value (US$)':'sum'}) ; one.reset_index(inplace=True)
# 32	COAL, COKE AND BRIQUETTES
# 33	PETROLEUM, PETROLEUM PRODUCTS AND RELATED MATERIALS
# 34	GAS, NATURAL AND MANUFACTURED
two = df[df['Commodity Code'].isin([32,33,34])]
two = two.groupby(['Partner ISO']).agg({'Trade Value (US$)':'sum'}) ; two.reset_index(inplace=True)
# 11	BEVERAGES
# 12	TOBACCO AND TOBACCO MANUFACTURES
three = df[df['Commodity Code'].isin([11,12])]
three = three.groupby(['Partner ISO']).agg({'Trade Value (US$)':'sum'}) ; three.reset_index(inplace=True)

Title = ['Top Importers of Fashionable Threads & Apparel<br><sup>Travel Goods, Footwear, Clothing apparel</sup>',
        'Top Importers of Energy materials<br><sup>Coal, Briquettes, Gas, Fuel</sup>',
        'Top Importers of Beverages<br><sup>Beverages, Tobacco, Tobacco manufactures</sup>',]
for x,y in zip([one,two,three],Title) :
    fig = px.choropleth(x,locations='Partner ISO', locationmode='ISO-3',color='Trade Value (US$)', color_continuous_scale='RdBu', color_continuous_midpoint=np.average(x['Trade Value (US$)']))
    fig.update_layout(font_family="verdana",font_size=12,title=y,paper_bgcolor="#e9f1f5",plot_bgcolor="#e9f1f5")
    fig.update_layout(
    autosize=False,
    width=1200,
    height=600,
    template= 'plotly',
) ; fig.show()

结论 🔎

对于有机/非有机化学品🧪 & 医药原料💊 💉, 前10个国家的平均进口价值为80000美元，而全球的总体平均进口价值约为2000美元

正在交易的前5种贵重商品的KDE密度表明，世界对俄罗斯的专用机械有着浓厚的兴趣

在100种二级综合商品中，最常交易的商品在40至80种之间，主要是化学品-医药产品-钢铁-机械-道路车辆商品

俄罗斯最大的货物进口国位于同一大陆，即亚洲。最高的是朝鲜，其次是新西兰和肯尼亚

我们可以得出结论，大多数出口都是与邻国或陆地范围内的国家进行的（运输方式不需要海运/空运方式）。世界银行的一项研究表明，全球90%的货物是俄罗斯，俄罗斯能够保持总体稳定的贸易价值，贸易额一直在500万至800万美元之间。拥有稳定的贸易可以确保金融安全，并帮助各国公民从各种选择中进行选择。

哈萨克斯坦和中国是俄罗斯电力的最大进口国

【声明】本内容来自华为云开发者社区博主，不代表华为云及华为云开发者社区的观点和立场。转载时必须标注文章的来源（华为云社区）、文章链接、文章作者等基本信息，否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容，欢迎发送邮件进行举报，并提供相关证据，一经查实，本社区将立刻删除涉嫌侵权内容，举报邮箱： cloudbbs@huaweicloud.com

点赞
收藏
关注作者

作者其他文章

抱歉，系统识别当前为高风险访问，暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称，即可参与社区互动！

*长度不超过10个汉字或20个英文字符，设置后3个月内不可修改。

确认取消

开发者空间

了解空间

工作台

开发工具

实战案例

空间活动

空间论坛

开发平台

软件开发生产线 CodeArts

AI平台ModelArts

数据治理中心 DataArts Studio

数字内容生产线 MetaStudio

精选服务

云数据库 GaussDB

云数据库 RDS for MySQL

MapReduce服务 MRS

数据仓库服务 DWS

分布式缓存服务Redis版

分布式消息服务 DMS

华为云实时音视频 SparkRTC

媒体处理 MPC

主机迁移服务 SMS

对象存储迁移服务 OMS

云消息服务 KooMessage

云手机服务 KooPhone

企业搜索服务 KooSearch

云地图服务 KooMap

更多开放能力

开发工具

API生态

CodeArts API

API Explorer

Terraform Explorer

KooCLI

API 中心

SDK 中心

开发服务

Codelabs

DevStar

低代码平台Astro

CodeArts IDE

Huawei Cloud Toolkit

Classroom

开发资源

开源镜像站

开源资源

开发语言

开发实践

入门精选

分发资源

企业应用中心

企业通用专区

教育专区

华为应用专区

政务云专区

硬件云服务商城

医疗健康专区

跳蚤市场

华为云开发者日

直播专区

开发者精品活动

DTSE Tech Talk

加入HCDE

加入HCDG

加入HCSD

加入HCWD

鲁班会

沃土云创计划·企业

沃土云创计划·高校

沃土云创计划·个人

沃土云创计划·开源共创

博客

论坛

专题

开发者榜单

学习路径

在线课程

动手实验

考试认证

培训服务

俄罗斯世界出口探索性数据分析

🚢 俄罗斯世界出口探索性数据分析 📈