python数据分析入门学习笔记
【摘要】 学习交流!
一.数据导入和导出
(一)读取csv文件
1.本地读取
import pandas as pd
df = pd.read_csv('E:\\tips.csv') #根据自己数据文件保存的路径填写(p.s. python填写路径时,要么使用/,要么使用\\)
#输出:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
5 25.29 4.71 Male No Sun Dinner 4
.. ... ... ... ... ... ... ...
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
[244 rows x 7 columns]
2.网络读取
import pandas as pd
data_url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv" #填写url读取
df = pd.read_csv(data_url)
#输出同上,为了节省篇幅这儿就不粘贴了
3.read_csv详解
功能: Read CSV (comma-separated) file into DataFrame
read_csv(filepath_or_buffer, sep=',', dialect=None, compression='infer', doublequote=True,
escapechar=None, quotechar='"', quoting=0, skipinitialspace=False, lineterminator=None,
header='infer', index_col=None, names=None, prefix=None, skiprows=None,
skipfooter=None, skip_footer=0, na_values=None, true_values=None, false_values=None,
delimiter=None, converters=None, dtype=None, usecols=None, engine=None,
delim_whitespace=False, as_recarray=False, na_filter=True, compact_ints=False,
use_unsigned=False, low_memory=True, buffer_lines=None, warn_bad_lines=True,
error_bad_lines=True, keep_default_na=True, thousands=None, comment=None,
decimal='.', parse_dates=False, keep_date_col=False, dayfirst=False, date_parser=None,
memory_map=False, float_precision=None, nrows=None, iterator=False, chunksize=None,
verbose=False, encoding=None, squeeze=False, mangle_dupe_cols=True, tupleize_cols=False,
infer_datetime_format=False, skip_blank_lines=True)
(二)读取Mysql数据
假设数据库安装在本地,用户名为myusername,密码为mypassword,要读取mydb数据库中的数据
import pandas as pd
import MySQLdb
mysql_cn= MySQLdb.connect(host='localhost', port=3306,user='myusername', passwd='mypassword',
db='mydb')
df = pd.read_sql('select * from test;', con=mysql_cn)
mysql_cn.close()
上面的代码读取了test表中所有的数据到df中,而df的数据结构为Dataframe。
ps.MySQL教程:http://www.runoob.com/mysql/mysql-tutorial.html
(三)读取excel文件要读取excel文件还需要安装xlrd模块,pip install xlrd即可。
df = pd.read_excel('E:\\tips.xls')
(四)数据导出到csv文件
df.to_csv('E:\\demo.csv', encoding='utf-8', index=False)
#index=False表示导出时去掉行名称,如果数据中含有中文,一般encoding指定为‘utf-8’
(五)读写SQL数据库
import pandas as pd
import sqlite3
con = sqlite3.connect('...')
sql = '...'
df=pd.read_sql(sql,con)
#help文件
help(sqlite3.connect)
#输出
Help on built-in function connect in module _sqlite3:
connect(...)
connect(database[, timeout, isolation_level, detect_types, factory])
Opens a connection to the SQLite database file *database*. You can use
":memory:" to open a database connection to a database that resides in
RAM instead of on disk.
#############
help(pd.read_sql)
#输出
Help on function read_sql in module pandas.io.sql:
read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
Read SQL query or database table into a DataFrame.
二.提取和筛选需要的数据
(一)提取和查看相应数据 (用的是tips.csv的数据,数据来源:https://github.com/mwaskom/seaborn-data)
print df.head() #打印数据前五行
#输出
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
print df.tail() #打印数据后5行
#输出
total_bill tip sex smoker day time size
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2
print df.columns #打印列名
#输出
Index([u'total_bill', u'tip', u'sex', u'smoker', u'day', u'time', u'size'], dtype='object')
print df.index #打印行名
#输出
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
234, 235, 236, 237, 238, 239, 240, 241, 242, 243],
dtype='int64', length=244)
print df.ix[10:20, 0:3] #打印10~20行前三列数据
#输出
total_bill tip sex
10 10.27 1.71 Male
11 35.26 5.00 Female
12 15.42 1.57 Male
13 18.43 3.00 Male
14 14.83 3.02 Female
15 21.58 3.92 Male
16 10.33 1.67 Female
17 16.29 3.71 Male
18 16.97 3.50 Female
19 20.65 3.35 Male
20 17.92 4.08 Male
#提取不连续行和列的数据,这个例子提取的是第1,3,5行,第2,4列的数据
df.iloc[[1,3,5],[2,4]]
#输出
sex day
1 Male Sun
3 Male Sun
5 Male Sun
#专门提取某一个数据,这个例子提取的是第三行,第二列数据(默认从0开始算哈)
df.iat[3,2]
#输出
'Male'
print df.drop(df.columns[1, 2], axis = 1) #舍弃数据前两列
print df.drop(df.columns[[1, 2]], axis = 0) #舍弃数据前两行
#为了节省篇幅结果就不贴出来了哈~
print df.shape #打印维度
#输出
(244, 7)
df.iloc[3] #选取第3行
#输出1
total_bill 23.68
tip 3.31
sex Male
smoker No
day Sun
time Dinner
size 2
Name: 3, dtype: object
df.iloc[2:4] #选取第2到第3行
#输出2
total_bill tip sex smoker day time size
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
df.iloc[0,1] #选取第0行1列的元素
#输出3
1.01
(二)筛选出需要的数据(用的是tips.csv的数据,数据来源:https://github.com/mwaskom/seaborn-data)
#example:假设我们要筛选出小费大于$8的数据
df[df.tip>8]
#输出
total_bill tip sex smoker day time size
170 50.81 10 Male Yes Sat Dinner 3
212 48.33 9 Male No Sat Dinner 4
#数据筛选同样可以用”或“和”且“作为筛选条件,比如
#1
df[(df.tip>7)|(df.total_bill>50)] #筛选出小费大于$7或总账单大于$50的数据
#输出
total_bill tip sex smoker day time size
23 39.42 7.58 Male No Sat Dinner 4
170 50.81 10.00 Male Yes Sat Dinner 3
212 48.33 9.00 Male No Sat Dinner 4
#2
df[(df.tip>7)&(df.total_bill>50)]#筛选出小费大于$7且总账单大于$50的数据
#输出
total_bill tip sex smoker day time size
170 50.81 10 Male Yes Sat Dinner 3
#接上
#假如加入了筛选条件后,我们只关心day和time
df[['day','time']][(df.tip>7)|(df.total_bill>50)]
#输出
day time
23 Sat Dinner
170 Sat Dinner
212 Sat Dinner
三.数据描述(用的是tips.csv的数据,数据来源:https://github.com/mwaskom/seaborn-data)
print df.describe() #描述性统计
#输出 各指标都比较简单就不解释了哈
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000
【版权声明】本文为华为云社区用户原创内容,转载时必须标注文章的来源(华为云社区)、文章链接、文章作者等基本信息, 否则作者和本社区有权追究责任。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱:
cloudbbs@huaweicloud.com
- 点赞
- 收藏
- 关注作者
评论(0)