The Meaning of Each Field in Weibo Data

Posted by 冬晨夕阳 on 2022/03/30 01:54:19


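I have been writing a Weibo crawler recently. The framework is basically stable, but I was stuck for several days on parsing the data because I did not know what each field meant. The official API documentation seems somewhat out of date and leaves many fields unannotated, so I had to work out the meanings bit by bit myself.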

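The Weibo data returned by the mobile site is in JSON format. After fetching one page and loading it as data, the expression
data['cards'][0]['card_group']
gives an array in which each element is one post, including the publish time, the content, the author, any reposted content, and so on. The fields are: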

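'idstr',                    # same as id, but as a string
'id',                       # post id
'created_timestamp',        # creation timestamp, e.g. 1448617509
'created_at',               # creation time. Note that posts from before the current year
                            # use the format 'year-month-day hour:min:sec',
                            # while posts from the current year use 'month-day hour:min:sec'
'attitudes_count',          # number of likes
'reposts_count',            # number of reposts
'comments_count',           # number of comments
'isLongText',               # whether this is a long post (always False in what I have seen so far)
'source',                   # client the user posted from (iPhone, etc.)
'pid',                      # meaning unclear, but worth saving; may be empty
'bid',                      # meaning unclear, but worth saving; may be empty

# Image fields -----------------------------------------------------------
'original_pic',             # URL of the original image
'bmiddle_pic',              # image URL; same as original_pic with 'large' replaced by 'bmiddle'
'thumbnail_pic',            # URL appears to match the image URL in pic, at thumb size
'pic_ids',                  # image ids, an array
'pics',                     # present only if the post has images; an array of dicts
                            # with size, pid, geo, url, etc.

# The fields below hold arrays or dicts and need further processing --------
'retweeted_status',         # the reposted post; present only if this post is a repost. Its fields match those of a regular post
'user',                     # user info, a dict. After parsing (see parse_user_info), ['uid'] and ['name'] hold the user's id and name
'page_info',                # info on links embedded in the post: external links, articles, videos, location topics, etc.
'topic_struct',             # topic info, an array of dicts, each with a 'topic_title' entry
'text',                     # post text; contains emoticons, external links, replies, images, topics, etc.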
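
As a rough illustration of the structure described above, the sketch below walks a hand-made payload and normalizes `created_at` to a full date. The sample payload, the `normalize_created_at` helper and its `current_year` parameter are assumptions made for illustration; real responses carry many more fields. It also assumes that each element of `card_group` wraps the post under an `'mblog'` key, as the parser code in this article does.

```python
import json

def normalize_created_at(created_at, current_year):
    """Return a 'year-month-day ...' string, adding the year if it is missing."""
    # Assumption: current-year posts look like '11-27 17:45:09' and older
    # posts like '2014-11-27 17:45:09', as described in the field notes.
    date_part = created_at.split(' ')[0]
    if date_part.count('-') == 2:
        return created_at
    return '{}-{}'.format(current_year, created_at)

# Hand-made payload for illustration only, not real API output.
raw = json.dumps({
    'cards': [{
        'card_group': [
            {'mblog': {'id': 1, 'created_at': '11-27 17:45:09', 'text': 'hello'}},
            {'mblog': {'id': 2, 'created_at': '2014-03-01 08:00:00', 'text': 'old post'}},
        ],
    }]
})

data = json.loads(raw)
posts = [card['mblog'] for card in data['cards'][0]['card_group']]
dates = [normalize_created_at(p['created_at'], 2015) for p in posts]
print(dates)
```

Checking the number of dashes in the date part avoids depending on the exact string length, which is what the length test in the parser below relies on.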

  
 

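A note on handling the text field. Since text is actually a fragment of HTML, it could be analyzed with an HTML parsing library (such as Python's BeautifulSoup or Java's jsoup). However, that is unnecessary and slow, and installing the dependencies when setting up the client on a fresh cloud host is a hassle, so I use regular expressions instead.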
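
For comparison only, and not the approach taken in this article: Python's standard library ships `html.parser`, which needs no extra install. The sketch below is a minimal example under that assumption; the `TextAndLinks` class name and the sample string are made up for illustration.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect the plain text and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)

# Made-up fragment in the same general shape as a Weibo text field.
sample = ('<a class="k" href="/k/demo?from=feed">#demo#</a>'
          ' some text <a href="/u/123">@someone</a>')

parser = TextAndLinks()
parser.feed(sample)
parser.close()
print(parser.links)
print(''.join(parser.text_parts))
```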

Here is a text value picked at random:

<a class="k" href="/k/%E9%9F%B3%E4%B9%90%E4%B8%8B%E5%8D%88%E8%8C%B6?from=feed">#音乐下午茶#</a>
未觉池塘春草梦，阶前梧叶已秋声。雨侵坏瓮新苔绿，秋入横林数叶红。夜深风竹敲秋韵，万叶千声皆是恨。人人解说悲秋事，不似诗人彻底知...<a data-url=http://t.cn/zRMpLeo href="http://weibo.cn/sinaurl?u=http%3A%2F%2Ft.cn%2FzRMpLeo%3Furl_type%3D1%26object_type%3D%26pos%3D1&ep=AfF1y1cpN%2C2590506210%2CAfF1y1cpN%2C2590506210" class=""><i class="iconimg iconimg-xs"><img src="http://u1.sinaimg.cn/upload/2014/10/16/timeline_card_small_video_default.png"></i><span class="surl-text">视频</span></a>

  
 

The text above contains a topic link (#音乐下午茶#), an external link (视频, meaning "video"), and the plain text (未觉池塘春草梦，阶前梧叶已秋声。雨侵坏瓮新苔绿，秋入横林数叶红。夜深风竹敲秋韵，万叶千声皆是恨。人人解说悲秋事，不似诗人彻底知…).

The regular expression for each item is as follows:

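<i.+?</i>           # emoticon; may also wrap other content
<a class="k".+?</a> # topic
<a href.+?</a>      # user link
回复.+?//            # reply (回复 means "reply")
\[.+?\]             # emoticon
<a data-url.+?</a>  # usually an external link: a video, a web page, etc.
<img.+?>            # image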
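
To see how these patterns behave, here is a small sketch that applies three of them to a shortened, made-up sample. The sample string and its URLs are placeholders chosen for illustration, not real Weibo output.

```python
import re

# Patterns copied from the list above.
P_TOPIC = re.compile(r'<a class="k".+?</a>')
P_LINK = re.compile(r'<a data-url.+?</a>')
P_FACE = re.compile(r'\[.+?\]')

# Shortened, made-up sample in the same shape as a Weibo text field.
sample = (
    '<a class="k" href="/k/demo?from=feed">#demo#</a>'
    'Nice weather[smile]'
    '<a data-url="http://t.cn/example" href="http://example.com/v">'
    '<span class="surl-text">video</span></a>'
)

topics = P_TOPIC.findall(sample)
links = P_LINK.findall(sample)
faces = P_FACE.findall(sample)

# Strip every matched fragment; what is left is the plain text.
plain = sample
for pattern in (P_LINK, P_TOPIC, P_FACE):
    plain = pattern.sub('', plain)

print(topics)
print(faces)
print(plain)
```

The non-greedy `.+?` matters here: a greedy `.+` would run from the first `<a` to the last `</a>` and swallow the plain text in between.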

  
 

Below is the corresponding Python code, all contained in the parseMicroblogPage class. Once you have the page data, call its parse_blog_page function, which returns a list of processed posts.

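import json
import re
import time

# Note: save_page() and config.CURRENT_YEAR are referenced below but are defined
# elsewhere in the author's project; they are not shown in this post.

class parseMicroblogPage():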

    def __init__(self):
        self.p_face=re.compile(r'\[.+?\]')
        self.p_face_i=re.compile(r'<i.+?</i>')
        self.p_user=re.compile(r'<a href.+?</a>')
        self.p_topic=re.compile(r'<a class="k".+?</a>')
        self.p_reply=re.compile(r'回复.+?//')
        self.p_link=re.compile(r'<a data-url.+?</a>')
        self.p_img=re.compile(r'<img.+?>')
        self.p_span=re.compile(r'<span.+?</span>')
        self.p_http_png=re.compile(r'http://.+?png')

    def parse_blog_page(self,data):
        try:        # check if the page is json type
            data=json.loads(data)
        except (TypeError,ValueError):
            save_page(data)
            raise ValueError('Unable to parse page')

        try:        # check if the page is empty
            mod_type=data['cards'][0]['mod_type']
        except (KeyError,IndexError,TypeError):
            save_page(json.dumps(data))
            raise ValueError('The type of this page is incorrect')

        if 'empty' in mod_type:
            raise ValueError('This page is empty')

        try:        # get card group as new data
            data=data['cards'][0]['card_group']
        except (KeyError,IndexError,TypeError):
            save_page(json.dumps(data))
            raise ValueError('The type of this page is incorrect')

        data_list=[]
        for block in data:
            res=self.parse_card_group(block)
            data_list.append(res)

        return data_list

    def parse_card_group(self,data):
        data=data['mblog']
        msg=self.parse_card_inner(data)
        return msg

    def parse_card_inner(self,data):
        msg={}
        keys=list(data.keys())

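        key_array=[
            # basic fields ---------------------------------------------------------
            'idstr',                        # same as id, but as a string
            'id',                           # post id
            'created_timestamp',          # creation timestamp, e.g. 1448617509
            'attitudes_count',            # number of likes
            'reposts_count',              # number of reposts
            'comments_count',             # number of comments
            'isLongText',                  # whether this is a long post (always False so far)
            'source',                      # client the user posted from (iPhone, etc.)
            'pid',                         # meaning unclear, but worth saving; may be empty
            'bid',                         # meaning unclear, but worth saving; may be empty
            # image fields ---------------------------------------------------------
            'original_pic',               # URL of the original image
            'bmiddle_pic',                # same as original_pic with 'large' replaced by 'bmiddle'
            'thumbnail_pic',              # URL appears to match the image URL in pic, at thumb size
            'pic_ids',                     # image ids, an array
            'pics',                        # present only if the post has images; an array of dicts
                                            # with size, pid, geo, url, etc.
        ]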

        for item in keys:
            if item in key_array:
                msg[item]=data[item]

        # merge id, mid and msg_id: fall back to mid or msg_id when id is missing
        if 'id' not in keys:
            if 'mid' in keys:
                msg['id']=data['mid']
            elif 'msg_id' in keys:
                msg['id']=data['msg_id']

        if 'attitudes_count' not in keys and 'like_count' in keys:
            msg['attitudes_count']=data['like_count']

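        # created_at: keep full 'year-month-day ...' strings as they are; otherwise
        # rebuild the date from the timestamp or prepend the current year
        if 'created_at' in keys:
            if len(data['created_at'])>14: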
                msg['created_at']=data['created_at']
            else:
                if 'created_timestamp' in keys:
                    stamp=data['created_timestamp']
                    x=time.localtime(stamp)
                    str_time=time.strftime('%Y-%m-%d %H:%M',x)
                    msg['created_at']=str_time
                else:
                    msg['created_at']=config.CURRENT_YEAR+'-'+data['created_at']

        # retweeted_status
        if 'retweeted_status' in keys:
            msg['retweeted_status']=self.parse_card_inner(data['retweeted_status'])
            msg['is_retweeted']=True
        else:
            msg['is_retweeted']=False

        # user
        if 'user' in keys:
            msg['user']=self.parse_user_info(data['user'])
            msg['user_id']=msg['user']['uid']
            msg['user_name']=msg['user']['name']

        # url_struct
        # msg['url_struct']=self.parse_url_struct(data['url_struct'])

        # page_info
        if 'page_info' in keys:
            msg['page_info']=self.parse_page_info(data['page_info'])

        # topic_struct
        if 'topic_struct' in keys:
            msg['topic_struct']=self.parse_topic_struct(data['topic_struct'])

        # text
        if 'text' in keys:
            msg['ori_text']=data['text']
            msg['dealed_text']=self.parse_text(data['text'])
        return msg

    def parse_user_info(self,user_data):
        keys=user_data.keys()
        user={}
        if 'id' in keys:
            user['uid']=str(user_data['id'])
        if 'screen_name' in keys:
            user['name']=user_data['screen_name']
        if 'description' in keys:
            user['description']=user_data['description']
        if 'fansNum' in keys:
            temp=user_data['fansNum']
            if isinstance(temp,str):
                temp=int(temp.replace('万','0000'))
            user['fans_num']=temp
        if 'gender' in keys:
            if user_data['gender']=='m':
                user['gender']='male'
            if user_data['gender']=='f':
                user['gender']='female'
        if 'profile_url' in keys:
            user['basic_page']='http://m.weibo.cn'+user_data['profile_url']
        if 'verified' in keys:
            user['verified']=user_data['verified']
        if 'verified_reason' in keys:
            user['verified_reason']=user_data['verified_reason']
        if 'statuses_count' in keys:
            temp=user_data['statuses_count']
            if isinstance(temp,str):
                temp=int(temp.replace('万','0000'))
            user['blog_num']=temp
        return user

    def parse_text(self,text):
        msg={}

        # external links (data-url anchors)
        data_url=re.findall(self.p_link,text)
        if data_url:
            msg['data_url']=[self.parse_text_data_url(block) for block in data_url]
        text=re.sub(self.p_link,'',text)

        # topics
        topic=re.findall(self.p_topic,text)
        if topic:
            msg['topic']=[self.parse_text_topic(block) for block in topic]
        text=re.sub(self.p_topic,'',text)

        # emoticons: first those wrapped in <i> tags, then bare [xxx] markers
        motion=[]
        for item in re.findall(self.p_face_i,text):
            faces=re.findall(self.p_face,item)
            if faces:       # an <i> tag may carry no [xxx] marker
                motion.append(faces[0])
        text=re.sub(self.p_face_i,'',text)

        motion=motion+re.findall(self.p_face,text)
        if motion:
            msg['motion']=motion
        text=re.sub(self.p_face,'',text)

        # user mentions: record each link, then leave a bare '@' in the text
        user_res=re.findall(self.p_user,text)
        if user_res:
            msg['user']=[self.parse_text_user(item) for item in user_res]
        text=re.sub(self.p_user,'@',text)

        # what remains is plain text, split on the '//' separator
        msg['left_content']=text.split('//')
        return msg

    def parse_text_data_url(self,text):
        link_data={}
        link_data['type']='data_url'

        try:
            res_face=re.findall(self.p_face_i,text)[0]
            res_img=re.findall(self.p_img,res_face)[0]
            res_http=re.findall(self.p_http_png,res_img)[0]
            link_data['img']=res_http

            res_class=re.findall(r'<i.+?>',text)[0]
            link_data['class']=res_class

            text=re.sub(self.p_face_i,'',text)
        except IndexError:
            pass

        try:
            res_span=re.findall(self.p_span,text)[0]
            title=re.findall(r'>.+?<',res_span)[0][1:-1]
            link_data['title']=title
            text=re.sub(self.p_span,'',text)
        except IndexError:
            pass

        try:
            data_url=re.findall(r'data-url=".+?"',text)[0]
            data_url=re.findall(r'".+?"',data_url)[0][1:-1]
            link_data['short_url']=data_url

            url=re.findall(r'href=".+?"',text)[0][6:-1]
            link_data['url']=url
        except IndexError:
            pass

        # print(text)
        # print(json.dumps(link_data,indent=4))
        return link_data

    def parse_text_topic(self,text):
        data={}

        try:
            data['type']='topic'
            data['class']=re.findall(r'class=".+?"',text)[0][7:-1]
            data['title']=re.findall(r'>.+?<',text)[0][1:-1]
            data['url']='http://m.weibo.cn'+re.findall(r'href=".+?"',text)[0][6:-1]
        except IndexError:
            pass

        return data

    def parse_text_user(self,text):
        data={}
        data['type']='user'

        try:
            data['title']=re.findall(r'>.+?<',text)[0][2:-1]
            data['url']= 'http://m.weibo.cn'+re.findall(r'href=".+?"',text)[0][6:-1]
        except IndexError:
            pass

        return data

    def parse_url_struct(self,data):
        url_struct=[]
        for block in data:
            keys=block.keys()
            new_block=block
            url_struct.append(new_block)
        return url_struct

    def parse_page_info(self,data):
        keys=data.keys()
        key_array=[
            'page_url',
            'page_id',
            'content2',
            'tips',
            'page_pic',
            'page_desc',
            'object_type',
            'page_title',
            'content1',
            'type',
            'object_id'
        ]
        msg={}
        for item in keys:
            if item in key_array:
                msg[item]=data[item]
        return msg

    def parse_topic_struct(self,data):
        msg=[]
        for block in data:
            keys=block.keys()
            temp=block
            if 'topic_title' in keys:
                temp['topic_url']='http://m.weibo.cn/k/{topic}?from=feed'\
                    .format(topic=block['topic_title'])
            msg.append(temp)
        return msg
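
One detail in parse_user_info worth a second look: replacing '万' with '0000' works for a value like '12万', but would fail on a decimal such as '1.5万' if the API ever returns one. That decimals occur is an assumption, not something confirmed by the data above. A minimal sketch of a more forgiving conversion, where `parse_count` is a hypothetical helper name:

```python
def parse_count(value):
    """Convert counts such as 120, '120' or '1.5万' to an int."""
    if isinstance(value, int):
        return value
    value = value.strip()
    if value.endswith('万'):
        # 万 means ten thousand; round() guards against float artifacts
        return int(round(float(value[:-1]) * 10000))
    return int(value)

counts = [parse_count(v) for v in ('12万', '1.5万', '345', 7)]
print(counts)
```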

  
 

Source: blog.csdn.net. Author: 考古学家lx. Copyright belongs to the original author; please contact the author before reposting.

Original link: blog.csdn.net/weixin_43582101/article/details/93858536
