Reading a Very Large StackOverflow XML File with Go

Jet Ding, posted 2020/09/30 16:57:49


This article presents one way to read a very large XML file in Go.

Language:
Go

Test file size: 73 GB

Sample program:

package main

import (
    "encoding/xml"
    "fmt"
    "io"
    "os"
)

func loadXml() {
    xmlFile, err := os.Open("D:/data/stackoverflow/stackoverflow.com-Posts/PostsCopy.xml")
    if err != nil {
        panic(err)
    }
    defer xmlFile.Close()

    decoder := xml.NewDecoder(xmlFile)
    for {
        // Read tokens from the XML document in a stream.
        t, err := decoder.Token()
        if err == io.EOF {
            // End of file: every token has been read.
            break
        }
        if err != nil {
            // Malformed XML or an I/O failure; do not stop silently.
            panic(err)
        }
        // Inspect the type of the token just read.
        switch unit := t.(type) {
        case xml.StartElement:
            // Each post is stored as a <row> element; print all of its attributes.
            if unit.Name.Local == "row" {
                for _, attr := range unit.Attr {
                    fmt.Println(attr.Name.Local, "=", attr.Value)
                }
            }
        }
    }
}

func main() {
    loadXml()
}

Output:

OwnerUserId = 96
OwnerDisplayName = Chris
LastEditorUserId = 4551041
LastEditorDisplayName = Chris Marasti-Georg
LastEditDate = 2018-12-13T17:30:17.253
LastActivityDate = 2019-09-08T02:35:05.200
Title = How do I print an HTML document from a web service?
Tags = <c#><html><web-services><printing>
AnswerCount = 7
CommentCount = 4
FavoriteCount = 7
Id = 175
PostTypeId = 1
CreationDate = 2008-08-01T18:36:14.070
Score = 48
ViewCount = 6956
Body =

I want to be able to display a normal YouTube video with overlaid annotations, consisting of coloured rectangles for each frame. The only requirement is that this should be done programmatically.

YouTube has annotations now, but require you to use their front end to create them by hand. I want to be able to generate them. What's the best way of doing this?

Some ideas:

<blockquote>
<ol>
<li>Build your own Flash player (ew?)</li>
<li>Somehow draw over the YouTube Flash player. Will this work?</li>
<li>Reverse engineer & hijack YouTube's annotation system. Either messing with the local files or redirecting its attempt to download
the annotations. (using Greasemonkey? Firefox plugin?)</li>
</ol>
</blockquote>

Idea that doesn't count:

download the video

</blockquote>
OwnerUserId = 2089740
LastEditorUserId = 1836618
LastEditDate = 2017-08-21T08:54:47.013
LastActivityDate = 2019-05-23T23:54:08.903
Title = Annotating YouTube videos programmatically
Tags = <youtube><reverse-engineering>
AnswerCount = 3
CommentCount = 0
FavoriteCount = 2
ClosedDate = 2019-05-24T02:46:43.597
Id = 176
PostTypeId = 1
AcceptedAnswerId = 207
CreationDate = 2008-08-01T18:37:40.150
Score = 111
ViewCount = 95962
Body =

On one Linux Server running Apache and PHP 5, we have multiple Virtual Hosts with separate log files. We cannot seem to separate the php <code>error_log</code> between virtual hosts.

Overriding this setting in the <code><Location></code> of the <code>httpd.conf</code> does not seem to do anything.

Is there a way to have separate php <code>error_logs</code> for each Virtual Host?

OwnerUserId = 91
LastEditorUserId = 3919949
LastEditDate = 2018-01-30T17:47:19.737
LastActivityDate = 2019-05-01T11:35:00.707
Title = error_log per Virtual Host?
Tags = <linux><apache><virtualhost>
AnswerCount = 11
CommentCount = 0
FavoriteCount = 23
Id = 180
PostTypeId = 1
AcceptedAnswerId = 539
CreationDate = 2008-08-01T18:42:19.343
Score = 67
ViewCount = 16399
Body =

This is something I've pseudo-solved many times and have never quite found a solution for.

The problem is to come up with a way to generate <code>N</code> colors, that are as distinguishable as possible where <code>N</code> is a parameter.

OwnerUserId = 2089740
LastEditorUserId = 5321363
LastEditDate = 2018-05-30T13:53:46.533
LastActivityDate = 2018-05-30T13:55:59.787
Title = Function for creating color wheels
Tags = <algorithm><language-agnostic><colors><color-space>
AnswerCount = 8
CommentCount = 1
FavoriteCount = 23
ClosedDate = 2017-08-16T09:55:30.000
Id = 183
PostTypeId = 2
ParentId = 123
CreationDate = 2008-08-01T18:51:12.090
Score = 65
Body =

Maybe this might help: <a href="http://jsefa.sourceforge.net/quick-tutorial.html" rel="noreferrer">JSefa</a>

You can read CSV file with this tool and serialize it to XML.

OwnerUserId = 86
LastEditorUserId = 395659
LastEditDate = 2013-02-07T10:36:04.283
LastActivityDate = 2013-02-07T10:36:04.283
CommentCount = 0
Id = 190
PostTypeId = 2
ParentId = 123
CreationDate = 2008-08-01T19:21:57.517
Score = 14
Body =

I don't understand why you would want to do this. It sounds almost like cargo cult coding.

Converting a CSV file to XML doesn't add any value. Your program is already reading the CSV file, so arguing that you need XML doesn't work.

On the other hand, reading the CSV file, doing <em>something</em> with the values, and then serializing to XML does make sense (well, as much as using XML can make sense... ;)) but you would supposedly already have a means of serializing to XML.

OwnerUserId = 55
LastActivityDate = 2008-08-01T19:21:57.517
CommentCount = 0
Id = 192
PostTypeId = 1
AcceptedAnswerId = 258
CreationDate = 2008-08-01T19:23:13.117
Score = 63
ViewCount = 2711
Body =

One of the fun parts of multi-cultural programming is number formats.

<ul>
<li>Americans use 10,000.50</li>
<li>Germans use 10.000,50</li>
<li>French use 10 000,50</li>
</ul>

My first approach would be to take the string, parse it backwards until I encounter a separator and use this as my decimal separator. There is an obvious flaw with that: 10.000 would be interpreted as 10.

Another approach: if the string contains 2 different non-numeric characters, use the last one as the decimal separator and discard the others. If I only have one, check if it occurs more than once and discards it if it does. If it only appears once, check if it has 3 digits after it. If yes, discard it, otherwise, use it as decimal separator.

The obvious "best solution" would be to detect the User's culture or Browser, but that does not work if you have a Frenchman using an en-US Windows/Browser.

Does the .net Framework contain some mythical black magic floating point parser that is better than <code>Double.(Try)Parse()</code> in trying to auto-detect the number format?

OwnerUserId = 91
LastEditorUserId = 567854
LastEditorDisplayName = Michael Stum
LastEditDate = 2019-01-20T13:48:20.567
LastActivityDate = 2019-04-17T14:27:09.093
Title = Floating Point Number parsing: Is there a Catch All algorithm?
Tags = <c#><.net><asp.net><internationalization><globalization>
AnswerCount = 4
CommentCount = 0
FavoriteCount = 0
Id = 194
PostTypeId = 1
AcceptedAnswerId = 197
CreationDate = 2008-08-01T19:26:37.883
Score = 35
ViewCount = 3853
Body =

Yes, I know. The existence of a running copy of <code>SQL Server 6.5</code> in 2008 is absurd.

That stipulated, what is the best way to migrate from <code>6.5</code> to <code>2005</code>? Is there any direct path? Most of the documentation I've found deals with upgrading <code>6.5</code> to <code>7</code>.

Should I forget about the native <code>SQL Server</code> upgrade utilities, script out all of the objects and data, and try to recreate from scratch?

I was going to attempt the upgrade this weekend, but server issues pushed it back till next. So, any ideas would be welcomed during the course of the week.

<em>Update. This is how I ended up doing it:</em>

<ul>
<li>Back up the database in question and Master on <code>6.5</code>.</li>
<li>Execute <code>SQL Server 2000</code>'s <code>instcat.sql</code> against <code>6.5</code>'s Master. This allows <code>SQL Server 2000</code>'s OLEDB provider to connect to <code>6.5</code>.</li>
<li>Use <code>SQL Server 2000</code>'s standalone <code>"Import and Export Data"</code> to create a DTS package, using <code>OLEDB</code> to connect to 6.5. This successfully copied all <code>6.5</code>'s tables to a new <code>2005</code> database (also using <code>OLEDB</code>).</li>
<li>Use <code>6.5</code>'s Enterprise Manager to script out all of the database's indexes and triggers to a .sql file.</li>
<li>Execute that .sql file against the new copy of the database, in 2005's Management Studio.</li>
<li>Use 6.5's Enterprise Manager to script out all of the stored procedures.</li>
<li>Execute that <code>.sql</code> file against the <code>2005</code> database. Several dozen sprocs had issues making them incompatible with <code>2005</code>. Mainly <code>non-ANSI joins</code> and <code>quoted identifier issues</code>.</li>
<li>Corrected all of those issues and re-executed the <code>.sql</code> file.</li>
<li>Recreated the <code>6.5</code>'s logins in <code>2005</code> and gave them appropriate permissions.</li>
</ul>

There was a bit of rinse/repeat when correcting the stored procedures (there were hundreds of them to correct), but the upgrade went great otherwise.

Being able to use Management Studio instead of <code>Query Analyzer</code> and <code>Enterprise Manager 6.5</code> is such an amazing difference. A few report queries that took 20-30 seconds on the <code>6.5 database</code> are now running in 1-2 seconds, without any modification, new indexes, or anything. I didn't expect that kind of immediate improvement.

OwnerUserId = 60
LastEditorUserId = 165216
LastEditorDisplayName = Nigel Campbell
LastEditDate = 2012-06-12T05:35:46.927
LastActivityDate = 2017-01-17T09:38:49.320
Title = Upgrading SQL Server 6.5
Tags = <sql-server><migration>
AnswerCount = 4
CommentCount = 0
FavoriteCount = 2
Id = 195
PostTypeId = 2
ParentId = 173
CreationDate = 2008-08-01T19:28:25.447
Score = 39
