Reading a very large StackOverflow XML file with Java in Spring Boot

Jet Ding, posted 2020/09/30 16:50:15

Programming language: Java
Framework: Spring Boot
Test file: 73 GB

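    // Stream the file through SAX so the document is never loaded into memory as a whole.
    // Needs java.io.File, java.io.IOException, javax.xml.parsers.ParserConfigurationException,
    // javax.xml.parsers.SAXParser, javax.xml.parsers.SAXParserFactory and org.xml.sax.SAXException.
    private void loadXML() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("D:/data/stackoverflow/stackoverflow.com-Posts/PostsCopy.xml");
            RowHandler rowHandler = new RowHandler();
            // The parser invokes the RowHandler callbacks as it reads each element.
            parser.parse(file, rowHandler);
        } catch (ParserConfigurationException | IOException | SAXException e) {
            e.printStackTrace();
        }
    }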

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RowHandler extends DefaultHandler {
    public RowHandler() {
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Print every attribute of the current element as name=value.
        for (int i = 0; i < attributes.getLength(); i++) {
            System.out.println(attributes.getQName(i) + "=" + attributes.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // No-op: element text is not used here
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // No-op: nothing to do when an element closes
    }
}

Output:
Id=42660
PostTypeId=2
ParentId=42649
CreationDate=2008-09-03T21:41:27.337
Score=1
Body=

<a href="http://www.netezza.com/" rel="nofollow noreferrer">Netezza</a> and other datawarehouse appliances scale this way, but they are not good for OLTP and web app workloads.

OwnerUserId=1293
OwnerDisplayName=Jason
LastActivityDate=2008-09-03T21:41:27.337
CommentCount=0
Id=42662
PostTypeId=2
ParentId=42357
CreationDate=2008-09-03T21:42:27.193
Score=1
Body=<blockquote>

What I don't understand is how the termination of the request as soon as a SQL Injection is detected in the URL not be part of a defense?

(I'm not claiming this to be the entire solution - just part of the defense.)

</blockquote>
<ul>
<li>Every database has its own extensions to SQL. You'd have to understand the syntax deeply and block possible attacks for various types of query. Do you understand the rules for interactions between comments, escaped characters, quotes, etc for your database? Probably not.</li>
<li>Looking for fixed strings is fragile. In your example, you block <code>cast(0x</code>, but what if the attacker uses <code>CAST (0x</code>? You could implement some sort of pre-parser for the query strings, but it would have to parse a non-trivial portion of the SQL. SQL is notoriously difficult to parse.</li>
<li>It muddies up the URL dispatch, view, and database layers. Your URL dispatcher will have to know which views use <code>SELECT</code>, <code>UPDATE</code>, etc and will have to know which database is used.</li>
<li>It requires active updating of the URL scanner. Every time a new injection is discovered -- and believe me, there will be <em>many</em> -- you'll have to update it. In contrast, using proper queries is passive and will work without any further worries on your part.</li>
<li>You'll have to be careful that the scanner never blocks legitimate URLs. Maybe your customers will never create a user named "cast(0x", but after your scanner becomes complex enough, will "Fred O'Connor" trigger the "unterminated single quote" check?</li>
<li>As mentioned by @chs, there are more ways to get data into an app than the query string. Are you prepared to test every view that can be <code>POST</code>ed to? Every form submission and database field?</li>
</ul>
OwnerUserId=3560
OwnerDisplayName=John Millikin
LastActivityDate=2008-09-03T21:42:27.193
CommentCount=0
Id=42665
PostTypeId=2
ParentId=42648
CreationDate=2008-09-03T21:42:55.693
Score=14
Body=

<strong>@@IDENTITY</strong> is the last identity inserted using the current SQL Connection. This is a good value to return from an insert stored procedure, where you just need the identity inserted for your new record, and don't care if more rows were added afterward.

<strong>SCOPE_IDENTITY</strong> is the last identity inserted using the current SQL Connection, and in the current scope -- that is, if there was a second IDENTITY inserted based on a trigger after your insert, it would not be reflected in SCOPE_IDENTITY, only the insert you performed. Frankly, I have never had a reason to use this.

<strong>IDENT_CURRENT(tablename)</strong> is the last identity inserted regardless of connection or scope. You could use this if you want to get the current IDENTITY value for a table that you have not inserted a record into.

OwnerUserId=2194
OwnerDisplayName=Guy Starbuck
LastEditorUserId=2194
LastEditDate=2017-10-10T21:23:44.847
LastActivityDate=2017-10-10T21:23:44.847
CommentCount=2
Id=42666
PostTypeId=2
ParentId=42550
CreationDate=2008-09-03T21:43:15.383
Score=1
Body=

Aza Raskin has talked about recognising when selected text is an address in his <a href="http://www.azarask.in/blog/post/new-tabs/" rel="nofollow noreferrer">Firefox Proposal: A Better New Tab Screen</a>. No code yet, but I mention it as there may be code in firefox to do this in the future.

Alternatively, you could look at using the <a href="https://wiki.mozilla.org/Labs/Ubiquity/Ubiquity_0.1_User_Tutorial#The_Map_command" rel="nofollow noreferrer">map command in Ubiquity</a>, although you'd have to select the addresses yourself.

OwnerUserId=2541
OwnerDisplayName=Sam Hasler
LastActivityDate=2008-09-03T21:43:15.383
CommentCount=0
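
Printing every attribute of a file this size quickly floods the console, so in practice the handler would more likely keep a small running summary. The following is a minimal sketch of that idea, not the author's code: it assumes each post is a `row` element carrying a `PostTypeId` attribute, as the output above suggests, and the class name `PostTypeCounter` and its `count` helper are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tallies posts per PostTypeId instead of printing them.
class PostTypeCounter extends DefaultHandler {
    private final Map<String, Long> counts = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Assumption: every post is a <row> element; anything else is skipped.
        if (!"row".equals(qName)) {
            return;
        }
        String type = attributes.getValue("PostTypeId");
        if (type != null) {
            counts.merge(type, 1L, Long::sum);
        }
    }

    Map<String, Long> getCounts() {
        return counts;
    }

    // Streams the input through SAX; only the small tally map is kept in memory.
    static Map<String, Long> count(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PostTypeCounter handler = new PostTypeCounter();
        parser.parse(in, handler);
        return handler.getCounts();
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory stand-in for the real dump file.
        String xml = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\"/>"
                + "<row Id=\"2\" PostTypeId=\"2\"/>"
                + "<row Id=\"3\" PostTypeId=\"2\"/>"
                + "</posts>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(count(in));
    }
}
```

For the real file, the same `count` method could be given a `java.io.FileInputStream` opened on the dump. The map stays tiny regardless of how many rows are read, which is the main reason to prefer SAX over a DOM parser at this scale.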

