Reading a very large Stack Overflow XML file with Java in Spring Boot
Language: Java
Framework: Spring Boot
Test file: 73G
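import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// SAX streams the file and fires handler callbacks per element,
// so the whole document is never loaded into memory at once.
private void loadXML() {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    try {
        SAXParser parser = factory.newSAXParser();
        File file = new File("D:/data/stackoverflow/stackoverflow.com-Posts/PostsCopy.xml");
        RowHandler rowHandler = new RowHandler();
        parser.parse(file, rowHandler);
    } catch (ParserConfigurationException | IOException | SAXException e) {
        e.printStackTrace();
    }
}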
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RowHandler extends DefaultHandler {

    public RowHandler() {
    }

    // Called once per opening tag; prints every attribute as name=value.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        for (int i = 0; i < attributes.getLength(); i++) {
            System.out.println(attributes.getQName(i) + "=" + attributes.getValue(i));
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Intentionally empty: this handler only prints attributes.
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Intentionally empty: nothing to do when an element closes.
    }
}
Output:
Id=42660
PostTypeId=2
ParentId=42649
CreationDate=2008-09-03T21:41:27.337
Score=1
Body=
<a href="http://www.netezza.com/" rel="nofollow noreferrer">Netezza</a> and other datawarehouse appliances scale this way, but they are not good for OLTP and web app workloads.
OwnerUserId=1293
OwnerDisplayName=Jason
LastActivityDate=2008-09-03T21:41:27.337
CommentCount=0
Id=42662
PostTypeId=2
ParentId=42357
CreationDate=2008-09-03T21:42:27.193
Score=1
Body=<blockquote>
What I don't understand is how the termination of the request as soon as a SQL Injection is detected in the URL not be part of a defense?
(I'm not claiming this to be the entire solution - just part of the defense.)
</blockquote>
<ul>
<li>Every database has its own extensions to SQL. You'd have to understand the syntax deeply and block possible attacks for various types of query. Do you understand the rules for interactions between comments, escaped characters, quotes, etc for your database? Probably not.</li>
<li>Looking for fixed strings is fragile. In your example, you block <code>cast(0x</code>, but what if the attacker uses <code>CAST (0x</code>? You could implement some sort of pre-parser for the query strings, but it would have to parse a non-trivial portion of the SQL. SQL is notoriously difficult to parse.</li>
<li>It muddies up the URL dispatch, view, and database layers. Your URL dispatcher will have to know which views use <code>SELECT</code>, <code>UPDATE</code>, etc and will have to know which database is used.</li>
<li>It requires active updating of the URL scanner. Every time a new injection is discovered -- and believe me, there will be <em>many</em> -- you'll have to update it. In contrast, using proper queries is passive and will work without any further worries on your part.</li>
<li>You'll have to be careful that the scanner never blocks legitimate URLs. Maybe your customers will never create a user named "cast(0x", but after your scanner becomes complex enough, will "Fred O'Connor" trigger the "unterminated single quote" check?</li>
<li>As mentioned by @chs, there are more ways to get data into an app than the query string. Are you prepared to test every view that can be <code>POST</code>ed to? Every form submission and database field?</li>
</ul>
OwnerUserId=3560
OwnerDisplayName=John Millikin
LastActivityDate=2008-09-03T21:42:27.193
CommentCount=0
Id=42665
PostTypeId=2
ParentId=42648
CreationDate=2008-09-03T21:42:55.693
Score=14
Body=
<strong>@@IDENTITY</strong> is the last identity inserted using the current SQL Connection. This is a good value to return from an insert stored procedure, where you just need the identity inserted for your new record, and don't care if more rows were added afterward.
<strong>SCOPE_IDENTITY</strong> is the last identity inserted using the current SQL Connection, and in the current scope -- that is, if there was a second IDENTITY inserted based on a trigger after your insert, it would not be reflected in SCOPE_IDENTITY, only the insert you performed. Frankly, I have never had a reason to use this.
<strong>IDENT_CURRENT(tablename)</strong> is the last identity inserted regardless of connection or scope. You could use this if you want to get the current IDENTITY value for a table that you have not inserted a record into.
OwnerUserId=2194
OwnerDisplayName=Guy Starbuck
LastEditorUserId=2194
LastEditDate=2017-10-10T21:23:44.847
LastActivityDate=2017-10-10T21:23:44.847
CommentCount=2
Id=42666
PostTypeId=2
ParentId=42550
CreationDate=2008-09-03T21:43:15.383
Score=1
Body=
Aza Raskin has talked about recognising when selected text is an address in his <a href="http://www.azarask.in/blog/post/new-tabs/" rel="nofollow noreferrer">Firefox Proposal: A Better New Tab Screen</a>. No code yet, but I mention it as there may be code in firefox to do this in the future.
Alternatively, you could look at using the <a href="https://wiki.mozilla.org/Labs/Ubiquity/Ubiquity_0.1_User_Tutorial#The_Map_command" rel="nofollow noreferrer">map command in Ubiquity</a>, although you'd have to select the addresses yourself.
OwnerUserId=2541
OwnerDisplayName=Sam Hasler
LastActivityDate=2008-09-03T21:43:15.383
CommentCount=0
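Printing every attribute of a 73G file produces far more console output than anyone can read. A handler can aggregate while it streams instead. The following is a minimal sketch, not part of the original code: it assumes each post is an element named `row` (the element name does not appear in the output above), and `PostTypeCounter` and `count` are hypothetical names chosen for illustration. It looks attributes up by name with `Attributes.getValue(String)` and counts posts per `PostTypeId`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts rows per PostTypeId instead of printing them.
class PostTypeCounter extends DefaultHandler {
    private final Map<String, Long> countsByType = new TreeMap<>();
    private long rows;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Assumption: posts are <row .../> elements; everything else is skipped.
        if (!"row".equals(qName)) {
            return;
        }
        rows++;
        String type = attributes.getValue("PostTypeId");
        if (type != null) {
            countsByType.merge(type, 1L, Long::sum);
        }
    }

    public long getRows() {
        return rows;
    }

    public Map<String, Long> getCountsByType() {
        return countsByType;
    }

    // Streams the input through a SAX parser and returns the filled handler.
    public static PostTypeCounter count(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PostTypeCounter handler = new PostTypeCounter();
        parser.parse(in, handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Tiny in-memory sample standing in for the real dump file.
        String xml = "<posts>"
                + "<row Id=\"1\" PostTypeId=\"1\"/>"
                + "<row Id=\"2\" PostTypeId=\"2\"/>"
                + "<row Id=\"3\" PostTypeId=\"2\"/>"
                + "</posts>";
        PostTypeCounter c = count(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(c.getRows() + " rows, by type: " + c.getCountsByType());
    }
}
```

For the real file, the same handler could be passed to `parser.parse(file, handler)` in `loadXML()` in place of `RowHandler`. Memory use stays small because only the counters are kept, not the rows themselves.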