
Construction and Application of Data Standard in Big Data Environment


 

Haitian Jia

Suzhou Institute of Trade & Commerce

Suzhou 215009, China

Email: 11804709@qq.com

Chun Jia

Yellow River Conservancy Technical Institute

Kaifeng 475004, China

Email: smy1161666@163.com


From the 2019 International Conference on Big Data Engineering


ABSTRACT

School informatization construction has gone through ten years, and multiple systems operate independently. Meanwhile, unstructured data resources such as security monitoring, the smart-card system, and online course resources have become more and more important in the construction of the intelligent campus. According to the present condition of schools, this paper gives a data model for the information construction of colleges and universities. A Hadoop distributed system architecture is applied to fuse structured data and unstructured data, providing a basis for data analysis and decision-making. Big data will be the direction in which the intelligent campus evolves in the next few years, and it will promote the construction and implementation of school informatization.

Keywords

Hadoop; Unstructured Data; Data Standard; Spark

1.     INTRODUCTION

The construction of big data has become an important development direction in building the intelligent campus. Data governance aimed at discovering the value of data has become a concern of all parties. This paper builds a model based on Hadoop and its ecosystem and proposes a Hadoop-based distributed system infrastructure. The solution mainly involves data extraction and cleaning, Hadoop data storage and analysis, data standard construction, and intelligent analysis and visualization. Big data construction will play a key role in the construction of the smart campus. All implementation tools in this article are open-source technologies.

2.     THE CONSTRUCTION OF THE INTELLIGENT CAMPUS IN THE BIG DATA ENVIRONMENT

In the process of building the intelligent campus, data extraction and cleaning are mainly done with Kettle or Sqoop. The data warehouse adopts Hive and HBase, and the data standard is built in a MySQL database according to the Ministry of Education information standards. Davinci completes the intelligent analysis and visual display. The composition of each part is shown in fig. 1.

fig. 1 Intelligent campus model in the big data environment

2.1     Data extraction and cleaning

In the process of building the intelligent campus, data extraction mainly realizes the collection of existing system data. Collection is performed with the Kettle tool, and the collected data is loaded into the original database. The original database includes the structured database MySQL and the unstructured stores Hive and HBase. Hive and HBase are both based on the HDFS distributed file system, which can hold unstructured data. The original database includes data from various structured applications and unstructured applications (online behavior and attendance video).

After data collection, data cleaning is carried out according to the data elements defined in the data standard. The cleaning work mainly includes (1) pre-processing, (2) missing-value cleaning, (3) format and content cleaning, (4) logical-error cleaning, (5) non-requirement structure cleaning, and (6) relevance verification.
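As an illustration, the following Spark sketch applies steps (2), (3), (4), and (6) to a hypothetical attendance export; the file paths and column names are assumptions, not part of the paper's implementation.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CleanRawData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CleanRawData").getOrCreate()

    // Hypothetical raw attendance export collected by Kettle
    val raw = spark.read.option("header", "true").csv("hdfs:///raw/attendance.csv")

    val cleaned = raw
      .na.drop(Seq("student_id"))                           // (2) drop rows missing the key field
      .withColumn("student_id", trim(col("student_id")))    // (3) format/content cleaning
      .withColumn("check_in",
        to_timestamp(col("check_in"), "yyyy-MM-dd HH:mm:ss"))
      .filter(col("check_in").isNotNull)                    // (4) drop logically invalid timestamps
      .dropDuplicates("student_id", "check_in")             // (6) de-duplicate before verification

    cleaned.write.mode("overwrite").parquet("hdfs:///cleaned/attendance")
    spark.stop()
  }
}
```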

2.2     Data warehouse model construction

HBase and Hive occupy different positions in the Hadoop architecture. HBase mainly solves the problem of real-time data query, while Hive mainly solves data processing and computation. HBase is a NoSQL database built on Hadoop. Hive is a Hadoop data warehouse that enables developers to compute and process structured data on HDFS via SQL, which suits off-line bulk data calculation.
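To make the real-time side concrete, here is a minimal sketch of an HBase point lookup with the standard HBase client API; the table name, row key, and column names are hypothetical. The Hive (batch) side is sketched after the process list below.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HBasePointQuery {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()             // reads hbase-site.xml from the classpath
    val conn = ConnectionFactory.createConnection(conf)
    try {
      // Table, row key, and column names are hypothetical
      val table  = conn.getTable(TableName.valueOf("card_consume"))
      val result = table.get(new Get(Bytes.toBytes("20190001_20190809")))  // point lookup by row key
      val amount = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("amount")))
      println(s"amount = $amount")
      table.close()
    } finally conn.close()
  }
}
```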

The data warehouse establishment and use process is as follows (a sketch follows the list):

(1) use the Kettle tool to extract data sources to HDFS storage;

(2) clean, process, and calculate the original data through Hive;

(3) store the Hive cleaning results in HBase for random-query scenarios over large data sets;

(4) query data from HBase in data applications.
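As referenced above, here is a minimal Spark SQL sketch covering steps (2) and (3), assuming step (1) has already landed the raw table on HDFS via Kettle; the database, table, and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object HiveBatchClean {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveBatchClean")
      .enableHiveSupport()        // use the Hive metastore so SQL runs over tables on HDFS
      .getOrCreate()

    // Step (2): clean and aggregate the raw table off-line with SQL
    val daily = spark.sql(
      """SELECT card_id, to_date(ts) AS day, SUM(amount) AS total
        |FROM raw.card_consume
        |WHERE amount IS NOT NULL
        |GROUP BY card_id, to_date(ts)""".stripMargin)

    // Toward step (3): persist the cleaned result; from here it can be bulk-loaded
    // into HBase to serve the random queries of step (4)
    daily.write.mode("overwrite").saveAsTable("dw.card_consume_daily")
    spark.stop()
  }
}
```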

2.3     Data standard

As an important part of the data center, the main data standards are GB/T 36342—2018 "Smart Campus Overall Framework", JY/T 1001—2012 "Educational Management General Codes", JY/T 1002—2012 "Educational Management General Information", JY/T 1003—2012 "Educational Administration Information", JY/T 1004—2012 "Primary and Secondary School Management Information", JY/T 1005—2012 "Management Information for Vocational Schools", JY/T 1006—2012 "Higher Education Management Information", and JY/T 1007—2012 "Educational Statistical Information". Together these constitute the standards of educational information.

These data standards are organized and converted from documents into data structures to complete the definition of metadata. The metadata item definition gives the representation structure (data items) of each data element and defines the data element's metadata structure.

2.4     Intelligent analysis and visualization

After the data standard construction is completed, we enter the intelligent analysis and visualization of data. The visualization data comes from two sources: on the one hand, structured standard data display, where the connected data sources are various structured databases; on the other hand, unstructured data presentation, which can connect to the Hive data warehouse. The visualization platform is implemented with the open-source DVaaS (Data Visualization as a Service) platform solution.

The process is: define the data source -> construct the data view -> bind data visualization components -> design the interaction -> design the integration capability.

3.     KEY TECHNOLOGIES AND IMPLEMENTATION

The construction of the intelligent campus in the big data environment involves various technologies of the Hadoop ecosystem, including the distributed system infrastructure Hadoop, data exchange implemented with Kettle, the data bus OpenESB, the Hadoop ecosphere technologies Hive, HBase, Sqoop, and Oozie, the fast general-purpose computing engine Spark, and the data visualization technologies Hue and Zeppelin. This paper mainly introduces the key technologies of MapReduce and Spark and their implementation processes.

3.1     MapReduce Programming model

MapReduce is a programming model for parallel computing over large-scale data sets. It defines two abstract programming interfaces, map and reduce, which are implemented by user code, as sketched after the two interface definitions below.

• map: (k1; v1) → [(k2; v2)]

Input: data represented by a key-value pair (k1; v1).

Processing: a data record, such as a row in a text file or a row in a data table, is passed into the map function in the form of a key-value pair; the map function processes it and outputs another set of key-value pairs [(k2; v2)].

Output: a set of intermediate data represented by the key-value pairs [(k2; v2)].

• reduce: (k2; [v2]) → [(k3; v3)]

Input: the key-value pairs [(k2; v2)] output by map are merged so that the different values under the same key form a list [v2]; the input to reduce is therefore (k2; [v2]).

Processing: collates or further processes the incoming intermediate result list and produces the final form of the output [(k3; v3)].

Output: the final output [(k3; v3)].
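A minimal, single-machine Scala sketch of the abstract model above (word count); this is not the Hadoop API, just the (k1; v1) → [(k2; v2)] → (k2; [v2]) → [(k3; v3)] flow.

```scala
object MapReduceModel {
  // map: (k1; v1) -> [(k2; v2)] — line offset and text in, (word, 1) pairs out
  def map(k1: Long, v1: String): Seq[(String, Int)] =
    v1.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // reduce: (k2; [v2]) -> [(k3; v3)] — word and its list of counts in, total out
  def reduce(k2: String, v2: Seq[Int]): Seq[(String, Int)] =
    Seq((k2, v2.sum))

  def main(args: Array[String]): Unit = {
    val input = Seq(0L -> "big data campus", 1L -> "big data standard")
    val intermediate = input.flatMap { case (k, v) => map(k, v) }   // map phase
    val grouped = intermediate.groupBy(_._1)                        // shuffle: merge values per key
      .map { case (k, pairs) => (k, pairs.map(_._2)) }
    val output = grouped.toSeq.flatMap { case (k, vs) => reduce(k, vs) }
    output.sortBy(_._1).foreach(println)  // (big,2), (campus,1), (data,2), (standard,1)
  }
}
```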

MapReduce is a distributed computing framework. In this programming model, tasks are decomposed into independent sub-problems. MapReduce job execution involves four independent entities:

• Client: writes the MapReduce program, configures the job, and submits the job.

• JobTracker: the main thread, responsible for accepting client job submissions, scheduling tasks to run on work nodes, and providing monitoring of work-node status, task progress, and other functions; each MapReduce cluster has one JobTracker.

• TaskTracker: reports its current status to the JobTracker by periodic heartbeat, each heartbeat containing the number of available map and reduce task slots. The JobTracker uses threads to handle heartbeats and user requests. There can be multiple TaskTrackers performing tasks.

• HDFS: saves job data, configuration information, etc.; the results are also stored on HDFS.

The IntelliJ IDEA development process includes: JDK installation -> Maven installation -> IDE settings.xml configuration to get jar packages from the code repository -> IntelliJ IDEA Maven configuration -> Maven web project -> configure dependent jar packages -> write code to complete development -> package and run.

3.2     Spark cluster computing platform

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It has the characteristics of high scalability, high fault tolerance, and in-memory computing, and belongs to the BDAS ecosystem. The execution process of Spark includes:

(1) the Driver connects to the Master and applies for resources;

(2) the Master schedules the resources;

(3) the Master and Worker communicate by RPC, and the Worker starts the Executor;

(4) the Executor communicates with the Driver;

(5) after an action is triggered on an RDD (Resilient Distributed Dataset), the dependencies are inferred backward from the final RDD, and the graph is split wherever a shuffle is encountered;

(6) after the DAGScheduler splits the stages, it submits the earlier stages first.

After a stage is submitted, it generates tasks. A stage generates many tasks with the same business logic, which are passed to the TaskScheduler in the form of a TaskSet. The TaskScheduler takes tasks from the set and, according to resource availability, sends them to Executors.
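The following minimal sketch makes this pipeline visible: transformations only build the lineage, the shuffle marks the stage boundary, and the action triggers job submission. The input path is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object SparkLineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkLineage").getOrCreate()
    val sc = spark.sparkContext

    val lines  = sc.textFile("hdfs:///cleaned/consume.txt")  // RDD defined, nothing executes yet
    val pairs  = lines.map(l => (l.split(",")(0), 1))        // narrow dependency: same stage
    val counts = pairs.reduceByKey(_ + _)                    // shuffle: DAGScheduler cuts a stage here
    counts.take(10).foreach(println)                         // action: job submitted, stages -> TaskSets
    spark.stop()
  }
}
```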

The IntelliJ IDEA development framework setup process: JDK installation and configuration -> Scala installation and configuration -> Maven installation and configuration -> IntelliJ IDEA Scala plugin configuration -> configure the Scala SDK -> new Maven project -> import the Spark dependencies in pom.xml -> write the Spark code -> package and run.

3.3     Standard data construction

Standardized metadata is built in tables: metadata is composed of data items, and the data items constitute the data standard. The main data item fields include:

• Number: the unique identifier of the data item, using a globally uniform 5-segment, 10-digit code;

• Data item name: usually composed of the capital initial letters of the Chinese abbreviation, corresponding to the Chinese acronym;

• Chinese acronym: the name by which the data element is used; semantic and user-oriented;

• Type: the data type that the data item holds;

• Length: the maximum number of characters the data item can hold;

• Constraint: a description of the data item's constraint state;

• Value space: the scope and specification of the data item's values;

• Explanation / example: a description or example of the data item's attributes;

• Reference number: indicates that this data item refers to the number of another defined data item.

Implementation procedure (a Scala sketch of the same loop follows the list):

• pass parameters into the DBMS stored procedure;

• query the metadata structure table for data type conversion;

• loop through the data and build the standard DDL (data definition language);

• delete the original table and execute the DDL to build the standard table.
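The paper implements this loop inside a DBMS stored procedure; the sketch below mirrors the same logic in Scala. The type codes, constraint flag, and sample items are assumptions for illustration.

```scala
// Build a standard CREATE TABLE DDL from metadata items (hypothetical type-code set).
case class MetaItem(number: String, name: String, dataType: String, length: Int, constraint: String)

object DdlBuilder {
  // Map a metadata type code to a MySQL column type (assumed codes: C/N/D)
  def toColumnType(t: String, len: Int): String = t match {
    case "C" => s"VARCHAR($len)"   // character
    case "N" => s"DECIMAL($len)"   // numeric
    case "D" => "DATE"             // date
    case _   => "TEXT"
  }

  def buildDdl(table: String, items: Seq[MetaItem]): String = {
    val cols = items.map { m =>
      val nullable = if (m.constraint == "M") " NOT NULL" else ""   // "M" = mandatory (assumed)
      s"  ${m.name} ${toColumnType(m.dataType, m.length)}$nullable"
    }
    // Delete the original table, then rebuild it from the standard definition
    s"DROP TABLE IF EXISTS $table;\nCREATE TABLE $table (\n${cols.mkString(",\n")}\n);"
  }

  def main(args: Array[String]): Unit = {
    val items = Seq(
      MetaItem("0101001001", "XM", "C", 36, "M"),    // student name
      MetaItem("0101001002", "CSRQ", "D", 8, "O"))   // date of birth
    println(buildDdl("T_STUDENT", items))
  }
}
```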

3.4     Data skew and its processing methods

Data skew occurs when a non-uniform data distribution causes a large amount of data to be allocated to a single node.

The paper "Efficient Skew Handling for Outer Joins in a Cloud Computing Environment" [1] proposes a query-based outer-join approach with two different communication patterns, distribution and query, which occur between local and remote nodes. This distinguishes the method from conventional hash-based and duplication-based outer joins. The authors apply this refined query-based scheme, with full parallelism, to outer joins on a distributed architecture and use it to handle skew directly.

The paper "A Coflow-based Co-optimization Framework for High-performance Data Analytics" [2] chooses a very efficient method, partial duplication. Its core idea: the large number of skewed tuples in an input relation are kept locally and not transferred at all; instead, only a very small number of non-skewed tuples from the other relation are broadcast to all other nodes.

We consider that there is no fixed solution to data skew: it is necessary to understand the distribution of the data sets, then understand the operating mechanism and bottlenecks of the computing framework, and optimize specifically for the situation at hand.
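As one example of such a situation-specific optimization, here is a sketch of key salting in Spark. This is a general mitigation technique, not the method of the cited papers: skewed keys are spread over several reducers by a random prefix, partially aggregated, then the salt is stripped for a small final aggregation. The input path and fan-out are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object SaltedAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SaltedAggregation").getOrCreate()
    val sc = spark.sparkContext
    val fanOut = 8   // salt range; tune to the cluster and the degree of skew

    val pairs = sc.textFile("hdfs:///cleaned/consume.txt")   // hypothetical input
      .map(l => (l.split(",")(0), 1L))

    val totals = pairs
      .map { case (k, v) => (s"${Random.nextInt(fanOut)}#$k", v) }  // add a random salt prefix
      .reduceByKey(_ + _)                                           // partial sums, skew spread out
      .map { case (sk, v) => (sk.split("#", 2)(1), v) }             // strip the salt
      .reduceByKey(_ + _)                                           // small final aggregation

    totals.take(10).foreach(println)
    spark.stop()
  }
}
```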

4.     DATA MINING MODELING

This section applies the ID3 decision-tree algorithm to data mining and analysis in the campus-card environment. The functions of data mining and its pattern types include association analysis, classification and prediction, cluster analysis, and outlier analysis. The typical technique for classification and prediction is the decision tree; common algorithms include ID3.

The structure of a standard data mining system includes: first, determine the mining object and its target, select an appropriate model, and collect the data; then pre-process the data in order to run the selected data mining algorithm; finally, analyze the results of the data mining.

The ID3 algorithm selects the best test attribute based on information entropy. Design of the ID3 decision tree algorithm:

The initial entropy is calculated as follows (1):

$$E = -\sum_{i=1}^{c} \frac{N_i}{N} \log_2 \frac{N_i}{N} \qquad (1)$$

In the formula: $N_i$ is the number of samples in class $C_i$; $N$ is the total number of samples; $c$ is the number of categories.

In the process of selecting the attribute for the root node of the decision tree, the new entropy of the branches under an attribute is calculated as follows (2):

$$E(A_k) = \sum_{j} \frac{n_{kj}}{N} \left( -\sum_{i=1}^{c} \frac{n_{kj}^{(i)}}{n_{kj}} \log_2 \frac{n_{kj}^{(i)}}{n_{kj}} \right) \qquad (2)$$

In the formula: $A_k$ is the candidate attribute; $n_{kj}$ is the number of samples in branch $j$; $n_{kj}^{(i)}$ is the number of samples belonging to category $C_i$ among the $n_{kj}$ samples of branch $j$.

The information gain is calculated as follows (3):

$$G(A_k) = E - E(A_k) \qquad (3)$$

The attribute $A_{k_0}$ with the largest information gain is selected as the root of the decision tree. The selection condition for $A_{k_0}$ is (4):

$$G(A_{k_0}) = \max_{k} G(A_k) \qquad (4)$$
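A small numeric check of formulas (1)-(4) in Scala follows. The class labels (increase supply: yes/no) are hypothetical, since Table 1 below omits the decision column.

```scala
object Id3Gain {
  private def log2(x: Double): Double = math.log(x) / math.log(2)

  // Formula (1): entropy E of a list of class labels
  def entropy(labels: Seq[String]): Double =
    labels.groupBy(identity).values.map { g =>
      val p = g.size.toDouble / labels.size
      -p * log2(p)
    }.sum

  // Formulas (2) and (3): G(Ak) = E - E(Ak), where E(Ak) weights each branch's
  // entropy by the branch's share of the samples
  def gain(rows: Seq[(String, String)]): Double = {
    val e = entropy(rows.map(_._2))
    val eAk = rows.groupBy(_._1).values.map { branch =>
      branch.size.toDouble / rows.size * entropy(branch.map(_._2))
    }.sum
    e - eAk
  }

  def main(args: Array[String]): Unit = {
    // (attribute value, label) pairs for the attribute "weekend"; labels are hypothetical
    val weekend = Seq("Is" -> "yes", "not" -> "yes", "not" -> "no",
                      "not" -> "no", "not" -> "no", "Is" -> "yes")
    println(f"G(weekend) = ${gain(weekend)}%.3f")  // formula (4): pick the attribute maximizing G
  }
}
```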

Take the student campus card as an example and make a qualitative and quantitative consumption comparison with ID3. As shown in Table 1, according to the weather (Good, Bad), whether it is a weekend (Is, not), and whether there is a promotion (Is, not), the ID3 decision-tree algorithm judges the school students' consumption behavior and decides whether the supply needs to be increased (yes, no).

number   weather   weekend   promotion
1        Bad       Is        Is
2        Good      not       Is
3        Bad       not       not
4        Good      not       not
5        Good      not       not
…        …         …         …
N        Good      Is        not

Table 1 Information gathering

Comparing the data information, the decision tree generated by the ID3 algorithm is shown in fig. 2.

fig. 2 Decision tree generated by ID3

The overall design and implementation idea is shown in fig. 3.

fig. 3 Overall design flow chart of the decision tree algorithm

5.     SUMMARY

The big data framework is based on the distributed application of Hadoop, so that the structured data and unstructured data of colleges and universities are effectively stored, managed, and computed. This paper makes full use of the advantages of Hadoop and its ecosphere, designs and deploys according to the needs of university informatization, and realizes a university information platform based on the combination of Hadoop and Spark. The platform has a good ability to collect, process, analyze, and display a large amount of heterogeneous data. It provides important technical support and means for realizing the portraits of teaching staff and students.

6.     REFERENCES

[1]  Long Cheng, Spyros Kotoulas. Efficient Skew Handling for Outer Joins in a Cloud Computing Environment. IEEE Transactions on Cloud Computing, 6(2): 558-571, 2018.

[2]  Long Cheng, Ying Wang, Yulong Pei, Dick Epema. A Coflow-based Co-optimization Framework for High-performance Data Analytics. Proceedings of the 46th International Conference on Parallel Processing, pp. 392-401, 2017.

[3]  WANG Jingchun. Digital Campus Data Integration Architecture Analysis [J]. Journal of Changchun University of Science and Technology (Natural Science Edition), 2015(03): pp. 148-151.

[4]  XUAN Zuguang, WANG Xiaoxia, XU Xiaohui, ZHANG Hai. Application of ODI in Massive Online Learning System Management [J]. Journal of Zhejiang Wanli University, 2015(02): pp. 79-81.

[5]  FENG Shemiao. Study on Location Model of Aviation Logistics Transfer Center [J]. Railway Transport and Economy, 2015(01): pp. 45-47.

[6]  Tammaro A M, Ross S, Casarosa V. Research Data Curator: The Competencies Gap [J]. BOBCATSSS 2014 Proceedings, 2014, 1(1): pp. 95-100.

[7]  Kouper I. CLIR/DLF Digital Curation Postdoctoral Fellowship: The Hybrid Role of Data Curator [J]. Bulletin of the American Society for Information Science and Technology, 2013, 39(2): pp. 46-47.

[8]  SU Qiuyue, CHEN Xingshu, LUO Yonggang. Access Control Model for Multi-source Heterogeneous Data in Big Data Environment [J]. Network and Information Security, 2019(01): pp. 78-86.

[9]  ZHAO Ya'nan, LI Chaokui, XIAO Keyan, FAN Jianfu. Research on Distributed Storage Method of Geological and Mineral Big Data Based on Hadoop [J]. Geological Bulletin, 2019(Z1): pp. 462-470.

[10]  SHI Min, LU Danhai, QIN Ting. Research on Big Data Platform for College Students Analysis and Service [J]. Information Technology, 2019(02): pp. 5-10.

[11]  WU Zheng. The Construction of the Provincial Water Conservancy Data Center Based on Big Data [J]. Journal of Fujian Computer, 2019(02): pp. 84-86.

[12]  REN Hua, ZHANG Ling, YE Yu. Analysis and Monitoring of Big Data of Users' Network Behavior in Digital Campus [J]. Computer & Digital Engineering, 2017(09): pp. 1814-1818, 1823.

[13]  ZHANG Shuhua. Design of Sports Achievement Data Mining and Physical Fitness Analysis System Based on ID3 Algorithm [J]. Modern Electronics Technique, 2019(05): pp. 104-110.

[14]  PENG Cheng, WEN Yu, LI Chuchang. Medical Big Data Based on Decision Tree Algorithm [J]. Information Technology and Informatization, 2018(09): pp. 70-74.
