MS.16. Metadata model, storage, and retrieval for Big Data
Metadata are data that describe other data: for example, they define data store entities and attributes, identify the relationships between data, and support validation of data timeliness, accuracy, and completeness. Traditional relational database systems and data warehouse systems offer metadata repositories and metadata management. Metadata to be used across the enterprise must be clear, complete, and easily interpreted. A well-defined industry standard for metadata, the Common Warehouse Metamodel (CWM), was developed for this purpose and is supported by major commercial RDBMS vendors. Metadata are important to any data management activity. They are essential for the purpose of:
Metadata and metadata management become even more important when dealing with large, complex, heterogeneous, and often multi-sourced data sets such as Big Data. The main sources of this data include, among others:
Big Data practitioners consistently report that 80% of the effort involved in data pre-processing is spent on cleaning. A Gartner review reports that unstructured data represents about 80% of a company's total data. Managing the ever-growing volume of unstructured data effectively gives a company a competitive advantage. Being able to organize and categorize these data will ultimately deliver more intelligence to a business by enabling better and faster decision-making. The most important change that came with Big Data platforms such as Hadoop is that they are ‘schema-less’. This essentially means that the data are stored without an accurate description of what they truly are. Unfortunately, Big Data therefore lacks the 'native' metadata that relational and data warehouse systems provide. As a consequence, Big Data management systems do not provide out-of-the-box metadata standards, repositories, or management. Research problems related to Big Metadata management include, among others: