MS.16. Metadata model, storage, and retrieval for Big Data

Metadata are data that describe other data: for example, definitions of data store entities and attributes, descriptions of the relationships between data, and information used to validate data timeliness, accuracy, and completeness. Traditional relational database systems and data warehouse systems offer metadata repositories and metadata management. Metadata to be used across the enterprise must be clear, complete, and easily interpreted. The Common Warehouse Metamodel (CWM), a well-defined industry standard for metadata, was developed for this purpose and is supported by major commercial RDBMS vendors.

Metadata are important to any data management activity. They are essential for:

  • data classification and understanding of data semantics,
  • data lineage,
  • data sharing among applications,
  • faster querying (locating the right data),
  • maintaining data consistency,
  • data source discovery,
  • data integration and cleaning.

Metadata and metadata management become even more important when dealing with large, complex, heterogeneous, and often multi-source data sets such as Big Data. The main sources of such data include, among others:

  • log files, e.g., database logs, network logs, OS logs,
  • mobile data, e.g., location data,
  • social network data,
  • public data (open data),
  • streaming data, e.g., stock exchange data,
  • IoT data, e.g., sensor data of various kinds.

Big Data practitioners consistently report that about 80% of the effort involved in data pre-processing goes into cleaning. A Gartner review reports that unstructured data represent about 80% of companies' total data. Managing the ever-growing volume of unstructured data effectively gives a company a competitive advantage. Being able to organize and categorize these data effectively will ultimately deliver more intelligence to a business by enabling better and faster decision-making.

The most important change that came with Big Data platforms, such as Hadoop, is that they are ‘schema-less’: data are stored without an accurate description of what they truly are. In other words, Big Data lack the 'native' metadata that relational systems provide. As a consequence, Big Data management systems do not provide an out-of-the-box metadata standard, repositories, or management.
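To make the consequence of schema-less storage concrete, the following minimal Python sketch derives basic metadata (observed field names, value types, and field coverage) from raw JSON records that arrive without any schema. The function name `infer_schema` and the sample sensor records are hypothetical and serve only as an illustration; real metadata discovery would also have to handle nested structures, units, and semantics.

```python
import json
from collections import defaultdict


def infer_schema(records):
    """Collect, per field, the observed type names and the share of records containing it.

    Hypothetical helper for illustration only; it inspects top-level fields.
    """
    types = defaultdict(set)
    counts = defaultdict(int)
    total = 0
    for rec in records:
        total += 1
        for key, value in rec.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    # 'types' is empty when there are no records, so no division by zero occurs.
    return {
        key: {"types": sorted(types[key]), "coverage": counts[key] / total}
        for key in types
    }


# Made-up sample input: two records with inconsistent structure.
raw_lines = [
    '{"sensor": "t1", "temp": 21.5}',
    '{"sensor": "t2", "temp": "n/a", "unit": "C"}',
]
schema = infer_schema(json.loads(line) for line in raw_lines)
for field, info in sorted(schema.items()):
    print(field, info)
```

Even on two records, the inferred description exposes a field with mixed types (`temp`) and a field present in only half of the records (`unit`), which is exactly the kind of information a relational catalog would have supplied up front.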

Research problems related to Big Metadata management include, among others:

  • automatic or semi-automatic metadata discovery and collection from new data sources,
  • a standard for representing metadata (e.g., RDF, Apache HCatalog),
  • efficient storage (building a metadata repository either in a centralized or distributed architecture),
  • cleaning and consistency management,
  • efficient search (e.g., SPARQL) and analysis,
  • metadata visualization (e.g., BPMN).
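
As a rough illustration of the representation, storage, and search problems listed above, the sketch below keeps metadata as subject-predicate-object statements, in the spirit of RDF, in a centralized in-memory repository and answers simple pattern queries. It is not an RDF or SPARQL implementation; the class `MetadataStore`, the predicate names, and the dataset identifiers are all assumptions made for this example.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class MetadataStore:
    """Hypothetical centralized metadata repository holding triples in memory."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        # A set silently ignores duplicate statements.
        self._triples.add(Triple(subject, predicate, obj))

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t
            for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


# Made-up metadata about two data sets.
store = MetadataStore()
store.add("dataset:weblogs", "hasFormat", "text")
store.add("dataset:clicks", "hasFormat", "parquet")
store.add("dataset:clicks", "derivedFrom", "dataset:weblogs")

# Lineage query: which data sets were derived from which sources?
for t in store.match(predicate="derivedFrom"):
    print(f"{t.subject} was derived from {t.obj}")
```

The linear scan in `match` is adequate only for a toy repository; the efficient storage and search problems named in the list concern precisely what replaces it at scale, such as indexing and distribution of the repository.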