CalHHS Data Knowledge Base
Play 5: Metadata

Last updated 4 months ago

Play 5: Establish Your Metadata Repository

For data to be valuable, it must be understandable by data consumers and your internal teams. Metadata, or “data that provides information on other data” [1], provides the technical, business, and security information needed for consumers to work with your data. Metadata also provides input to complete the BUCP Technical Fields and supports analysis for the Specialized Security and Specialized Privacy fields. Metadata contains the following types of information:

  • Business Definitions

  • Technical Information

  • Security Classification

  • Governing Statutes

The term “metadata” may not be familiar to your data consumers. When explaining the purpose of your metadata repository it may be more effective to use common language terms or descriptions such as “data definitions” in place of “metadata”.

This diagram depicts your metadata repository and its consumers across your department’s various teams:

Figure: Metadata Repository Consumers

Typically, metadata is stored in a repository and presented to data consumers in the form of a data dictionary. A data dictionary helps data recipients, such as departments receiving your data, understand the meaning of individual data elements.

Once your metadata repository is populated, you will have a data dictionary for data consumers to understand the meaning of individual data elements and support the following components of BUCPs:

  • The content for the Technical Fields section.

  • List of governing security statutes for analysis to complete the Specialized Security and Privacy Fields.

  • List of governing statutes to understand data sharing requirements and limitations.

Your metadata repository also provides direct benefits to your department that are listed below and elaborated in the Guidebook’s supplemental section Benefits to Your Department from Executing the Plays:

  • Internal data-sharing

  • System changes

  • Impact assessments

  • Level of effort estimates

  • Internal reporting

  • Verifying security controls

  • API and data set reuse

We recommend launching the effort to create your metadata repository in parallel with the data prioritization effort described in Play 4: Prioritize Your Data. Conducting these efforts in parallel provides time for the implementation of a data catalog product or for technology teams to implement an alternate solution using an existing database.

Describing Your Application Program Interfaces

The techniques to improve data-sharing provided by this Guide are relevant to data stored in databases and to data exposed through Application Program Interfaces (APIs). If your data-sharing improvement effort includes APIs, please review the Guidebook’s supplemental section Describing Application Program Interfaces for guidance on related metadata standards and practices.

Play 5.1: Identify Metadata Requirements

Your first step is to identify the type of information to be captured in your dataset’s metadata. Typically, metadata is captured at the field or attribute level and includes the following types of information:

  • Field Name

  • Field Label

  • Data Type

  • Field Definition

  • Valid Inputs (e.g., List of Values, Picklists)

  • Source

  • Security and Privacy Information (e.g., Protected Health Information (PHI) or Personally Identifiable Information (PII), Related Statutes)

  • Related Governing Statutes or Codes

The attached metadata template provides an example of the types of metadata to collect. You can tailor the metadata attributes in this example spreadsheet to your department’s needs.

Collect sufficient metadata to convey the meaning of your data and to support analysis during BUCP approvals. Remove metadata attributes that are not relevant to your datasets to reduce metadata collection and maintenance efforts.
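As an illustration only, the field-level attributes listed above can be sketched as a simple record type that exports to a spreadsheet-friendly CSV. The class name, attribute names, and the sample field `birth_dt` are assumptions made for this sketch and are not taken from the metadata template; adjust them to match the attributes your department keeps.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class FieldMetadata:
    """One data dictionary row; attributes mirror the list in Play 5.1."""
    field_name: str
    field_label: str
    data_type: str
    field_definition: str
    valid_inputs: str = ""        # e.g., a picklist written as "Y;N"
    source: str = ""
    security_privacy: str = ""    # e.g., "PII" or "PHI"
    governing_statutes: str = ""


def to_csv(rows):
    """Return the entries as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(FieldMetadata)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Hypothetical example entry, not a real field from any CalHHS dataset.
example = FieldMetadata(
    field_name="birth_dt",
    field_label="Date of Birth",
    data_type="DATE",
    field_definition="Participant's date of birth.",
    security_privacy="PII",
)
csv_text = to_csv([example])
```

Dropping an attribute from the record type removes its column from the export, which is one way to keep collection effort in line with the attributes you actually need.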

Additional examples of data dictionary templates are listed below:

  • CDT Data Dictionary Template

  • California Open Data Publisher’s Handbook

The next step is to identify the technical platforms that store your datasets. You will use this list during your metadata repository selection process and to identify the technical attributes to include in your metadata repository. These platforms are typically your department’s databases but can also include Application Program Interfaces (APIs) or datasets created by your analytics platforms. Use the dataset inventory you created in Play 2: Identify Your Data to identify technology platforms.

Play 5.2: Implement a Data Catalog and Metadata Repository

Depending on available resources, options to create a data catalog and metadata repository include:

  • Commercially licensed software packages

  • Open-source software packages

  • An existing database used to capture metadata

  • Spreadsheets

Open-source and commercial metadata repository platforms include features for automated data element collection and web-based access. Metadata repository platforms save time in the maintenance of your data dictionary and improve access to metadata. Data catalog platforms also make it easier to manage custom metadata, including references to applicable statutes that govern data sharing. Data catalog platforms also promote the use of your data-sharing artifacts by improving access through web access and search functions.

Use the requirements you created in Play 5.1: Identify Metadata Requirements to evaluate the identified options and make a platform selection. If you do not have budget or staff resources available to establish a data catalog platform, you can get started using existing tools such as your department’s databases, spreadsheets, and a collaboration platform (e.g., Microsoft Teams).

Spreadsheets avoid software license and infrastructure costs; however, they are time-intensive to maintain and use for large datasets. If funds are unavailable to support a commercial platform or host an open-source option, another strategy is to use one of your existing databases as a metadata repository. Most database platforms can store metadata such as descriptions and security classifications. Using your database platform’s metadata features lets you:

  • Integrate ongoing creation and updates of metadata into your development processes.

  • Create reports for data consumers using SQL statements.

  • Establish a source of metadata for Data Dictionary platforms.

Database platform metadata capabilities may only track a limited quantity of metadata fields. You will need to develop an approach to address such limitations. For example, your approach may need to combine security and legal statute references into a single field.
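One possible way to combine security and statute references into a single field is a small, delimited encoding that can be split apart again when reports are generated. The sketch below is an assumption, not a prescribed format: the `SEC=` and `STATUTES=` keys, the delimiters, and the statute names are all hypothetical.

```python
def pack_classification(security, statutes):
    """Combine a security classification and statute references into one string.

    Assumes the values themselves never contain "|" or ";".
    """
    return f"SEC={security}|STATUTES={';'.join(statutes)}"


def unpack_classification(packed):
    """Split the combined string back into (security, [statutes])."""
    parts = dict(item.split("=", 1) for item in packed.split("|"))
    statutes = [s for s in parts.get("STATUTES", "").split(";") if s]
    return parts.get("SEC", ""), statutes


# Placeholder statute names; replace with your governing statutes or codes.
packed = pack_classification("PHI", ["Example Statute 1", "Example Statute 2"])
security, statutes = unpack_classification(packed)
```

Whatever encoding you choose, document it in your data dictionary so that data consumers can read the combined field without the code.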

Later in the Play 5 vignette, we provide an example of using a database as a metadata repository.

Play 5.3: Create a Business Glossary

Data recipients who receive your metadata may have difficulty understanding your department’s use of specific terms and specialized meanings.

At a high level, a business glossary records how your department or program uses jargon, specialized terms, and acronyms. Examples of items to include in your business glossary:

  • Terms specific to your department

  • Specialized use of terms

  • Acronyms specific to your department

You may already have the beginnings of a business glossary in your new employee onboarding library.

Spreadsheets are a simple mechanism to store your department’s business glossary if a data catalog platform is not available. A website improves accessibility if your department receives frequent data requests or actively publishes data on open data portals.
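If your glossary and field definitions are both available as simple tables, a short script can flag acronyms that appear in definitions but have no glossary entry yet. The following is a minimal sketch under that assumption; the glossary entries and field definitions are illustrative, and the acronym pattern (a capital letter followed by capitals or digits) is a simplification that may need tuning for your terms.

```python
import re

# Illustrative glossary; in practice this could be loaded from your spreadsheet.
glossary = {
    "CDW": "The department featured in the Guidebook vignettes.",
    "W2W": "The CDW database featured in the Guidebook vignettes.",
}

# Hypothetical field definitions taken from a data dictionary.
definitions = [
    "Date the participant enrolled in W2W.",
    "Indicates whether the record contains PHI.",
]


def undefined_acronyms(texts, known_terms):
    """Return acronyms found in the texts that are missing from the glossary."""
    found = set()
    for text in texts:
        # Match whole words that start with a capital letter and continue
        # with at least one more capital letter or digit (e.g., PHI, W2W).
        found.update(re.findall(r"\b[A-Z][A-Z0-9]+\b", text))
    return sorted(found - set(known_terms))


missing = undefined_acronyms(definitions, glossary)
```

Here `missing` lists the terms that still need a glossary entry before the metadata package is sent to data recipients.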

Play 5.3 Vignette: CDW Creates a Metadata Repository and Business Glossary

If your department has elected to implement a metadata repository tool or you are electing to use a spreadsheet, feel free to skip this vignette.

The CDW lacks the funds to purchase a data catalog for its metadata repository tool. When the data-sharing improvement effort is successful, Sally will try to secure funding for a data cataloging tool in the next fiscal year.

Carlos’ background as a database administrator helps solve the funding problem. He suggests using tools CDW already owns, including the department’s database and collaboration platform, Microsoft Teams. First, the CDW will use a combination of spreadsheets and description fields in their databases to collect and store their metadata. These tools provide automation that helps scale the effort for the rest of the department’s datasets. Carlos’ metadata management process is depicted below:

Figure: The CDW Creates Its Metadata Repository Technology Platform

The high-level metadata collection and storage steps are as follows:

  1. Create the Initial Data Element Inventory: Carlos runs a Structured Query Language (SQL) statement to extract the list of fields from the W2W database in Comma-Separated Values (CSV) format.

    1. Carlos places the file on the CDW’s collaboration platform, Microsoft Teams. The collaboration platform lets the CDW staff add metadata to a central file.

  2. Collect the W2W Metadata: The CDW staff add supporting metadata, including:

    1. Program SMEs add functional descriptions to the W2W fields to describe their business meaning.

    2. The CDW security team reviews the elements and their definitions to create security classifications for the W2W data. The security team creates a set of standard classifications to help identify the security requirements for shared data, including:

      1. Personally Identifiable Information (PII)

      2. Protected Health Information (PHI)

      3. CDW Institution Codes Related to Data Security

  3. Load the Metadata: The CDW database administrators create SQL-based Data Definition Language (DDL) statements that load the definitions into the W2W database. Carlos makes an Excel macro to automatically generate the DDL from the spreadsheet used to collect metadata.

  4. Extract the Metadata: The W2W development team adds a step to their process to include definitions whenever they make database changes. This process keeps the W2W metadata complete and current. As the W2W database is modified, Carlos runs the SQL statement from Step 1 to extract the metadata.

  5. Publish Metadata: Carlos places the resulting CSV in Microsoft Teams. The W2W database is the source of truth for metadata. The metadata spreadsheet in Teams is read-only to ensure all metadata changes are performed through the CDW development processes. He sends an email to notify data consumers that a new version of the data dictionary is available.
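Steps 1 through 3 above can be sketched in a few lines of code. This is an illustration under several assumptions, not Carlos’ actual tooling: the vignette uses an Excel macro to generate the DDL, while this sketch swaps in Python; an in-memory SQLite database with a hypothetical `participant` table stands in for the W2W database; and the generated statements use the `COMMENT ON COLUMN` syntax supported by platforms such as PostgreSQL and Oracle, which may differ from what your database expects.

```python
import csv
import io
import sqlite3

# Stand-in for the W2W database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participant (participant_id INTEGER, birth_dt TEXT)")


def extract_field_inventory(connection):
    """Step 1: list every table and column as (table, column, data_type)."""
    inventory = []
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk).
        # Table names come from the database catalog, not from user input.
        for column in connection.execute(f"PRAGMA table_info({table})"):
            inventory.append((table, column[1], column[2]))
    return inventory


def inventory_to_csv(inventory):
    """Step 1 output: CSV text with an empty definition column for SMEs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["table_name", "field_name", "data_type", "definition"])
    for table, column, data_type in inventory:
        writer.writerow([table, column, data_type, ""])
    return buffer.getvalue()


def generate_comment_ddl(rows):
    """Step 3: build one COMMENT ON COLUMN statement per described field."""
    statements = []
    for row in rows:
        if not row["definition"]:
            continue  # skip fields that have no definition yet
        escaped = row["definition"].replace("'", "''")  # escape single quotes
        statements.append(
            f"COMMENT ON COLUMN {row['table_name']}.{row['field_name']} IS '{escaped}';"
        )
    return statements


inventory = extract_field_inventory(conn)
csv_text = inventory_to_csv(inventory)

# Simulate Step 2: a program SME adds a definition to one field.
collected = list(csv.DictReader(io.StringIO(csv_text)))
collected[1]["definition"] = "Participant's date of birth."
ddl = generate_comment_ddl(collected)
```

On a production database, the inventory query would read the platform’s own catalog views rather than SQLite’s `PRAGMA`, and the generated statements would be reviewed by the database administrators before being run.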

Carlos documents the plan for review with the W2W development and database teams. Sally schedules a meeting with the W2W development lead and its DBA to review the plan. During the meeting, Sally and Carlos use the following agenda:

  • Departmental Objectives

  • Executive Support

  • Benefits to the W2W Technical Teams

  • Metadata Repository Approach

  • Support Needs During Metadata Collection

The W2W IT team lets Carlos know that they recently upgraded their database version. The new database version provides standard metadata fields for descriptions and security classifications.

Sally, Carlos, and the W2W development team agree on the approach and scope of a proof-of-concept to test its viability and gather lessons learned to improve the process. The group also agrees on how the metadata collection team will engage IT resources. The agreement helps ensure the IT staff can complete their daily job duties while supporting the metadata collection effort.

Sally decides to use a simple spreadsheet stored in Microsoft Teams for the CDW business glossary. The business glossary will be incrementally populated as the CDW data team populates the metadata repository.

The Guidebook supplemental section, Example Metadata Repository Tools, provides examples of data catalog tools, including those available (e.g., AWS, Azure, Google) via the State of California Off-Premises Cloud contract vehicles.

A business glossary extends the context and understanding of your data definitions by defining the business concepts required to understand them. An example of a web-based business glossary is available here. Your business glossary can be a worksheet in your metadata package transmitted to data recipients.
