CalHHS Data Knowledge Base

5. Types of Reporting

CalHHS programs develop a wide range of information based on different types of data. This is reflected in the various categories shown on the entry page for the CalHHS Open Data Portal, which include:

  • Diseases and Conditions

  • Facilities and Services

  • Healthcare

  • Workforce

  • Environmental

  • Demographics

  • Resources

Depending on the type of reporting, the data may or may not be connected to personal characteristics that create a potential risk of identifying individuals.

Variables

The following list of variables is important to consider when preparing data for release.

  • Age

  • Sex

  • Race

  • Ethnicity

  • Language Spoken

  • Location of Residence

  • Education Status

  • Financial Status

  • Number of events

  • Location of event

  • Time period of event

  • Provider of event

As stated previously, variables that are personal characteristics may be used to determine a person’s identity or attributes. When these characteristics are used to confirm the identity of an individual in a publicly released data set, a disclosure of that individual’s information has occurred. Uniqueness, both in the released data and in the population, is the quality that distinguishes one person from another, and it is directly related to the re-identification of individuals in aggregate data. Disclosure risk is a concern when released data reveal characteristics that are unique in both the released data and the underlying population. The risk of re-identifying an individual or group of individuals increases when unique or rare characteristics are “highly visible”, or otherwise available without any special or privileged knowledge. Unique or rare personal characteristics (e.g., height above 7 feet) or information that isolates individuals to small demographic subgroups (e.g., American Indian Tribal membership) increase the likelihood that someone can correctly attribute information in the released data to an individual or group of individuals.

Variables that are event characteristics are often associated with publicly available information.

Therefore, risk increases when personal characteristics are combined with event characteristics at sufficient granularity. One could argue that if no more than two personal characteristics are combined with event characteristics, then the risk will be low regardless of the granularity of the variables. This hypothesis will need to be tested using various population frequencies to quantify the uniqueness of the combination of variables, both in the potential data to be released and in the underlying population.
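
The uniqueness test described above can be sketched in a few lines of Python. This is a minimal illustration, not a CalHHS procedure: the function names, the variable names (age_group, sex, county), and the records are all made up for the example, and "unique" is taken to mean a combination that occurs exactly once.

```python
from collections import Counter


def combination_counts(records, variables):
    """Count how many records share each combination of values."""
    return Counter(tuple(record[v] for v in variables) for record in records)


def unique_in_both(released, population, variables):
    """Return combinations that occur exactly once in the released data
    and exactly once in the underlying population."""
    released_counts = combination_counts(released, variables)
    population_counts = combination_counts(population, variables)
    return sorted(
        combo
        for combo, count in released_counts.items()
        if count == 1 and population_counts[combo] == 1
    )


# Hypothetical records, invented purely for illustration.
population = [
    {"age_group": "0-17", "sex": "F", "county": "A"},
    {"age_group": "0-17", "sex": "F", "county": "A"},
    {"age_group": "18-64", "sex": "M", "county": "A"},
    {"age_group": "65+", "sex": "F", "county": "B"},
    {"age_group": "18-64", "sex": "M", "county": "B"},
    {"age_group": "18-64", "sex": "M", "county": "B"},
]
released = [population[0], population[2], population[3]]

# Each added variable increases granularity and may expose more unique people.
for variables in (("sex",), ("age_group", "sex"), ("age_group", "sex", "county")):
    print(variables, unique_in_both(released, population, variables))
```

In this toy data, no combination is unique when only sex is reported, one becomes unique when age group is added, and two are unique once county is included. A real test would use actual population frequencies rather than a hand-built list.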

Survey Data

Survey data, often collected for research purposes, are collected differently than administrative data, and these differences should be considered in decisions about security, confidentiality, and data release.

Administrative data sources (non-survey data), such as vital statistics (e.g., births and deaths), healthcare administrative data (e.g., Medi-Cal utilization and hospital discharges), and reportable disease surveillance data (e.g., measles cases), contain data for all persons in the population with the specific characteristic or other data elements of interest. Most of the discussion in this document pertains to these types of data.

On the other hand, surveys (e.g., the California Health Interview Survey) are designed to take a sample of the population and collect data on characteristics of persons in the sample, with the intent of generalizing the results to gain knowledge about the whole population.

The sampling methodology for any given survey is generally designed to maximize the sample size with the available resources while making the sample as unbiased (representative) as possible. These sampling procedures, which are a fundamental part of surveys, generally change the key considerations for protection of security and confidentiality. In particular, the main “population denominator” for strict confidentiality considerations remains the whole target population, not the sampled population. However, if persons have special or external knowledge of the sampled population (e.g., that a family member participated in the survey), further considerations may be required. Also, it is in the context of surveys that issues of statistical reliability often arise; these are distinct from confidentiality issues but often come up in related discussions.

Of particular note, small numbers (e.g., fewer than 11) of individuals reported in surveys do not generally lead to the same security/confidentiality concerns as in population-wide data and, as such, should be treated differently than is described in the Publication Scoring Criteria and elsewhere. In this case, a level of de-identification occurs through the sampling methodology itself.
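
One way to see why a small survey cell differs from a small population-wide cell is to scale the sample count up to the target population. The sketch below assumes simple random sampling with equal weights, which real surveys rarely use; the function name and the numbers are hypothetical and chosen only to illustrate the denominator argument.

```python
def expansion_estimate(sample_count, sample_size, population_size):
    """Estimate how many people in the target population a sample cell
    represents, assuming simple random sampling with equal weights."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    if not 0 <= sample_count <= sample_size:
        raise ValueError("sample_count must be between 0 and sample_size")
    return sample_count * population_size / sample_size


# Hypothetical survey: 1,000 respondents drawn from a population of 1,000,000.
# A cell of 5 respondents stands for roughly 5,000 people in the population.
print(expansion_estimate(5, 1_000, 1_000_000))
```

The estimate does not address the case noted above where someone already knows that a particular person took part in the survey.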

Budgets and Fiscal Estimates

Budget reporting may include both actual and projected amounts. Projected amounts, although developed with models based on historical actuals, reflect activities that have not yet occurred and therefore do not require an assessment for de-identification. Actual amounts do need to be assessed. When budgets reflect caseloads but do not include personal characteristics of the individuals in the caseloads, the budgets reflect data in the Providers and Health and Service Utilization Data circles of the Figure 2 Venn Diagram and do not need further assessment. However, if the actual amounts report caseloads by personal characteristics, such as age, sex, race, or ethnicity, then the budget reporting needs to be assessed for de-identification.
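
The budget rules in this paragraph reduce to a short decision. The following Python sketch is one reading of them, with a hypothetical BudgetReport type and field names; it is not an official screening tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    """Hypothetical description of a budget report for screening purposes."""
    is_projected: bool
    # Personal characteristics used to break down caseloads, if any.
    personal_characteristics: tuple = ()


def needs_deidentification_assessment(report):
    """Return True when the report should go through a de-identification
    assessment, following the rules in the paragraph above."""
    if report.is_projected:
        # Projected amounts describe activities that have not yet occurred.
        return False
    # Actuals need further assessment only when caseloads are broken down
    # by personal characteristics.
    return bool(report.personal_characteristics)


print(needs_deidentification_assessment(BudgetReport(is_projected=True)))
print(needs_deidentification_assessment(BudgetReport(is_projected=False)))
print(needs_deidentification_assessment(
    BudgetReport(is_projected=False, personal_characteristics=("age", "sex"))
))
```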

Facilities, Service Locations and Providers

Many CalHHS programs oversee, license, accredit or certify various businesses, providers, facilities and service locations. As such, the programs report on various metrics, including characteristics of the entity and the services provided by the entity.

  1. Characteristics of the entity are typically public information, such as location, type of service provided, type of license and the license status.

  2. Services provided by the entity will typically need to be assessed to see if the reporting includes personal characteristics about the individuals receiving the services. Several examples are shown below.

    1. Reporting number of cases of mental illness treated by each facility – if the facility is a general acute care facility then the reporting of the number of cases does not tell you about the individuals receiving the services.

    2. Reporting number of cases of mental illness treated by each facility – if the facility is a children’s hospital then the reporting of the number of cases does tell you about the individuals receiving the services.

    3. Reporting number of psychotropic medications prescribed by a general psychiatrist does not tell you about the patients receiving the medications.

    4. Reporting number of psychotropic medications prescribed by a general psychiatrist to include the number of medications prescribed by the age group, sex or race/ethnicity of the patients receiving the medications does tell you about the patients receiving the medications.

In examples 1 and 3 above, assessment for de-identification is not necessary because the reporting includes no characteristics of the individuals receiving the services. However, in examples 2 and 4, the inclusion of personal characteristics, which may be quasi-identifiers, especially when combined with geographical information about the provider, does require an assessment for de-identification.
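
The four examples above differ in two ways: whether the type of entity itself implies something about the people served, and whether the report breaks counts down by personal characteristics. A small Python sketch makes that explicit. The function name and labels are hypothetical, and deciding what an entity type implies remains a judgment call for the reviewer.

```python
def revealed_characteristics(implied_by_entity=(), reported_breakdowns=()):
    """Collect the personal characteristics a service report reveals,
    either through the entity type or through explicit breakdowns.
    An empty result suggests no de-identification assessment is needed."""
    return sorted(set(implied_by_entity) | set(reported_breakdowns))


examples = {
    "1. cases at a general acute care facility": revealed_characteristics(),
    "2. cases at a children's hospital": revealed_characteristics(
        implied_by_entity=["age"]
    ),
    "3. prescriptions by a general psychiatrist": revealed_characteristics(),
    "4. prescriptions broken down by patient group": revealed_characteristics(
        reported_breakdowns=["age group", "sex", "race/ethnicity"]
    ),
}

for label, revealed in examples.items():
    verdict = "assess" if revealed else "no assessment needed"
    print(f"{label}: {verdict} {revealed}")
```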

Mandated Reporting

CalHHS programs are required to provide public reporting based on federal and California statutes and regulations, court orders, and stipulated judgments, as well as by various funders. Although reporting may be mandated, unless the law expressly requires reporting of personal characteristics, publicly reported data must still be de-identified to protect against the release of identifying or personal information, which may violate federal or state law.


Last updated 4 months ago