18. Appendix F: Publicly Available Data

A critical step in reviewing data for public release is considering what other publicly available data could be combined with the newly released data to identify the individuals it represents. This section highlights specific publicly available datasets that could be combined with CalHHS data in ways that increase that risk.

Common kinds of data with personal information include: real estate records, individual licensing databases (MD, RN, contractors, lawyers, etc.), marriage records, news (and other) media reports, commercially available databases (data brokers, marketing), court documents, etc.

18.1 Vital Records Data

Other common datasets for programs to be aware of are the publicly available electronic birth and death indices from Vital Records, as specified in Health and Safety Code section 102230(b).

The following are provided in the birth record indices:

  • First, middle, and last name

  • Sex

  • Date of birth

  • Place of birth

The following are provided in the death record indices:

  • First, middle, and last name

  • Sex

  • Date of birth

  • Place of birth

  • Date of death

  • Place of death

  • Father’s last name
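The linkage risk these indices create can be illustrated with a minimal sketch. It assumes a released dataset that carries no names but shares sex, date of birth, and place of birth with a public index. All records, field names, and the helper unique_matches are hypothetical and are not part of the DDG.

```python
from collections import defaultdict

# Hypothetical public index rows (fields mirror the birth index list above).
public_index = [
    {"name": "Alex Example", "sex": "F", "dob": "1980-01-02", "place": "Alpine"},
    {"name": "Blair Sample", "sex": "M", "dob": "1980-01-02", "place": "Alpine"},
    {"name": "Casey Placeholder", "sex": "M", "dob": "1975-06-30", "place": "Modoc"},
    {"name": "Drew Placeholder", "sex": "M", "dob": "1975-06-30", "place": "Modoc"},
]

# Hypothetical released records: no names, but the same quasi-identifiers.
released = [
    {"record_id": 1, "sex": "F", "dob": "1980-01-02", "place": "Alpine"},
    {"record_id": 2, "sex": "M", "dob": "1975-06-30", "place": "Modoc"},
]

# Fields assumed to be shared by both datasets.
KEY_FIELDS = ("sex", "dob", "place")


def key_of(row):
    """Build the comparison key from the shared fields."""
    return tuple(row[field] for field in KEY_FIELDS)


def unique_matches(released_rows, index_rows):
    """Map each released record to a name when exactly one index row matches."""
    names_by_key = defaultdict(list)
    for row in index_rows:
        names_by_key[key_of(row)].append(row["name"])

    matches = {}
    for row in released_rows:
        names = names_by_key.get(key_of(row), [])
        if len(names) == 1:  # a single candidate means the record is re-identified
            matches[row["record_id"]] = names[0]
    return matches


matches = unique_matches(released, public_index)
print(matches)
```

In this illustration, record 1 matches exactly one index entry and is therefore linked to a name, while record 2 matches two entries and remains ambiguous. Coarsening a shared field, for example releasing year of birth instead of full date of birth, would typically enlarge the set of candidates for each record.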

Other potential sources of publicly available data to consider are informational certified copies of birth and death certificates. In California, anyone can obtain an informational certified copy of a birth or death certificate; these copies are clearly marked as un-authorized copies that cannot be used to verify identity. In practice, they are difficult to use as a dataset for the following reasons:

  • Certified copies of birth and death certificates must be obtained on an individual basis, and the requester must be able to identify the record. In other words, an individual cannot simply ask for a stack of certificates for purposes of creating a dataset.

  • Certified copies are issued on specialized banknote paper, not in electronic format, which creates a problem of scale when trying to create a dataset.

  • There is a $29 fee for each certified copy of a birth certificate and $24 fee for a certified copy of a death certificate, which also creates a problem of scale when trying to create a dataset.

  • Certified copies are meant for individual use. A request for a large number of certificates may prompt vital records staff to investigate why so many certificates were requested at once.
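The problem of scale created by the per-copy fees can be made concrete with a short calculation. This is a sketch only: the fee amounts come from the list above, while the function name total_fee and the request sizes are illustrative assumptions.

```python
# Fee amounts are taken from the text above (dollars per certified copy).
BIRTH_COPY_FEE = 29
DEATH_COPY_FEE = 24


def total_fee(birth_copies, death_copies):
    """Total cost in dollars for the requested number of certified copies."""
    return birth_copies * BIRTH_COPY_FEE + death_copies * DEATH_COPY_FEE


# Illustrative request sizes (assumed, not from the source).
for copies in (1, 1_000, 100_000):
    print(copies, total_fee(copies, 0), total_fee(0, copies))
```

Because the cost grows linearly with the number of records, assembling even a modest dataset this way would cost tens of thousands of dollars in fees alone, before counting the effort of identifying each record and transcribing paper copies.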

18.2 CalHHS Open Data Portal

As additional datasets are added to the Open Data Portal, programs need to take that information into account when considering potential risk for any given dataset. The CalHHS Open Data Workgroup will be providing easier access to lists of both the data currently on the portal and the datasets planned for addition to the portal. Although the portal is significant, with over 100 datasets, it is not an exhaustive inventory of released data, because the Public Records Act (PRA) allows an extremely broad range of information to be released sporadically. As a result, programs can account for some, but not all, previously released data. CalHHS departments have a duty of due diligence in the de-identification process to consider published identifiable data, published de-identified data, and de-identified data that will soon be published. Additional information on balancing transparency and privacy includes the Berkman Klein Center for Internet & Society’s “Open Data Privacy Playbook”.

Listed below are examples of individual records or documents that the Department of Rehabilitation makes available to the public:

  • Fair Hearing Decisions include the appellant’s initials and possibly other information, depending on the issue the appellant presents for hearing, such as sex, disability, employment, education, vocational rehabilitation services, etc.; and

  • Monthly Operating Reports and information therefrom include names of licensees and financial information regarding the licensees’ operation of vending facilities in the Business Enterprises Program for the Blind. To be eligible for this program, individuals must be legally blind.

18.3 Public Census and Demographic Information

The Demographic Research Unit (DRU) of the California Department of Finance is designated as the single official source of demographic data for state planning and budgeting. The DRU produces the following products, which serve as the basis for understanding the population characteristics and distributions that frequently make up the denominators in the review of datasets:

  • Estimates - Official population estimates of the state, counties and cities produced by the Demographic Research Unit for state planning and budgeting.

  • Projections - Forecasts of population, births and public school enrollment at the state and county level produced by the Demographic Research Unit.

  • State Census Data Center - Demographic, social, economic, migration, and housing data from the decennial censuses, the American Community Survey, the Current Population Survey, and other special and periodic surveys.
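How population figures serve as denominators in a dataset review can be sketched as follows. The county names, counts, and populations are illustrative values, not DRU estimates, and the two threshold constants are hypothetical parameters chosen for the example; a program would substitute the criteria defined in its own review procedure.

```python
# Illustrative values only; not real DRU estimates or program counts.
population_estimate = {"County A": 1_200, "County B": 950_000}
event_count = {"County A": 3, "County B": 410}

# Hypothetical review thresholds, assumed for this sketch.
MIN_NUMERATOR = 11
MIN_DENOMINATOR = 20_000


def review(county):
    """Compute a rate per 100,000 and flag small numerators or denominators."""
    count = event_count[county]
    population = population_estimate[county]
    return {
        "rate_per_100k": round(count / population * 100_000, 1),
        "needs_review": count < MIN_NUMERATOR or population < MIN_DENOMINATOR,
    }


for county in population_estimate:
    print(county, review(county))
```

A small population denominator means that each counted event represents a larger share of the people who could be in the data, which is why a cell such as County A in this illustration would be flagged for closer review.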

18.4 Commonly Shared Information

With the growth of social media, people frequently share information through tools such as Facebook, LinkedIn, Instagram, TikTok, YouTube, X (formerly Twitter), dating apps, and AI platforms such as ChatGPT, OpenAI, and Google Cloud Vertex AI. While it would be impossible to take into account all information that people make public about themselves, there is an expectation that a certain amount of information is likely to be in the public domain based on information individuals frequently provide about themselves. Examples of such information include wedding dates, birth dates, education (high school, college), and professional certifications.

18.5 Geographic Information

Geographic information is particularly suited to being combined with other geographic information, given the relatively standardized way such data is coded (latitude, longitude, county, etc.). With the use of mapping tools, information from various sources can be combined in what is called a “mashup.” “A mashup, in web development, is a web page, or web application, that uses content from more than one source to create a single new service displayed in a single graphical interface. For example, you could combine the addresses and photographs of your library branches with a Google map to create a map mashup. The term implies easy, fast integration, frequently using open application programming interfaces (open API) and data sources to produce enriched results that were not necessarily the original reason for producing the raw source data.”

18.6 Artificial Intelligence

With the rapid advancement in the use of health care data for artificial intelligence, machine learning tools, and data-generative models, there may be an increased risk of reidentification. These technologies enable the processing of large volumes of complex, unstructured raw data, but issues of liability and accountability remain unclear and unaddressed. Continued risk assessment is an important aspect of the guidelines presented in the DDG.
