4.3 Assess Potential Risk
This step requires the use of a documented method to assess the risk that small numerators or small denominators may result in conditions that put individuals at risk of being re-identified.
Assessment of potential risk for a given dataset must take into account a range of contributing considerations, including the particular characteristics of the dataset being released. For example, if the potential values for a specific personal characteristic, such as race, result in many small numbers in dataset A but not in dataset B, then the risk may be high for dataset A and low for dataset B even when both datasets group the personal characteristic into the same categories. For this reason, each department or program may set different values for risk based on the underlying distribution of these variables in the datasets of interest.
There are many methods used to assess potential risk. Many of the methods that are in use throughout the country are described in the various references provided in Section 11. While each department will document the method(s) chosen for use, all the CalHHS departments are directed to use the following description of the Publication Scoring Criteria as an example and as a method to assess potential risk.
4.3.1 Publication Scoring Criteria: Example of a tool to assess potential risk
The Publication Scoring Criteria is used to identify the presence of small values that are considered sensitive in order to facilitate the assessment of potential risk. The Publication Scoring Criteria combines a number of conditions that increase the risk of a given data table and allows the department to evaluate those risks in combination with each other. The variables included in the Publication Scoring Criteria are those routinely used to publish data, but the list is not all-inclusive. Explanations for the risk values assigned to variables can be found in Section 16, Appendix D. Section 16.2.14 addresses how to account for other variables that are not included in the Publication Scoring Criteria.
A variable is a symbol representing an unknown numerical or categorical value in an equation or table. A given variable may have different ranges assigned to it. Ranges assigned to the variable may be defined in many ways which may increase or decrease the risk of identification of an individual represented in the table. This is seen in the Publication Scoring Criteria in that ranges for variables which will produce smaller groupings have a higher score.
The Publication Scoring Criteria in Figure 6 assigns a score to two identification risks: the size of the potential population and variable specificity. The Publication Scoring Criteria is used to assess the need to perform statistical masking as a result of a small numerator, a small denominator, or both. It takes into account variables associated with numerators, such as Events, and variables associated with denominators, such as Geography or Insurance Coverage.
This method requires a score less than or equal to 12 for the data table to be released without additional masking of the data. Any score over 12 will require the use of statistical masking methods described in Section 4.4 or documentation regarding the specific characteristics of the dataset that mitigate the risk.
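The threshold rule above can be illustrated with a short calculation. The following is a minimal sketch, not an official tool: the function names, the dictionary layout, and the example table are assumptions made for illustration, while the per-variable scores come from Figure 6 and the threshold of 12 comes from the text above.

```python
# Hypothetical sketch of the release threshold; names are illustrative only.
RELEASE_THRESHOLD = 12  # a total of 12 or less may be released without additional masking


def total_score(variable_scores: dict[str, int]) -> int:
    """Sum the per-variable scores looked up in Figure 6."""
    return sum(variable_scores.values())


def needs_masking(variable_scores: dict[str, int]) -> bool:
    """True when the total exceeds 12, so masking or documented mitigation is required."""
    return total_score(variable_scores) > RELEASE_THRESHOLD


# Assumed example: annual county-level counts broken out by sex.
example = {
    "events": 5,                 # smallest cell has 11-99 events
    "sex": 1,                    # Male or Female
    "residence_geography": 1,    # population 100,001 - 250,000
    "time": 0,                   # 1 year
    "variable_interactions": 1,  # Events, Time, and Population + 1 variable
}
print(total_score(example), needs_masking(example))
```

As noted later in this section, the score is one input to the release decision and does not replace the broader review.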
When identifying the score for each variable, use the highest scoring criterion that applies. For example, if a table had age groups of 0 to 11 years, 12 to 14 years, and 15 to 18 years, then the score for the “age range” variable would be +5 because the smallest age range is 12 to 14, which is an age range of three years.
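The age range example can be sketched in code. This is an illustrative sketch under stated assumptions: the function name `age_range_score` is hypothetical, each group is given as an inclusive (youngest, oldest) pair so that 12 to 14 counts as three years, and open-ended groups such as "65 and over" are not handled. The score bands are those listed for Age Range in Figure 6.

```python
def age_range_score(age_groups: list[tuple[int, int]]) -> int:
    """Score the Age Range variable from the narrowest age group in the table.

    Each group is an inclusive (youngest_age, oldest_age) pair, so (12, 14)
    spans 3 years. The narrowest group drives the score because it is the
    highest scoring criterion present.
    """
    narrowest = min(oldest - youngest + 1 for youngest, oldest in age_groups)
    if narrowest <= 2:
        return 7   # 1-2 year age range
    if narrowest <= 5:
        return 5   # 3-5 year age range
    if narrowest <= 10:
        return 3   # 6-10 year age range
    if narrowest <= 29:
        return 2   # 11-29 year age range
    return 1       # >29-year age range


# The example from the text: 0-11, 12-14, and 15-18 years.
print(age_range_score([(0, 11), (12, 14), (15, 18)]))
```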
If a variable is more granular than the most granular level listed for it, use the highest score listed. For example, if the variable “Time” has a frequency of “weekly,” then the score would be +5, which is the maximum score and corresponds to the most granular level (monthly) listed for that variable in the Publication Scoring Criteria.
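The rule for levels finer than those listed can be sketched for the Time variable. This is a hedged illustration: the function name `time_score` and the label strings are assumptions, and the set of finer-than-monthly labels ("weekly", "daily") is an illustrative guess at what a department might encounter. The scores themselves are those listed under Time - Reporting Period in Figure 6.

```python
# Time - Reporting Period scores from Figure 6 (label strings are assumed).
TIME_SCORES = {
    "5 years": -5,
    "2-4 years": -3,
    "1 year": 0,
    "bi-annual": 3,
    "quarterly": 4,
    "monthly": 5,
}

# Assumed labels for reporting periods finer than any level listed in Figure 6.
FINER_THAN_LISTED = {"weekly", "daily"}


def time_score(period: str) -> int:
    """Return the Time score, capping finer-than-monthly periods at the maximum."""
    key = period.strip().lower()
    if key in FINER_THAN_LISTED:
        return max(TIME_SCORES.values())
    return TIME_SCORES[key]  # raises KeyError for labels this sketch does not know


print(time_score("weekly"))
```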
In addition to assessing the granularity of each variable, the interaction of the variables is also important. As discussed later in Section 4.4, decreasing the granularity and decreasing the number of variables are both techniques for increasing the values for the numerators. The final criteria in Figure 6 are those for Variable Interactions. These provide for a subtraction of points if the only variables presented are the events (numerator), time, and geography, and an addition of points for including more variables in a given presentation. With respect to the subtraction of points, the score is based on the minimum value for the Events variable. For example, if the smallest value for the Events is 5 or more, then the score would be -5. However, if the smallest value for the Events is 2, then the score would be 0. This is discussed in more detail in Section 16.2.
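The Variable Interactions criteria can be expressed as a small function. This is a sketch under assumptions: the function name and parameters are hypothetical, and because Figure 6 lists scores for at most three additional variables, the sketch raises an error beyond that point instead of guessing a value.

```python
def interaction_score(min_events: int, extra_variables: int) -> int:
    """Variable Interactions score from Figure 6.

    min_events: smallest Events value appearing in the table.
    extra_variables: number of variables beyond Events, Time, and Population.
    """
    if extra_variables < 0:
        raise ValueError("extra_variables cannot be negative")
    if extra_variables == 0:
        # Only Events, Time, and Population: points are subtracted based on
        # the minimum Events value.
        if min_events >= 5:
            return -5
        if min_events >= 3:
            return -3
        return 0
    listed = {1: 1, 2: 2, 3: 4}
    if extra_variables not in listed:
        # Assumption: no score is invented for cases Figure 6 does not list.
        raise ValueError("Figure 6 lists scores for at most 3 additional variables")
    return listed[extra_variables]


print(interaction_score(min_events=5, extra_variables=0))
print(interaction_score(min_events=2, extra_variables=0))
```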
In assessing risk, scoring can be part of the justification to release or not release data but should not by itself be an absolute gateway to the release of data. The review must take into account additional considerations including those that are discussed in this document in addition to the scoring.
Figure 6: Publication Scoring Criteria Tables by Variable
Events (Numerator)
| Characteristic | Score |
| --- | --- |
| 1000+ events in a specified population | +2 |
| 100-999 events | +3 |
| 11-99 events | +5 |
| <11 events | +7 |
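The Events bands can be read as a simple threshold lookup. The sketch below is illustrative only: the function name `events_score` is hypothetical, and it assumes the score is driven by the smallest event count in the table, following the instruction to use the highest scoring criterion.

```python
def events_score(smallest_count: int) -> int:
    """Events (Numerator) score from Figure 6, based on the smallest cell count."""
    if smallest_count < 0:
        raise ValueError("event counts cannot be negative")
    if smallest_count >= 1000:
        return 2   # 1000+ events
    if smallest_count >= 100:
        return 3   # 100-999 events
    if smallest_count >= 11:
        return 5   # 11-99 events
    return 7       # <11 events


print(events_score(10), events_score(11), events_score(1000))
```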
Age Range
| Characteristic | Score |
| --- | --- |
| >29-year age range | +1 |
| 11-29 year age range | +2 |
| 6-10 year age range | +3 |
| 3-5 year age range | +5 |
| 1-2 year age range | +7 |
Race or Race/Ethnicity
The following two tables can be used both for data that complies with current OMB standards (which combine race and ethnicity into one variable) and for data that complies with the previous 1997-2024 OMB standards (which separated race and ethnicity into two variables), as the risk assessment is the same for both.
Race or Race/Ethnicity Combined
| Characteristic | Score |
| --- | --- |
| White, Asian, Black or African American, Hispanic or Latino, Middle Eastern or North African | +2 |
| White, Asian, Black or African American, Hispanic or Latino, Middle Eastern or North African, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, Mixed | +3 |
Detailed Race or Race/Ethnicity Combined
| Characteristic | Examples | Score |
| --- | --- | --- |
| Detailed Race or Race/Ethnicity Combined with Population >4,000,000 | e.g., Mexican | +1 |
| Detailed Race or Race/Ethnicity Combined with Population 300,001 – 4,000,000 | e.g., Chinese, Filipino, German, Asian Indian, Italian, Korean, Salvadoran, Guatemalan | +2 |
| Detailed Race or Race/Ethnicity Combined with Population 100,001 – 300,000 | e.g., Japanese, Armenian, Iranian, Aztec, Portuguese, Taiwanese, Hmong, Puerto Rican, Peruvian | +3 |
| Detailed Race or Race/Ethnicity Combined with Population 20,001 – 100,000 | e.g., Cambodian, Dutch, Pakistani, Egyptian, Thai, Maya, Afghan, Nigerian, Indonesian, Fijian, Native Hawaiian, Jamaican, Cuban, Colombian, Argentinean | +5 |
| Detailed Race or Race/Ethnicity Combined with Population ≤20,000 | e.g., Tongan, Chamorro, Bangladeshi, Sri Lankan, Brazilian, Mixtec, Kenyan, Zapotec, Malaysian, Belizean, Chumash, Sudanese, Pomo, Inca, Pipil | +7 |
Ethnicity
Use the following two tables to assess risk for the ethnicity variable, which will be present in data that follows pre-2024 OMB standards.
Ethnicity Only
| Characteristic | Score |
| --- | --- |
| Hispanic or Latino - yes or no | +1 |
Detailed Ethnicity
| Characteristic | Examples | Score |
| --- | --- | --- |
| Detailed Ethnicity with Population >4,000,000 | e.g., Mexican | +1 |
| Detailed Ethnicity with Population 300,001 – 4,000,000 | e.g., Salvadoran, Guatemalan, Central American, South American | +2 |
| Detailed Ethnicity with Population 100,001 – 300,000 | e.g., Puerto Rican, Spaniard, Peruvian, Nicaraguan, Honduran | +3 |
| Detailed Ethnicity with Population 20,001 – 100,000 | e.g., Cuban, Colombian, Argentinean, Dominican, Panamanian | +5 |
| Detailed Ethnicity with Population ≤20,000 | e.g., Bolivian, Uruguayan, Paraguayan | +7 |
Language Spoken
| Characteristic | Examples | Score |
| --- | --- | --- |
| English, Spanish, Other Language | | +1 |
| Detailed Language with Population 300,001 – 4,000,000 | e.g., Chinese, Tagalog, Vietnamese, Korean | +2 |
| Detailed Language with Population 100,001 – 300,000 | e.g., Persian, Hindi, Arabic, Russian, Japanese, French | +3 |
| Detailed Language with Population 20,001 – 100,000 | e.g., German, Portuguese, Hmong, Hebrew, Bengali, Polish | +5 |
| Detailed Language with Population ≤20,000 | e.g., Haitian, Navajo | +7 |
Sex, Sexual Orientation, and Gender Identity
| Variable | Characteristic | Score |
| --- | --- | --- |
| Sex | Male or Female | +1 |
| Sexual Orientation | Straight, Gay or Lesbian, Bisexual, Asexual | +2 |
| Gender Identity | Man/Male, Woman/Female, Transgender or Non-Binary | +3 |
| Gender Identity | Man/Male, Woman/Female, disaggregation of Transgender/Non-Binary category into more specific identities (e.g., Genderqueer, Two-Spirit, etc.) | +5 |

Intersex
| Variable | Characteristic | Score |
| --- | --- | --- |
| Intersex (asked as separate question) | Yes or No | +2 |
| Intersex (combined with Sex question) | Male, Female, Intersex | +2 |
Immigration Status
| Characteristic | Score |
| --- | --- |
| U.S. Citizen, Foreign Born (combines Naturalized Citizen and Noncitizen) | +1 |
| U.S. Citizen, Naturalized Citizen, Noncitizen | +1 |
| Detailed Immigration Status with Disaggregation of Noncitizen Statuses - Refer to High-Risk Populations (Section 5.6.2) | N/A |
Insurance Coverage
Use the following table when reporting by insurance coverage, such as by health plan. See Appendix I for more details on scoring scenarios involving the overlap of Insurance Coverage, Expected Payer/Public Assistance and Means-Tested Programs, and Geography. Below are three key points that summarize all the scenarios:
If the data is ONLY related to Residence or Service Geography, then DO NOT USE Insurance Coverage or Means-Tested Tables.
Means-Tested Programs—Only add interaction if enrollment in the Public Assistance program is 10 million or fewer people. No interaction is needed for Medi-Cal as the current enrollment is approximately 14 million, which exceeds 10 million.
If the number of members enrolled in Insurance Coverage is less than the population of the geographic subdivision, then use the Insurance Table. If the number of members enrolled in Insurance Coverage is greater than or equal to the population of the geographic subdivision, then use the Geography Table.
| Characteristic | Score |
| --- | --- |
| Coverage with >2,000,000 members | -5 |
| Coverage with 1,000,001 - 2,000,000 members | -3 |
| Coverage with 560,001 - 1,000,000 members | -1 |
| Coverage with 250,001 - 560,000 members | 0 |
| Coverage with 100,001 - 250,000 members | +1 |
| Coverage with 50,001 - 100,000 members | +3 |
| Coverage with 20,001 - 50,000 members | +4 |
| Coverage with ≤ 20,000 members | +5 |
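The choice between the Insurance Coverage and Geography tables, and the Insurance Coverage lookup itself, can be sketched as follows. This is an illustrative sketch, not a prescribed implementation: the function and constant names are hypothetical, and the membership bands are those listed in the Insurance Coverage table above. Python's standard `bisect` module is used to find the band that contains a given membership count.

```python
from bisect import bisect_left

# Inclusive upper bound of each membership band, smallest band first.
_UPPER_BOUNDS = [20_000, 50_000, 100_000, 250_000, 560_000, 1_000_000, 2_000_000]
# Score for each band; the final entry covers coverage with >2,000,000 members.
_SCORES = [5, 4, 3, 1, 0, -1, -3, -5]


def insurance_coverage_score(members: int) -> int:
    """Insurance Coverage score for a plan with the given number of members."""
    if members < 0:
        raise ValueError("membership cannot be negative")
    return _SCORES[bisect_left(_UPPER_BOUNDS, members)]


def population_table(members: int, geography_population: int) -> str:
    """Pick the table to score against, following the third key point above."""
    return "insurance" if members < geography_population else "geography"


# Assumed example: a plan with 150,000 members reported within a county of 1,000,000.
print(population_table(150_000, 1_000_000), insurance_coverage_score(150_000))
```

Appendix I remains the reference for the full set of overlap scenarios; the sketch covers only the membership-versus-population comparison.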
Expected Payer/Public Assistance and Means-Tested Programs
| Characteristic | Score |
| --- | --- |
| Enrollment > 10,000,000 people | +0 |
| Enrollment 4,000,001 – 10,000,000 | +1 |
| Enrollment 300,001 – 4,000,000 | +2 |
| Enrollment 100,001 – 300,000 | +3 |
| Enrollment 20,001 – 100,000 | +5 |
| Enrollment ≤20,000 | +7 |
Geography
If the level of reporting is best described by the geography of the individual or service, use one of the following two tables. Specifically, if the geography of the reporting is based on the residence of the individual, use the “Residence Geography” table. If the geography of the reporting is based on the location of service, use the “Service Geography” table.
Residence Geography
| Characteristic | Score |
| --- | --- |
| State or geography with population >2,000,000 | -5 |
| Population 1,000,001 - 2,000,000 | -3 |
| Population 560,001 - 1,000,000 | -1 |
| Population 250,001 - 560,000 | 0 |
| Population 100,001 - 250,000 | +1 |
| Population 50,001 - 100,000 | +3 |
| Population 20,001 - 50,000 | +4 |
| Population 4,001 - 20,000 | +5 |
| Population ≤ 4,000 | +7 |
Service Geography
| Characteristic | Score |
| --- | --- |
| State or geography with population >2,000,000 | -5 |
| Population 1,000,001 - 2,000,000 | -4 |
| Population 560,001 - 1,000,000 | -3 |
| Population 250,001 - 560,000 | -1 |
| Population of reporting region 20,001 - 250,000 | 0 |
| Population of reporting region ≤20,000 | +1 |
| Address (Street and ZIP) | +3 |
| Address in rural area | +5 |
| Address in frontier area | +7 |
Time - Reporting Period
| Characteristic | Score |
| --- | --- |
| 5 years aggregated | -5 |
| 2-4 years aggregated | -3 |
| 1 year (e.g., 2001) | 0 |
| Bi-Annual | +3 |
| Quarterly | +4 |
| Monthly | +5 |
Variable Interactions
| Characteristic | Score |
| --- | --- |
| Only Events (minimum of 5), Time, and Population (Residence/Service Geography or Insurance Coverage) | -5 |
| Only Events (minimum of 3), Time, and Population (Residence/Service Geography or Insurance Coverage) | -3 |
| Only Events (no minimum), Time, and Population (Residence/Service Geography or Insurance Coverage) | 0 |
| Events, Time, and Population (Residence/Service Geography or Insurance Coverage) + 1 variable | +1 |
| Events, Time, and Population (Residence/Service Geography or Insurance Coverage) + 2 variables | +2 |
| Events, Time, and Population (Residence/Service Geography or Insurance Coverage) + 3 variables | +4 |