6.2 Assessing Potential Risk – Publication Scoring Criteria
Last updated
Was this helpful?
Last updated
Was this helpful?
The Publication Scoring Criteria is provided as an example of a method that meets the requirements of Step 3 in the Data Assessment for Public Release Procedure. It is a tool to assess and quantify potential risk for re-identification of de-identified data based on two identification risks: size of potential population and variable specificity. The Publication Scoring Criteria is used to assess the need to suppress small cells as a result of a small numerator, small denominator, or both small numerator and small denominator where a small numerator is less than 11 and a small denominator is less than 20,001. That is why the Publication Scoring Criteria takes into account both numerator (e.g., Events) and denominator (e.g., Geography) variables.
The Publication Scoring Criteria is based on a framework that has been in use by the Illinois Department of Public Health, Illinois Center for Health Statistics. Various other methods have been used to assess risk and the presence of sensitive or small cells. Public health has a long history of public provision of data and many methods have been used. Further discussion of other methods used to assess tables for sensitive or small cells is found in .
This section provides a more detailed review of the criteria that make up the Publication Scoring Criteria.
Events
1000+ events in a specified population
+2
100-999 events
+3
11-99 events
+5
<11 events
+7
The Events score represents a score for the numerator. The Events category will be scored based on the smallest cell size in the table.
The lowest value for the Events variable (<11 events) which has the highest score (+7) was chosen to be consistent with the Numerator Condition. The Publication Scoring Criteria is used when the Numerator-Denominator Condition is not met.
Therefore, when the Numerator Condition is not met with respect to the Events variable, a high score is given.
Sex
Male or Female
+1
Sex is commonly represented as two categories: male and female. Because the number of categories is small, just knowing a person’s reported sex is not enough to pose a risk of identifying that person. The score of +1 reflects that inclusion of the variable in a table introduces increased specificity; however, that it only has two potential values gives it a low risk.
In cases where an additional stratification of other/unknown is used for sex, the reviewer will need to assess potential for increased risk based on the inclusion of the additional stratification.
Although the variable “Sex” is often called “Gender”, it should not be confused with the variables “sexual orientation” and “gender identity.” According to definitions from the American Psychological Association, “Sexual orientation refers to the sex of those to whom one is sexually and romantically attracted” and “Gender identity refers to “”
Age Range
>10-year age range
+2
6-10 year age range
+3
3-5 year age range
+5
1-2 year age range
+7
Age ranges receive a higher score for smaller ranges of years due to the increased risk for identification.
Of note, the HIPAA Safe Harbor method specifically identifies the following as an identifier: “All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.” Although dates are included in the Safe Harbor list, age (<90 years old) is not. The risk score to age ranges reflects the two components of the scoring criteria: size of the potential population and the variable specificity.
Race Group
White, Asian, Black or African American
+2
White, Asian, Black or African American, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, Mixed
+3
Detailed Race
+4
Ethnicity
Hispanic or Latino - yes or no
+2
Detailed ethnicity
+4
Race/Ethnicity Combined
This applies when race and ethnicity are collected in a single data field
White, Asian, Black or African American, Hispanic or Latino
+2
White, Asian, Black or African American, Hispanic or Latino, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, Mixed
+3
Detailed Race/Ethnicity
+4
Race and Ethnicity are collected in a number of different ways on the different state and federal data collection tools. At the federal level, starting in 1997, Office of Management and Budget required federal agencies to use a minimum of five race categories:
White,
Black or African American,
American Indian or Alaska Native,
Asian, and
Native Hawaiian or Other Pacific Islander.
Ethnicity asks individuals if they are Hispanic or Latino. Additional specificity for Ethnicity may be requested.
:
40% White
13% Asian
6% Black or African American
<1% American Indian
<1% Native Hawaiian and other Pacific Islander
37% Hispanic or Latino
Based on these percentages, Race Group at the level of White, Asian and Black or African American is given a score of +2 because the Asian and Black or African American groups are relatively small. If the reporting is for the OMB standard categories, White, Asian, Black or African American, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, and Mixed, then the score is +3. If more specificity is requested for Race Groups the score is +4 because the other groups are much smaller at less than 1% of the overall population. Similarly, for the Hispanic or Latino Ethnicity the score is a +2 for a yes or no answer, whereas more detailed ethnicity results in a higher score of +4.
For Race/Ethnicity Combined fields, the scoring is +2 for the groups White, Asian, Black or African American, Hispanic or Latino. The score is +3 for the OMB standard categories with Hispanic or Latino, White, Asian, Black or African American, American Indian or Alaska Native, Native Hawaiian or Other Pacific Islander, and Mixed. The score is +4 for more detailed categories.
Race and Ethnicity demographics may vary significantly based on geography as well as based on particular conditions. So although the scoring criteria presents a guideline for assessing risk, the population frequencies for the specific geography and/or condition should also be taken into account. Appendix C provides the county specific demographics produced by Department of Finance for reference.
Three scenarios are presented to help demonstrate how to use the three race group and ethnicity scoring criteria.
Complete Cross-Tabulation between Race and Ethnicity Consider this table:
Black
50
250
300
White
200
1000
1200
Asian
5
95
100
Total
255
1345
1600
This is the most granular you can get, so you would add both the Race and Ethnicity score to the overall total for your scoring metric (i.e. greatest risk for re- identification). Note that you can replace “Ethnicity” with “Sex” and the principle still applies—you have a cross-tabulated table of Race and Sex.
Race and Ethnicity merged into exclusive categories
Usually the algorithm is that Ethnicity trumps Race when categorizing. This results in a Hispanic category, with the other categories effectively becoming “Non-Hispanic Race.” So the above table would become:
Black 250
White 1000
Asian 95
Hispanic 255
This is when you would use the combined Race/Ethnicity score in the guidelines for your scoring metric.
Third Scenario – No Interaction between Race and Ethnicity If you did this, the above table would become:
Black 300
White 1200
Asian 100
Hispanic 255
Note that this is the only scenario where you can’t add up all the categories to get a total population. Also you would need to run the scoring metric separately for your Race-only and Ethnicity-only datasets. Like the First Scenario, you can replace Ethnicity with Sex and it still makes sense—you now have two tables, one displaying Race and the other Sex, with no interaction between the two—which lessens the Small Cell Size problem.
Language Spoken
English, Spanish, Other Language
+2
Detailed Language
+4
Total
7,835,022
100
English
4,135,060
52.78
Spanish
2,840,758
36.26
Vietnamese
141,289
1.8
Cantonese
85,750
1.09
Armenian
65,096
0.83
Russian
41,252
0.53
Tagalog
39,361
0.5
Mandarin
35,330
0.45
Hmong
33,594
0.43
Korean
27,814
0.35
Farsi
26,123
0.33
Arabic
23,929
0.31
Cambodian
20,476
0.26
Lao
8,355
0.11
Other Chinese
7,483
0.1
Mien
3,803
0.05
Sign Language
2,637
0.03
Thai
1,940
0.02
Portuguese
1,666
0.02
Ilocano
1,661
0.02
Samoan
1,306
0.02
Japanese
1,215
0.02
French
653
0.01
Turkish
376
0
Hebrew
367
0
Polish
275
0
Italian
252
0
Other and unspecified
287,201
3.67
Based on the above numbers, the majority of individuals speak English or Spanish. Therefore if the table includes “English”, “Spanish”, and “Other Language” as the categories for “Language Spoken”, then the score is +2 which is comparable to reporting Hispanic or Latino Ethnicity as a “Yes or No”.
As noted for Race and Ethnicity demographics, language spoken demographics may vary significantly based on geography as well as based on particular conditions. So although the scoring criteria presents a guideline for assessing risk, the population frequencies for the specific geography and/or condition should also be taken into account.
If more specificity for Language Spoken is being requested with respect to reporting on the other languages in the table above, the request will need to be reviewed on a case by case basis. The additional review is necessary given the variability of language spoken by different populations or geographies and the consideration for potential increased risk of identification.
Time – Reporting Period
5 years aggregated
-5
2-4 years aggregated
-3
1 year (e.g., 2001)
0
Bi-Annual
+3
Quarterly
+4
Monthly
+5
Many reports are published based on the calendar year. However, the combination of years of data is an excellent way to provide increased aggregation in a way that allows for more specificity elsewhere, such as county identifiers. Inversely, the smaller the time period in the data, the closer the time period comes to approximating a date. Thus monthly reported data has a high score of +5.
Of note, the HIPAA Safe Harbor method list includes “All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older.” This is a potential identifier when in combination with other information. This potential as an identifier influences the higher scores in the Publication Scoring Criteria as the time period for aggregation gets smaller.
The “0” value for this variable is set at one year as this is the criteria for Safe Harbor under the HIPAA de-identification standard.
Residence Geography*
State or geography with population >2,000,000
-5
Population 1,000,001 - 2,000,000
-3
Population 560,001 - 1,000,000
-1
Population 250,000 - 560,000
0
Population 100,000 - 250,000
+1
Population 50,001 - 100,000
+3
Population 20,001 - 50,000
+4
Population ≤ 20,000
+5
Service Geography*
State or geography with population >2,000,000
-5
Population 1,000,001 - 2,000,000
-4
Population 560,001 - 1,000,000
-3
Population 250,000 - 560,000
-1
Population of reporting region 20,001 - 250,000
0
Population of reporting region ≤20,000
+1
Address (Street and ZIP)
+3
* If the geography of the reporting is based on the residence of the individual, use the “Residence Geography”. If the geography of the reporting is based on the location of service, use the “Service Geography”.
The Geography score, while it may or may not represent the denominator of the table, does provide a reference to the base population about which the reporting is occurring. This will often be reflected in the title of the table if a statewide table.
Otherwise the geography may be represented in the rows or columns. There are two different scoring sets based on whether the geography reporting is based on the residence of the individual to which the information applies or to the service location.
The scores are higher for geography related to residence address because so much information is publicly available about individuals and their address of residence.
For large populations greater than 560,000, which is equivalent to the size of a state, there is a negative score because the size of the denominator masks the individual. The number 560,000 was chosen as a cut-off because this is the size of the smallest state (Wyoming). We chose to use the cut-off at the smallest state’s population because state level reporting is not listed as one of the 18 identifiers the HIPAA Safe Harbor method.
The scores for the service geography are lower because clients can generally come from diverse locations for services. Although people often seek services or have health conditions close to their homes, they may also travel extensive distances.
Reviewers do need to make sure that there are not constraints associated with services that would mean the service geography and resident geography are the same. For example, if a program publishes service utilization by county and the county services can only be used by county residents, then the service utilization by county is also the county of residence. Scoring should be based on the criteria that results in the highest score and thus the highest risk.
Service Geography includes a level of detail that is identified as “Address (Street and ZIP).” This deals with reporting by provider (hospital, clinic, provider office, etc.) Provider addresses are public information and are public at the street address level. A given provider will tend to have a standard catchment area or the geographic boundaries from which most patients come from. for hospitals.
While this addresses where most patients or clients come from, patients or clients may also come from outside the catchment area. For that reason this does not score as high as the more detailed geography under Residence Geography.
Variable Interactions
Only Events (minimum of 5), Time, and Geography (Residence or Service)
-5
Only Events (minimum of 3), Time, and Geography (Residence or Service)
-3
Only Events (no minimum), Time, and Geography (Residence or Service)
0
Events, Time, and Geography (Residence or Service) + 1 variable
+1
Events, Time, and Geography (Residence or Service) + 2 variables
+2
Events, Time, and Geography (Residence or Service) + 3 variables
+3
This criteria specifically addresses the interaction of the variables in a given data presentation and requires the analyst to identify dependent as opposed to independent variables. This criteria is used with respect to dependent variables. This is demonstrated in the two tables below.
Illustration A: Dependent Variables
In this example the Event (counts of Disease A) is shown for Males who are also 0- 17 years old or Males who are also 18-25 years old. In this case Sex and Age are dependent because the stratification for each variable is stacked. This commonly occurs in pivot tables.
Year 1
6
10
5
8
Year 2
8
14
3
20
Illustration B: Independent Variables
In this example the Event (counts of Disease A) is for Males or Females which is shown side by side to a table with ages 0-17 years old or 18-25 years old. In this case Sex and Age are independent because the stratification for each variable is not stacked. Although the two variables Sex and Age are shown in the same table, they are presented independently of each other. While you can compile the data in Example B from Example A, the reverse is not true.
Year 1
16
13
11
18
Year 2
22
23
11
34
This criteria is structured to have less impact if personal characteristics outside of time and geography are excluded and more impact if multiple personal characteristics are included. This provides for a subtraction of points if the only variables presented are the events (numerator), time and geography and an addition of points for including more variables in a given presentation. With respect to the subtraction of points, the score is based on the minimum value for the Events variable. For example, if the smallest value for the Events is 5 or more, then the score would be -5. However, if the smallest value for the Events is 2, then the score would be 0.
The minimum value for Events of 3 (Only Events (minimum of 3), Time, and Geography (Residence or Service)) is used as a threshold to address concern for pre-existing knowledge by users about individuals. For example, if an entity knows who one person is with disease A and the count for Events is “1” or “2”, then the entity could identify the person they know of or the person they know of plus information about the other person. . For this reason, the threshold of 5 for Events is also given. The threshold of 5 is frequently used in public health reporting regarding various events.
In contrast, if additional demographic variables are added, then the risk increases significantly. For example, for Events, Time and Geography (Residence or Service) with three additional variables, a table would show how many individuals are female by age group by race for a given time period and geography. . For this reason, additional points are added because of the inclusion of multiple dependent variables.
Variables other than those specified in the Publication Scoring Criteria can be released only after an additional review by the department’s Statistical Expert on a case by case basis. A guideline that can be considered in performing this review is the following scoring.
Other Variables
<5 groups or categories
+3
5-9 groups
+5
10+ groups
+7
Considerations include not just the number of groups, but also the characteristics of the variables. Consider whether the variable represents an aggregation (Diagnosis Related Groups) or a specific item (ICD-10 Code). Also consider the availability of the variable to the public when also associated with other information, in particular with variables that may be personal characteristics.
Additional information is provided from San Francisco County on their website on the page "".
Language spoken is captured in a variety of data systems to support individuals in receiving services in the language they speak. The following table is taken from the report: . This frequency distribution was used to determine the groupings for the scoring above.