PII detection

The PII detection feature of the Sidra platform evaluates Personally Identifiable Information (PII) as it is ingested into the platform, helping to ensure that sensitive data is properly managed and governed. As defined by the General Data Protection Regulation (GDPR) and the U.S. government, PII is any data that can be used to distinguish or trace an individual's identity, such as name, SSN, or biometric information, either alone or combined with other identifiers such as date of birth or place of birth (quasi-identifiers or linkable information).

PII Detection is only available for Data Intake Processes. This option is not available for data intakes via landing zone, such as the ingestion of CSV files.

When does PII detection happen?

PII detection happens at the Data Intake Process level. When the trigger configured for a Data Intake Process fires, an orchestrator pipeline coordinates the metadata and data extraction execution. For PII detection, an additional step after data extraction checks whether the PII detection process needs to run on any of the Entities associated with that Data Intake Process.

Two settings control whether PII is detected for a single Entity or for all Entities assigned to a DIP:

  1. One at Data Intake Process (DIP) level, in the additionalProperties of the DIP. The fields are called piiDetectionEnabled and language. piiDetectionEnabled is a flag that configures whether PII is detected for that DIP. language configures the language that the Presidio analyzer uses to analyze the data. For example:

      {"piiDetectionEnabled":"true","language":"spanish"}
    
  2. Another setting, at Entity level, in the additionalProperties field of the Entity. For example:

      {"piiDetectionEnabled":"true","language":"spanish"}
    

From these two settings, the Orchestrator first asks the Sidra API for the effective set of Assets on which the PII detection script must run (see "Details about the PII detection and classification of PII" below).

This endpoint will do an effective settings calculation as described below:

  • If the setting piiDetectionEnabled is empty at Entity level, the setting will default to the value at DIP level.
  • If the setting piiDetectionEnabled is set to false at Entity level, PII will not be detected for that Entity, regardless of the value at DIP level.
  • If the setting piiDetectionEnabled is set to true at Entity level, PII will be detected for that Entity, regardless of the value at DIP level.
  • If the setting piiDetectionEnabled is empty at DIP level, the setting will default to false, so PII will not be detected for the Assets of the Entities in that DIP unless an Entity sets it to true.
  • If the setting piiDetectionEnabled is empty at Entity level and is set to false at DIP level, PII will not be detected for the Assets of that Entity.
  • If the setting piiDetectionEnabled is empty at Entity level and is set to true at DIP level, PII will be detected for all the Entities in that DIP that leave the setting empty.
  • If the setting piiDetectionEnabled is empty at both DIP and Entity level, PII will not be detected for any of the Assets.

The endpoint will return the list of effective Asset IDs for which we need to detect PII.
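The effective settings rules above can be sketched in a few lines of Python. This is only an illustration, not the Sidra API implementation: the function names (parse_flag, effective_pii_enabled, effective_asset_ids) and the shape of the Entity records (a dictionary with additionalProperties and assetIds keys) are assumptions made for the example.

```python
from typing import Dict, List, Optional


def parse_flag(raw: Optional[str]) -> Optional[bool]:
    """Convert a "true"/"false" string into a bool; empty or missing stays None."""
    if raw is None or not str(raw).strip():
        return None
    return str(raw).strip().lower() == "true"


def effective_pii_enabled(dip_props: Dict[str, str], entity_props: Dict[str, str]) -> bool:
    """Entity-level value wins; otherwise fall back to the DIP; otherwise False."""
    entity_flag = parse_flag(entity_props.get("piiDetectionEnabled"))
    if entity_flag is not None:
        return entity_flag
    return parse_flag(dip_props.get("piiDetectionEnabled")) is True


def effective_asset_ids(dip_props: Dict[str, str], entities: List[dict]) -> List[int]:
    """Collect the Asset IDs of every Entity whose effective setting is True.

    The Entity record shape used here is hypothetical.
    """
    asset_ids: List[int] = []
    for entity in entities:
        if effective_pii_enabled(dip_props, entity.get("additionalProperties", {})):
            asset_ids.extend(entity.get("assetIds", []))
    return asset_ids
```

With a DIP set to {"piiDetectionEnabled": "true"}, an Entity with empty settings inherits true, while an Entity with {"piiDetectionEnabled": "false"} is excluded from the returned list.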

Sidra implements the PII detection via a notebook called piidetection.py. The orchestrator ADF pipeline will not call the PII detection notebook if there is no Asset for which to calculate the PII.

A similar set of rules as above is used to calculate the effective language for PII detection.

  • If the setting language is not specified at either level (DIP or Entity), the default value is English.
  • If the setting language is not specified at Entity level, the setting will default to the DIP level.
  • If the setting language is empty at Entity level and at DIP level, the language will default to english.
  • If the setting language is empty at Entity level and the setting language is spanish at DIP level, the language will be spanish.
  • If the setting language is specified at Entity level, it will take that value, regardless of the value at DIP level.

English and Spanish are installed by default when the cluster is created. No other languages are supported at this time.
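The language rules above follow the same Entity-then-DIP fallback and can be sketched as follows. The function name effective_language and the constant DEFAULT_LANGUAGE are hypothetical, and the lowercase values mirror the configuration example shown earlier ("spanish"); how Sidra handles a language other than the two installed ones is not covered here.

```python
from typing import Dict

DEFAULT_LANGUAGE = "english"  # used when neither level specifies a language


def effective_language(dip_props: Dict[str, str], entity_props: Dict[str, str]) -> str:
    """Return the Entity-level language if set, else the DIP-level one, else the default."""
    for props in (entity_props, dip_props):  # Entity takes precedence over DIP
        value = (props.get("language") or "").strip().lower()
        if value:
            return value
    return DEFAULT_LANGUAGE
```

For instance, effective_language({"language": "spanish"}, {}) resolves to "spanish", and an Entity that sets its own language overrides whatever the DIP specifies.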

Additionally, PII detection in Sidra relies on an Application Insights service to track possible errors during the execution of the PII detection notebook, enabling notifications for both successful and failed executions.

Details about the PII detection and classification of PII

For PII detection, Sidra uses a Microsoft library called Presidio. Presidio, as stated by the project owners, helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.

Sidra uses only Presidio's functionality for detecting and classifying PII, which is described below.

The following classification categories are detected by Presidio when run on any Entity or Attribute:

  1. Credit card
  2. Crypto (a crypto wallet number)
  3. Date time
  4. Domain name
  5. Email address
  6. IBAN code
  7. IP address
  8. NRP (nationality, religious or political group)
  9. Location
  10. Person
  11. Phone number
  12. Medical license
  13. US bank number
  14. US driver license
  15. US ITIN (Individual Taxpayer Identification Number)
  16. US passport
  17. US SSN (Social Security Number)
  18. UK NHS (UK National Health Service number)
  19. NIF (Spanish tax identification number)
  20. FIN/NRIC (Singapore National Registration Identification Card)
  21. AU ABN (Australian Business Number)
  22. AU ACN (Australian Company Number)
  23. AU TFN (Australian Tax File Number)
  24. AU medicare (Australian medicare number)

Metadata tagging process for PII information

While the Attributes are scanned column by column, tags (of type Autogen in Sidra) are assigned whenever PII is found. Two types of tags are assigned to Attributes:

  • A tag to flag whether an Attribute is classified as PII (tag = PII) or not (no tag).
  • A tag or set of tags specifying the type of PII detected, e.g. Person, or Phone Number.

The Entity to which each Attribute belongs will have this same tag assigned (PII if PII is detected in any of its Attributes, and no tag otherwise).

Also, at the Entity level, an effective union of the classification tags of underlying Attributes is generated and assigned as tags. For example:

  • If an Entity has only two Attributes, one classified as Person and the other as Phone Number, then the parent Entity will have these tags: PII, Person and Phone Number.

At the Provider level:

  • The process only adds the PII tag, flagging whether any underlying Entity contains PII.
  • No PII type classification tags are applied at Provider level.
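The tagging rules for Attributes, Entities and Providers can be illustrated with a small sketch. It is not Sidra's code: the function names and the dictionary shapes (Attribute name mapped to the set of detected PII types) are assumptions chosen for the example.

```python
from typing import Dict, Set

PII_TAG = "PII"


def attribute_tags(detected_types: Set[str]) -> Set[str]:
    """An Attribute gets the PII tag plus one tag per detected type, or no tags at all."""
    return ({PII_TAG} | detected_types) if detected_types else set()


def entity_tags(attributes: Dict[str, Set[str]]) -> Set[str]:
    """An Entity receives the union of the tags of its Attributes."""
    tags: Set[str] = set()
    for detected_types in attributes.values():
        tags |= attribute_tags(detected_types)
    return tags


def provider_tags(entities: Dict[str, Dict[str, Set[str]]]) -> Set[str]:
    """A Provider only receives the PII flag, never the type classification tags."""
    has_pii = any(PII_TAG in entity_tags(attrs) for attrs in entities.values())
    return {PII_TAG} if has_pii else set()
```

Using the example above, an Entity whose Attributes are classified as {"Person"} and {"Phone Number"} ends up with the tags PII, Person and Phone Number, while its Provider carries only PII.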