Introducing a Python deidentification module

De-identifying Sensitive Information in Text with Python and spaCy

In today’s data-driven world, protecting personally identifiable information (PII) is crucial. Whether you’re working in healthcare, legal, or any field handling sensitive data, the ability to automatically remove or mask personal information from text documents is invaluable. Today, I’m excited to introduce a new Python tool that makes this process both simple and reliable.

The Challenge of De-identification

De-identifying text isn’t as straightforward as it might seem. Searching for known names and replacing them isn’t enough. You also need to:

Identify names you’ve never seen before
Handle pronouns that might reveal gender
Keep the text coherent after replacements
Ensure no information is missed due to context
Deal with the complexities of natural language

Enter the Deidentification Tool

The deidentification Python module combines spaCy’s Named Entity Recognition (NER) with custom pronoun handling to provide thorough text de-identification. What sets this tool apart is how it processes text: it applies replacements from the end of the text backward and repeats its passes until no new entities are detected.

Key Features

Smart Name Detection: Uses spaCy’s transformer model to identify person names with high accuracy
Gender Neutralization: Replaces gender-specific pronouns with neutral alternatives
Backward Processing: Applies replacements from the end of the text to the beginning so that character positions stay valid
Multiple Passes: Iteratively processes text to catch initially missed entities
Format Options: Supports both plain text and HTML output with visual highlighting
GPU Acceleration: Takes advantage of GPU processing through spaCy when available
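
To make the Gender Neutralization feature concrete, here is a minimal regex-based sketch. It is not the module's actual implementation, which I have not reproduced here; the `PRONOUNS` table and the `neutralize_pronouns` function are hypothetical names, and the placeholders simply mirror the style of the sample output shown later in this post. The word "her" is left out on purpose because it is ambiguous without part-of-speech information.

```python
import re

# Simplified, hypothetical mapping (an assumption for illustration only).
# "her" is deliberately omitted: it can be an object ("saw her") or a
# possessive ("her book"), and telling them apart needs part-of-speech tags.
PRONOUNS = {
    "he": "HE/SHE",
    "she": "HE/SHE",
    "him": "HIM/HER",
    "his": "HIS/HER",
    "himself": "HIMSELF/HERSELF",
    "herself": "HIMSELF/HERSELF",
}

# \b word boundaries stop "he" from matching inside words such as "the".
PATTERN = re.compile(r"\b(" + "|".join(PRONOUNS) + r")\b", re.IGNORECASE)


def neutralize_pronouns(text: str) -> str:
    """Replace each listed pronoun with its neutral placeholder."""
    return PATTERN.sub(lambda m: PRONOUNS[m.group(1).lower()], text)


print(neutralize_pronouns("He found himself lost in thought; his mind drifted."))
# prints: HE/SHE found HIMSELF/HERSELF lost in thought; HIS/HER mind drifted.
```

A lookup table like this is easy to audit, but a pure regex cannot resolve ambiguous words, which is one reason a linguistic pipeline such as spaCy is useful for this job.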

The Magic Behind the Scenes

Let’s look at what makes this tool effective:

Backward Processing Strategy

One of the most interesting aspects of this tool is its backward processing strategy. Instead of replacing identified entities from start to finish (which can cause position shifts), it:

Identifies all entities and pronouns
Sorts them by position in reverse order
Performs replacements from end to beginning
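
The three steps above can be sketched in plain Python, independent of spaCy. This is an illustration of the general technique, not the module's source code: `replace_spans` is a hypothetical helper, the spans are hand-written character offsets standing in for detector output, and the sketch assumes the spans do not overlap.

```python
def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) spans, working from the end of the text."""
    # Sorting by start offset in descending order means every edit happens
    # to the right of the spans still waiting to be processed, so their
    # offsets remain valid even when a replacement changes the text length.
    for start, end, replacement in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith said he was fine."
# Hypothetical detector output: offsets into the ORIGINAL text.
spans = [(0, 10, "PERSON"), (16, 18, "HE/SHE")]
print(replace_spans(text, spans))
# prints: PERSON said HE/SHE was fine.
```

If the same spans were applied front to back, replacing the 10-character "John Smith" with the 6-character "PERSON" would shift everything after it by four characters, and the offsets for "he" would point at the wrong text.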

This approach ensures that each replacement’s position remains valid: every earlier replacement was made further along in the text, so it cannot shift the positions that are still waiting to be processed. Here’s a simple example:

from deidentification import Deidentification

text="""\
John Smith was a quiet man who preferred spending his days alone. One \
afternoon, he found himself lost in thought, wondering if he had made \
the right decisions in life. His mind drifted back to the choices that \
had led him to where he was, and he realized he had never really given \
much thought to the future. Despite the uncertainty, John was content \
with the man he had become, trusting in himself to navigate whatever came next.\
"""

deidentifier = Deidentification()
result = deidentifier.deidentify(text)
print(result)

# for HTML output:
# result = deidentifier.deidentify_with_wrapped_html(text)

# printed output of the plain-text call above:
"""
PERSON was a quiet man who preferred spending HIS/HER days alone. One afternoon,
HE/SHE found HIMSELF/HERSELF lost in thought, wondering if HE/SHE had made the
right decisions in life. HIS/HER mind drifted back to the choices that had led
HIM/HER to where HE/SHE was, and HE/SHE realized HE/SHE had never really given
much thought to the future. Despite the uncertainty, PERSON was content with the
man HE/SHE had become, trusting in HIMSELF/HERSELF to navigate whatever came next.
"""

HTML Output Demo

[Image: deidentification HTML demo]
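
For readers who cannot see the image, here is a rough sketch of what visual highlighting could look like. The markup is my assumption: I use a `<mark>` tag and a hypothetical `highlight` function, and the module's real HTML may differ. The sketch takes already de-identified text, escapes it, and wraps each placeholder.

```python
import html
import re

# Placeholder tokens taken from the sample output in this post.
PLACEHOLDERS = ["HIMSELF/HERSELF", "HE/SHE", "HIS/HER", "HIM/HER", "PERSON"]
PLACEHOLDER_PATTERN = re.compile("|".join(re.escape(p) for p in PLACEHOLDERS))


def highlight(deidentified: str) -> str:
    """Escape the text for HTML, then wrap each placeholder in <mark> tags."""
    # Escape first so characters from the source text cannot inject markup.
    # html.escape leaves "/" alone, so the placeholders still match afterwards.
    escaped = html.escape(deidentified)
    return PLACEHOLDER_PATTERN.sub(lambda m: f"<mark>{m.group(0)}</mark>", escaped)


print(highlight("PERSON said <hello> & HE/SHE left."))
# prints: <mark>PERSON</mark> said &lt;hello&gt; &amp; <mark>HE/SHE</mark> left.
```

One limitation of matching on placeholder text is that a literal "PERSON" in the source document would be highlighted too; tracking replacement positions avoids that.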

Iterative Processing

The tool doesn’t stop at a single pass. It continues processing the text until no new entities are detected. This is crucial because sometimes:

Entity recognition improves after initial replacements
Complex name patterns become more apparent
Context changes reveal previously missed entities
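
The repeat-until-stable idea can be sketched as a small loop. Everything here is illustrative: `deidentify_until_stable` and `toy_pass` are hypothetical names, the `max_passes` safety cap is my own addition, and the toy pass deliberately replaces only one name per call to imitate a detector that misses entities on its first attempt.

```python
import re
from typing import Callable


def deidentify_until_stable(
    text: str,
    single_pass: Callable[[str], tuple[str, int]],
    max_passes: int = 10,
) -> tuple[str, int]:
    """Run single_pass until it reports no replacements; return (text, passes run)."""
    for passes in range(1, max_passes + 1):
        text, replaced = single_pass(text)
        if replaced == 0:
            return text, passes
    # Safety cap reached: return what we have rather than loop forever.
    return text, max_passes


# Toy stand-in for an NER pass (an assumption, not spaCy).
KNOWN_NAMES = re.compile(r"\b(?:John Smith|Jane Doe)\b")


def toy_pass(text: str) -> tuple[str, int]:
    # subn returns (new_text, number_of_substitutions); count=1 limits each
    # pass to a single replacement so that several passes are needed.
    return KNOWN_NAMES.subn("PERSON", text, count=1)


result, passes = deidentify_until_stable("John Smith met Jane Doe.", toy_pass)
print(result, passes)
# prints: PERSON met PERSON. 3
```

The final pass is the one that finds nothing, which is what confirms the text is stable.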

Real-World Applications

This tool is particularly useful in several scenarios:

Healthcare: De-identifying patient records and medical notes
Legal: Processing sensitive court documents and client communications
HR: Anonymizing employee records and communications
Research: Preparing data for publication while protecting privacy
Compliance: Supporting GDPR and HIPAA requirements for data handling

Technical Implementation

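Configuration is straightforward. Here’s an example of a custom configuration that selects the spaCy model, switches the output style to HTML, sets the replacement text, and enables debug output: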

from deidentification import Deidentification, DeidentificationConfig, DeidentificationOutputStyle

config = DeidentificationConfig(
    spacy_model="en_core_web_trf",
    output_style=DeidentificationOutputStyle.HTML,
    replacement="[REDACTED]",
    debug=True
)
deidentifier = Deidentification(config)

Getting Started

Want to try it out? Installation is simple:

pip install text-deidentification

# or...

pip install git+https://github.com/jftuga/deidentification.git

GitHub repo: https://github.com/jftuga/deidentification

The package requires Python 3.10 or higher and spaCy’s en_core_web_trf model.
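
A quick way to check these requirements before running anything is sketched below. The `missing_requirements` helper is hypothetical and not part of the package; it relies on the fact that spaCy pipelines such as en_core_web_trf are installed as regular Python packages, so the standard library can look them up by name.

```python
import importlib.util
import sys


def missing_requirements(model: str = "en_core_web_trf") -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or higher is required")
    for package in ("spacy", model):
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(package) is None:
            problems.append(f"missing package: {package}")
    return problems


for problem in missing_requirements():
    print(problem)
# If the model is reported missing, spaCy's own downloader can fetch it:
#   python -m spacy download en_core_web_trf
```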

Conclusion

De-identification is a critical component of modern data processing pipelines. This tool provides a robust, efficient, and easy-to-use solution for handling sensitive information in text. Whether you’re dealing with medical records, legal documents, or any text containing PII, the deidentification package offers a reliable way to protect privacy while maintaining text coherence.

Give it a try and let me know what you think! The project is open source and contributions are welcome.
