Summary: Presidio's openness, flexibility, and multilingual support give it a clear edge, especially for scenarios that demand heavy customization. Compared with commercial products such as Amazon Comprehend and Google DLP, Presidio is cheaper and more transparent; compared with pure NER tools such as Flair and spaCy, Presidio provides a complete anonymization pipeline and broader functionality. We therefore chose the Presidio framework as the core component of our data-anonymization feature.

About Presidio

Presidio components

Presidio provides a flexible, extensible framework built from the following core components:

1. Analyzer

○ Role: detects and identifies sensitive information in text or images, such as names, credit card numbers, phone numbers, and addresses.

○ How it works: uses predefined or custom recognizers, combining natural language processing (NLP), regular expressions, and pretrained models to analyze the input and return the position and confidence score of each sensitive entity.

○ Customization: developers can create recognizers for specific types, such as full-width credit card numbers or region-specific phone numbers.

2. Anonymizer

○ Role: de-identifies the sensitive information found by the analyzer, for example replacing a name with a placeholder (such as <PERSON>) or masking a credit card number.

○ How it works: applies the configured strategy (replacement, masking, or a custom de-identification operation) to ensure the output contains no sensitive information.

○ Customization: de-identification rules can be defined in YAML files or in code, for example the replacement character or the mask format.

3. Image Redactor

○ Role: handles sensitive information in images, using optical character recognition (OCR) to detect text and then redact or cover the sensitive content.

○ Use cases: scanned documents, ID card photos, and similar inputs, producing a redacted image.

○ This module is still immature and not recommended for production use; in the future, multimodal LLM capabilities may offer a better solution.

4. Structured

○ Role: extends Presidio to structured data formats, focusing on tabular and semi-structured (JSON) data. It uses Presidio-Analyzer's detection to identify columns or keys containing personally identifiable information (PII) and builds a mapping between those column/key names and the detected PII entities. Presidio-Anonymizer is then used to apply de-identification to every value in the columns identified as containing PII, ensuring sensitive data is properly protected.

○ Use cases: anonymizing batch structured data such as JSON, tables, databases, and data warehouses. It sees little use in LLM application scenarios.
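Stripped of Presidio specifics, the key-to-entity mapping idea behind Structured can be sketched in a few lines of plain Python. Everything here (the KEY_TO_ENTITY mapping and the hash-tag format) is an illustrative assumption, not the presidio-structured API:

```python
import hashlib

# Hypothetical key -> entity mapping, standing in for what presidio-structured
# derives by running the analyzer over sample values of each key.
KEY_TO_ENTITY = {"name": "PERSON", "phone": "PHONE_NUMBER"}

def anonymize_records(records, key_to_entity):
    """Replace every value of a PII-bearing key with a stable hash tag."""
    out = []
    for rec in records:
        clean = dict(rec)
        for key, entity in key_to_entity.items():
            if key in clean:
                digest = hashlib.sha256(str(clean[key]).encode()).hexdigest()[:8]
                clean[key] = f"<{entity}:{digest}>"
        out.append(clean)
    return out

records = [{"name": "Zhang Wei", "phone": "138-1234-5678", "city": "Beijing"}]
print(anonymize_records(records, KEY_TO_ENTITY))
```

Because the same value always hashes to the same tag, joins across rows still work after anonymization, which is one reason column-level de-identification suits batch data.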

presidio-analyzer

Analyzer engines: spaCy, Stanza, Transformers

Presidio's Analyzer engine is the core of its PII detection: it finds the sensitive information in a text. It combines several techniques, including named entity recognition (NER), regular expressions, rule-based logic, checksum validation, and context analysis, to achieve high-precision detection. The Analyzer supports the following NLP frameworks and models:

1. spaCy

Overview: spaCy is an efficient open-source NLP library widely used for NER. By default, Presidio uses spaCy's pretrained models (such as en_core_web_lg) to recognize entities like names, locations, and organizations.

Strengths

■ Lightweight and fast, suitable for real-time applications.

■ Multilingual models, covering English, Chinese, Japanese, Korean, and more.

■ Context analysis improves PII precision, for example distinguishing ordinary words from sensitive entities by their surroundings.

Limitations

■ Non-standard or domain-specific PII (such as industry-specific codes) requires custom rules.

2. Stanza

Overview: Stanza is an NLP library developed at Stanford that provides high-quality NER models and handles complex linguistic structures particularly well.

Strengths

■ Supports many languages, including low-resource ones such as Arabic, Thai, Burmese, and Vietnamese.

■ Fine-grained NER labels, suitable for high-precision scenarios.

■ Built on deep learning models with strong performance.

Limitations

■ Larger models; deployment may need more resources.

■ Support for some languages may be less mature than spaCy's.

3. Transformers

Overview: Transformers is Hugging Face's NLP library, built on the Transformer architecture and supporting advanced models such as BERT and RoBERTa.

Strengths

■ Excellent NER performance with high accuracy.

■ Supports fine-tuning custom models for domain-specific PII detection.

■ Handles long text and complex context.

Limitations

■ High compute requirements; may not suit resource-constrained environments.

■ Training and fine-tuning require expertise.

Presidio's Analyzer engine supports all three NLP frameworks, so users can choose whichever fits their needs: Transformers where performance and accuracy matter most, spaCy where resources are limited or fast deployment is needed.
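That choice can be captured as a small configuration switch. A sketch with illustrative scenario names and models; the nested model_name dict mirrors the format used by Presidio's transformers engine configuration, but treat the details as assumptions:

```python
def engine_config(scenario: str) -> dict:
    """Return an nlp-engine configuration dict for a given scenario (illustrative)."""
    if scenario == "high_accuracy":
        # GPU available, accuracy first: transformers engine
        return {"nlp_engine_name": "transformers",
                "models": [{"lang_code": "en",
                            "model_name": {"spacy": "en_core_web_sm",
                                           "transformers": "dslim/bert-base-NER"}}]}
    if scenario == "low_resource_lang":
        # Low-resource languages, e.g. Thai: stanza engine
        return {"nlp_engine_name": "stanza",
                "models": [{"lang_code": "th", "model_name": "th"}]}
    # Default: fast, lightweight spaCy
    return {"nlp_engine_name": "spacy",
            "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}]}

print(engine_config("high_accuracy")["nlp_engine_name"])
```

Such a dict can then be fed to NlpEngineProvider, as the full example later in this article does for spaCy.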

presidio-anonymizer

Anonymization methods: replace, redact, hash, mask, encrypt, custom

Presidio's Anonymizer component offers several anonymization methods, so users can pick the right strategy for each need:

replace: substitute the detected value with a given new value or placeholder.

redact: remove the detected value from the text entirely.

hash: replace the detected value with its hash, which is irreversible.

mask: replace some or all of the characters with a masking character such as *.

encrypt: encrypt the detected value with a key, allowing later restoration by decryption.

custom: apply a user-supplied lambda to the detected value.

Through these methods, the Anonymizer gives users a flexible, powerful data-protection tool suited to a wide range of anonymization needs.
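To make the strategies concrete, here is a stdlib-only sketch of how four of them transform a detected span. This is not Presidio's implementation (encrypt and custom are omitted because they need a key or a user lambda):

```python
import hashlib

def apply_operator(text, start, end, method, mask_char="*", new_value="<REDACTED>"):
    """Apply one anonymization strategy to text[start:end] and return the new text."""
    span = text[start:end]
    if method == "replace":      # substitute a fixed placeholder
        repl = new_value
    elif method == "redact":     # remove the span entirely
        repl = ""
    elif method == "hash":       # one-way hash of the span
        repl = hashlib.sha256(span.encode()).hexdigest()
    elif method == "mask":       # keep the length, hide the characters
        repl = mask_char * len(span)
    else:
        raise ValueError(f"unsupported method: {method}")
    return text[:start] + repl + text[end:]

text = "My name is Raphael."
print(apply_operator(text, 11, 18, "mask"))  # My name is *******.
```

Note that every method except replace-with-a-recorded-mapping loses information: hash and redact cannot be reversed, which matters when the anonymized text must later be restored.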

Exploration and practice

To implement data anonymization, GPTBots added a data-anonymization subsystem to the platform, with the architecture shown in the figure:

Flow overview

  1. Before calling the model vendor's MaaS service, the Agent service sends the content through the API gateway to Presidio-analyzer for detection;

  2. Presidio-analyzer returns the detection results to the Agent service;

  3. After validating the results, the Agent service sends the original content together with the detection results to Presidio-anonymizer for anonymization;

  4. Presidio-anonymizer returns the anonymized result to the Agent service;

  5. The Agent service reassembles the anonymized content into a model request and calls the vendor's MaaS service for inference;

  6. When the vendor returns the inference result, the anonymization record is used to restore the original values, achieving secure, anonymized AI inference over sensitive data.
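Steps 3 through 6 hinge on keeping a record that maps each placeholder back to its original value. A minimal stdlib-only sketch of that round trip; the tag format and helper names are our own, not a Presidio API:

```python
def anonymize_with_record(text, spans):
    """Replace each detected span with a numbered tag; return text plus a restore map.

    spans: list of (start, end, entity_type) tuples, assumed non-overlapping.
    """
    record = {}
    # Process spans right-to-left so earlier offsets stay valid while editing
    for i, (start, end, entity) in enumerate(sorted(spans, key=lambda s: -s[0])):
        tag = f"<{entity}_{i}>"
        record[tag] = text[start:end]
        text = text[:start] + tag + text[end:]
    return text, record

def restore(text, record):
    """Undo anonymization on the model's response using the record."""
    for tag, original in record.items():
        text = text.replace(tag, original)
    return text

msg = "Call Zhang Wei at 138-1234-5678"
anon, rec = anonymize_with_record(msg, [(5, 14, "PERSON"), (18, 31, "PHONE")])
print(anon)
assert restore(anon, rec) == msg
```

In production the record must be kept per request (step 6 above), and restoration only works if the model echoes the tags back intact, which is why distinctive tag formats are used.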

Key analyzer code

Software versions

Python: 3.12

presidio: 2.2.358

spaCy: 3.8.2

stanza: 1.10.1

A simple example:

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern
from presidio_analyzer.nlp_engine import NlpEngineProvider
from typing import Optional


# Custom recognizer class
class CustomIDRecognizer(PatternRecognizer):
    # Pattern to detect: assume IDs of the form "ID-" followed by 6 digits
    PATTERNS = [
        Pattern(
            name="custom_id_pattern",
            regex=r"ID-\d{6}",
            score=0.85,  # confidence score
        ),
    ]

    # Context keywords that boost detection confidence
    CONTEXT = ["id", "identifier", "number"]

    def __init__(self, supported_language: str = "en"):
        super().__init__(
            supported_entity="CUSTOM_ID",
            patterns=self.PATTERNS,
            context=self.CONTEXT,
            supported_language=supported_language,
        )

    def invalidate_result(self, pattern_text: str) -> Optional[bool]:
        """
        Optional: extra validation logic, e.g. checking a specific rule.
        Return True to discard a match; None/False keeps it.
        """
        return None


def main():
    # Create the NLP engine
    configuration = {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
    }
    provider = NlpEngineProvider(nlp_configuration=configuration)
    nlp_engine = provider.create_engine()

    # Initialize the analyzer engine
    analyzer = AnalyzerEngine(
        nlp_engine=nlp_engine,
        supported_languages=["en"],
    )

    # Register the custom recognizer
    custom_recognizer = CustomIDRecognizer()
    analyzer.registry.add_recognizer(custom_recognizer)

    # Test text
    text = "My custom ID is ID-123456 and another one is ID-987654."

    # Run the analysis
    results = analyzer.analyze(
        text=text,
        language="en",
        entities=["CUSTOM_ID"],
        return_decision_process=True,
    )

    # Print the results
    print("Analysis Results:")
    for result in results:
        print(f"- Entity: {result.entity_type}")
        print(f"  Text: {text[result.start:result.end]}")
        print(f"  Score: {result.score}")
        print(f"  Start: {result.start}, End: {result.end}")
        print("---")


if __name__ == "__main__":
    main()


The API

AnalyzerRequest: the analysis request schema

text (str): The text to analyze.

language (str): The language of the text.

entities (Optional[List[str]], default None): List of PII entities to look for in the text. If entities=None, all entities are looked for.

correlation_id (Optional[str], default None): Cross-call ID for this request.

score_threshold (Optional[float], default None): The minimum score for which to return an identified entity.

return_decision_process (Optional[bool], default False): Whether the analysis decision-process steps are returned in the response.

ad_hoc_recognizers (Optional[List], default None): Recognizers used only for this specific request.

context (Optional[List[str]], default None): Context words that raise the confidence score when matched against the recognized entity's recognizer context.

allow_list (Optional[List[str]], default None): Words the user defines as allowed to remain in the text.

allow_list_match (Optional[str], default 'exact'): How the allow_list should be interpreted, either "exact" or "regex". If regex, results matching any regex condition in the allow_list are allowed and not returned as potential PII; if exact, results exactly matching any value in the allow_list are allowed and not returned as potential PII.

regex_flags (Optional[int], default DOTALL | MULTILINE | IGNORECASE): Regex flags used when allow_list_match is "regex".


Example request:

curl -X POST http://localhost:5002/analyze \
-H "Content-Type: application/json" \
-d '{
"text": "我的名字是張偉,電話號碼是138-1234-5678,身份證:371427111111111111,銀行卡號:6214111111111111,ip地址:1.1.1.1,username:testuser,password:123456789pw,住在北京市朝陽區(qū)。",
"language": "zh"
}'

Response:

[
{
"analysis_explanation": null,
"end": 71,
"entity_type": "CREDIT_CARD",
"recognition_metadata":
{
"recognizer_identifier": "ZhCreditCardRecognizer_139805391664224",
"recognizer_name": "ZhCreditCardRecognizer"
},
"score": 1.0,
"start": 55
},
{
"analysis_explanation": null,
"end": 84,
"entity_type": "IP_ADDRESS",
"recognition_metadata":
{
"recognizer_identifier": "ZhIpRecognizer_139805391663648",
"recognizer_name": "ZhIpRecognizer"
},
"score": 1.0,
"start": 77
},
{
"analysis_explanation": null,
"end": 25,
"entity_type": "CHINA_MOBILE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666000",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 13
},
{
"analysis_explanation": null,
"end": 26,
"entity_type": "CHINA_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666288",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 14
},
{
"analysis_explanation": null,
"end": 93,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 85
},
{
"analysis_explanation": null,
"end": 102,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 94
},
{
"analysis_explanation": null,
"end": 111,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 103
},
{
"analysis_explanation": null,
"end": 111,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 103
},
{
"analysis_explanation": null,
"end": 123,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 112
},
{
"analysis_explanation": null,
"end": 123,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.8500000000000001,
"start": 112
},
{
"analysis_explanation": null,
"end": 7,
"entity_type": "PERSON",
"recognition_metadata":
{
"recognizer_identifier": "SpacyRecognizer_139805460569152",
"recognizer_name": "SpacyRecognizer"
},
"score": 0.85,
"start": 5
},
{
"analysis_explanation": null,
"end": 132,
"entity_type": "LOCATION",
"recognition_metadata":
{
"recognizer_identifier": "SpacyRecognizer_139805460569152",
"recognizer_name": "SpacyRecognizer"
},
"score": 0.85,
"start": 126
},
{
"analysis_explanation": null,
"end": 26,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 13
},
{
"analysis_explanation": null,
"end": 26,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 13
},
{
"analysis_explanation": null,
"end": 49,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 31
},
{
"analysis_explanation": null,
"end": 49,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 31
},
{
"analysis_explanation": null,
"end": 43,
"entity_type": "CHINA_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666288",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 31
},
{
"analysis_explanation": null,
"end": 40,
"entity_type": "TAIWAN_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665712",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 31
},
{
"analysis_explanation": null,
"end": 39,
"entity_type": "HONGKONG_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665232",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 31
},
{
"analysis_explanation": null,
"end": 40,
"entity_type": "HONGKONG_MOBILE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665568",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 32
},
{
"analysis_explanation": null,
"end": 49,
"entity_type": "CHINA_PASSPORT",
"recognition_metadata":
{
"recognizer_identifier": "ChinesePassportRecognizer_139805391663456",
"recognizer_name": "ChinesePassportRecognizer"
},
"score": 0.4,
"start": 40
},
{
"analysis_explanation": null,
"end": 71,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 55
},
{
"analysis_explanation": null,
"end": 71,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 55
},
{
"analysis_explanation": null,
"end": 67,
"entity_type": "CHINA_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666288",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 55
},
{
"analysis_explanation": null,
"end": 64,
"entity_type": "TAIWAN_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665712",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 55
},
{
"analysis_explanation": null,
"end": 63,
"entity_type": "HONGKONG_MOBILE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665568",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 55
},
{
"analysis_explanation": null,
"end": 64,
"entity_type": "HONGKONG_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665232",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 56
},
{
"analysis_explanation": null,
"end": 71,
"entity_type": "CHINA_PASSPORT",
"recognition_metadata":
{
"recognizer_identifier": "ChinesePassportRecognizer_139805391663456",
"recognizer_name": "ChinesePassportRecognizer"
},
"score": 0.4,
"start": 62
},
{
"analysis_explanation": null,
"end": 84,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 77
},
{
"analysis_explanation": null,
"end": 84,
"entity_type": "USERNAME",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391667104",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 77
},
{
"analysis_explanation": null,
"end": 93,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 85
},
{
"analysis_explanation": null,
"end": 102,
"entity_type": "PASSWORD",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666768",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 94
},
{
"analysis_explanation": null,
"end": 121,
"entity_type": "CHINA_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391666288",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 112
},
{
"analysis_explanation": null,
"end": 121,
"entity_type": "HONGKONG_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665232",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 113
},
{
"analysis_explanation": null,
"end": 121,
"entity_type": "TAIWAN_LANDLINE_NUMBER",
"recognition_metadata":
{
"recognizer_identifier": "PatternRecognizer_139805391665712",
"recognizer_name": "PatternRecognizer"
},
"score": 0.4,
"start": 113
}
]

The API builds on the official presidio-analyzer/app.py, optimizing it and adding endpoints:

  1. a registry_recognizer endpoint, so customers can register their own recognizers;

  2. a remove_recognizer endpoint, to remove customer-defined recognizers;

  3. preset custom recognizers added to the models, e.g. EMAIL, recognized via regular expressions;

  4. multiple analyzer engines running side by side (spaCy, Stanza, Transformers) to cover different detection scenarios.

# -*- coding: utf-8 -*-
"""REST API server for analyzer."""
import logging
import os
import re
from logging.config import fileConfig
from pathlib import Path
from typing import Dict, Tuple

from flask import Flask, jsonify, request
from presidio_analyzer import (
    AnalyzerEngine,
    AnalyzerEngineProvider,
    Pattern,
    PatternRecognizer,
)
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
from werkzeug.exceptions import HTTPException

DEFAULT_PORT = "3000"

LOGGING_CONF_FILE = "logging.ini"

WELCOME_MESSAGE = r"""



_______ _______ _______ _______ _________ ______ _________ _______
( ____ )( ____ )( ____ \( ____ \\__ __/( __ \ \__ __/( ___ )
| ( )|| ( )|| ( \/| ( \/ ) ( | ( \ ) ) ( | ( ) |
| (____)|| (____)|| (__ | (_____ | | | | ) | | | | | | |
| _____)| __)| __) (_____ ) | | | | | | | | | | | |
| ( | (\ ( | ( ) | | | | | ) | | | | | | |
| ) | ) \ \__| (____/\/\____) |___) (___| (__/ )___) (___| (___) |
|/ |/ \__/(_______/\_______)\_______/(______/ \_______/(_______)
"""


def is_valid_regex(pattern: str) -> bool:
    """Return True if the given string compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


class RegistryRecognizerRequest:

    def __init__(self, req_data: Dict):
        self.entity = req_data.get("entity")
        self.language = req_data.get("language")
        self.regex = req_data.get("regex")
        self.context = req_data.get("context")
        self.deny_list = req_data.get("deny_list")


class RemoveRecognizerRequest:

    def __init__(self, req_data: Dict):
        self.entity = req_data.get("entity")
        self.language = req_data.get("language")


class Server:
    """HTTP Server for calling Presidio Analyzer."""

    def __init__(self):
        fileConfig(Path(Path(__file__).parent, LOGGING_CONF_FILE))
        self.logger = logging.getLogger("presidio-analyzer")
        self.logger.setLevel(os.environ.get("LOG_LEVEL", self.logger.level))
        self.app = Flask(__name__)

        # The default NLP engine is spaCy
        analyzer_conf_file = os.environ.get("ANALYZER_CONF_FILE")
        nlp_engine_conf_file = os.environ.get("NLP_CONF_FILE")
        recognizer_registry_conf_file = os.environ.get("RECOGNIZER_REGISTRY_CONF_FILE")

        # A Stanza engine adds support for low-resource languages,
        # see https://stanfordnlp.github.io/stanza/ner_models.html
        stanza_nlp_engine_conf_file = os.environ.get("STANZA_NLP_CONF_FILE")

        # spaCy NLP engine
        self.logger.info("Starting analyzer engine (spacy)")
        self.engine: AnalyzerEngine = AnalyzerEngineProvider(
            analyzer_engine_conf_file=analyzer_conf_file,
            nlp_engine_conf_file=nlp_engine_conf_file,
            recognizer_registry_conf_file=recognizer_registry_conf_file,
        ).create_engine()
        self.engine.context_aware_enhancer = LemmaContextAwareEnhancer(
            context_similarity_factor=0.45, min_score_with_context_similarity=0.35
        )

        # Stanza NLP engine
        self.logger.info("Starting analyzer engine (stanza)")
        self.stanza_engine: AnalyzerEngine = AnalyzerEngineProvider(
            analyzer_engine_conf_file=analyzer_conf_file,
            nlp_engine_conf_file=stanza_nlp_engine_conf_file,
            recognizer_registry_conf_file=recognizer_registry_conf_file,
        ).create_engine()
        self.stanza_engine.context_aware_enhancer = LemmaContextAwareEnhancer(
            context_similarity_factor=0.45, min_score_with_context_similarity=0.35
        )

        ## Preset custom recognizers for the zh model
        # Email address recognizer
        email_pattern = Pattern(
            name="email_pattern",
            regex=r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
            score=0.85,
        )
        email_recognizer = PatternRecognizer(
            supported_entity="EMAIL_ADDRESS",
            patterns=[email_pattern],
            supported_language="zh",
            # Chinese context keywords improve accuracy
            context=["郵箱", "電子郵件", "聯(lián)系方式", "mail", "email"],
        )
        self.engine.registry.add_recognizer(email_recognizer)
        ## end

        self.logger.info(WELCOME_MESSAGE)

        @self.app.route("/health")
        def health() -> str:
            """Return basic health probe result."""
            return "Presidio Analyzer service is up"

        @self.app.route("/remove_recognizer", methods=["POST"])
        def remove_recognizer():
            """Remove a recognizer."""
            try:
                req_data = RemoveRecognizerRequest(request.get_json())
                if not req_data.entity:
                    raise Exception("No entity provided")

                if not req_data.language or not isinstance(req_data.language, list):
                    raise Exception("No language provided, it must be of array type.")

                # Remove the recognizer if it exists
                # (removal by entity + language is our extension of the registry)
                self.engine.registry.remove_recognizer(req_data.entity, req_data.language)

            except Exception as e:
                self.logger.error(
                    f"Error occurred while calling the interface remove_recognizer. {e}"
                )
                return jsonify(error=e.args[0]), 500
            return "Removed successfully"

        @self.app.route("/registry_recognizer", methods=["POST"])
        def registry_recognizer():
            """Register a recognizer."""
            try:
                req_data = RegistryRecognizerRequest(request.get_json())
                if not req_data.entity:
                    raise Exception("No entity provided")

                if not req_data.language or not isinstance(req_data.language, list):
                    raise Exception("No language provided, it must be of array type.")

                if not req_data.regex:
                    raise Exception("No regex provided")

                if not is_valid_regex(req_data.regex):
                    raise Exception(f"Invalid regex({req_data.regex})")

                if req_data.context and not isinstance(req_data.context, list):
                    raise Exception("Context must be of array type.")

                if req_data.deny_list and not isinstance(req_data.deny_list, list):
                    raise Exception("Deny_list must be of array type.")

                # Remove any existing recognizer for this entity first
                self.engine.registry.remove_recognizer(req_data.entity, req_data.language)

                pattern = Pattern(
                    name=req_data.entity,
                    regex=req_data.regex,
                    score=0.4,
                )
                # Create and register a new recognizer for each requested language
                for lang in req_data.language:
                    custom_recognizer = PatternRecognizer(
                        supported_entity=req_data.entity,
                        patterns=[pattern],
                        supported_language=lang,
                        context=req_data.context,  # context keywords improve accuracy
                        # deny_list entries are matched as exact strings, score=1
                        deny_list=req_data.deny_list,
                    )
                    self.engine.registry.add_recognizer(custom_recognizer)

            except Exception as e:
                self.logger.error(
                    f"Error occurred while calling the interface registry_recognizer. {e}"
                )
                return jsonify(error=e.args[0]), 500
            return "Registered successfully"

        @self.app.route("/analyze", methods=["POST"])
        def analyze() -> Tuple[str, int]:
            """Execute the analyzer function."""
            # Request parsing omitted here; see the official implementation

        @self.app.route("/recognizers", methods=["GET"])
        def recognizers() -> Tuple[str, int]:
            """Return a list of supported recognizers."""
            # Omitted here; see the official implementation

        @self.app.route("/supportedentities", methods=["GET"])
        def supported_entities() -> Tuple[str, int]:
            """Return a list of supported entities."""
            # Omitted here; see the official implementation

        @self.app.errorhandler(HTTPException)
        def http_exception(e):
            return jsonify(error=e.description), e.code


def create_app():  # noqa
    server = Server()
    return server.app


if __name__ == "__main__":
    app = create_app()
    port = int(os.environ.get("PORT", DEFAULT_PORT))
    app.run(host="0.0.0.0", port=port)

Configuration files

The main changes are to the spacy.yaml, stanza.yaml, and transformers.yaml files in the presidio_analyzer/conf directory, covering:

1. nlp_engine_name: which engine to use; supports spacy, stanza, and transformers;

2. models: the mapping between languages and models; each language maps to its own model, and each engine has its own model list;

3. ner_model_configuration: NER-related options:

a. labels_to_ignore: A list of labels to ignore. For example, O (no entity) or entities you are not interested in returning.

b. model_to_presidio_entity_mapping: A mapping between the transformers model labels and the Presidio entity types.

c. low_confidence_score_multiplier: A multiplier to apply to the score of entities with low confidence.

d. low_score_entity_names: A list of entity types to apply the low confidence score multiplier to.

nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg

ner_model_configuration:
  model_to_presidio_entity_mapping:
    PER: PERSON
    PERSON: PERSON
    NORP: NRP
    FAC: LOCATION
    LOC: LOCATION
    LOCATION: LOCATION
    GPE: LOCATION
    ORG: ORGANIZATION
    ORGANIZATION: ORGANIZATION
    DATE: DATE_TIME
    TIME: DATE_TIME

  low_confidence_score_multiplier: 0.4
  low_score_entity_names: []
  labels_to_ignore:
    - ORG
    - ORGANIZATION # has many false positives
    - CARDINAL
    - EVENT
    - LANGUAGE
    - LAW
    - MONEY
    - ORDINAL
    - PERCENT
    - PRODUCT
    - QUANTITY
    - WORK_OF_ART

Supported PII types

PERSON

NRP

LOCATION

ORGANIZATION

DATE_TIME

----- the entries below are preset custom recognizers we added (easy to extend)

USERNAME

PASSWORD

CHINA_MOBILE_NUMBER

CHINA_LANDLINE_NUMBER

TAIWAN_MOBILE_NUMBER

TAIWAN_LANDLINE_NUMBER

HONGKONG_MOBILE_NUMBER

HONGKONG_LANDLINE_NUMBER

PHONE_NUMBER

EMAIL_ADDRESS

CHINA_ID

TAIWAN_ID

HKID

CHINA_BANK_ID

CREDIT_CARD

CHINA_PASSPORT

Anonymizer implementation

The anonymizer side is comparatively simple: based on the analyzer's results, it performs anonymizing edits on the original text (in AI scenarios, restoring the inference result is something you have to implement yourself). The official docs describe it clearly; a simple example:

# -*- coding: utf-8 -*-
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig, RecognizerResult
from faker import Faker
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def create_fake_name(_: str = "") -> str:
    """Generate a fake name for anonymizing PERSON entities.

    The custom operator passes the matched text as an argument;
    it is ignored here and replaced with a generated name.
    """
    fake = Faker()  # create the Faker instance inside the function, avoiding a global
    return fake.name()


def anonymize_text(text: str, analyzer_results: list[RecognizerResult]) -> str:
    """
    Anonymize the sensitive information in a text.

    Args:
        text (str): The text to anonymize.
        analyzer_results (list[RecognizerResult]): Entities found by the analyzer.

    Returns:
        str: The anonymized text.

    Raises:
        ValueError: If the input arguments are invalid.
        Exception: If anonymization fails.
    """
    if not text or not analyzer_results:
        logger.error("Invalid input: text or analyzer_results is empty")
        raise ValueError("Text and analyzer_results must not be empty")

    try:
        # Configure the anonymizer
        operators = {"PERSON": OperatorConfig("custom", {"lambda": create_fake_name})}
        anonymizer = AnonymizerEngine()

        # Run anonymization
        result = anonymizer.anonymize(
            text=text,
            analyzer_results=analyzer_results,
            operators=operators
        )
        logger.info("Text anonymization successful")
        return result.text

    except Exception as e:
        logger.error(f"Anonymization failed: {str(e)}")
        raise


def main():
    """Demonstrate the anonymization flow."""
    try:
        # Example input
        text_to_anonymize = "My name is Raphael and I like to fish."
        analyzer_results = [RecognizerResult(entity_type="PERSON", start=11, end=18, score=0.8)]

        # Run anonymization
        anonymized_text = anonymize_text(text_to_anonymize, analyzer_results)
        print(f"Anonymized text: {anonymized_text}")

    except Exception as e:
        logger.error(f"Error in main: {str(e)}")
        print(f"Error: {str(e)}")


if __name__ == "__main__":
    main()

Performance testing

Server: 10 cores, 40 GB RAM (Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz)

Load generated with the Locust framework

10 users, each sending 1 request per second, about 500 characters per request

spaCy model: the strongest Chinese model, https://spacy.io/models/zh#zh_core_web_trf

A single AnalyzerEngine started in the API process

Conclusion: presidio-analyzer is mainly CPU- and memory-bound (very long text can cause OOM; we recommend staying under 10k characters). 10 cores sustain about 3 QPS (RPM=180); reaching 10 QPS (RPM=600) would take roughly 32 cores. The transformers engine runs transformer-architecture models that need GPU compute, which was not cost-effective, so we set it aside for now. (The numbers above are for reference only.)

Lessons learned

While designing and building with Presidio we ran into some problems worth sharing; consider this a starting point:

1. We hit several bugs in the Presidio code and had to patch them ourselves as needed; test as thoroughly as you can;

2. The maximum text length the analyzer supports depends mainly on the model in use; measure it yourself to avoid OOM;

3. A custom recognizer's score can be raised through context matches; use the score threshold to keep the false-positive rate under control;

4. The register and remove recognizer endpoints behave differently for custom versus preset recognizers; it is best to implement them yourself;

5. For mixed-language text, two approaches:

a. use a transformer model: one large model can do NER for several languages at once, but model quality on Hugging Face varies widely and training is expensive, so careful screening and validation are needed;

b. run each language's model separately: for mixed Chinese and English, run both the Chinese and English models, then merge and confirm the results before anonymizing.

6. For production, packaging as a Docker image is convenient; download the model files ahead of time and bake them into the image so it can run offline on an internal network;

7. The anonymizer itself uses no model; it only edits strings based on the analyzer's results. Consider implementing this step yourself, because the anonymized data is sent to the LLM for inference and may lose key information, and the text may need to be restored after the LLM returns its result;

8. To increase the analyzer's concurrency, you can start multiple instances, or initialize several AnalyzerEngine objects inside one instance as an engine pool and dispatch requests across the different engines. Both approaches work, but CPU and memory consumption grow proportionally.
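A sketch of the in-process engine-pool idea from point 8. The factory stands in for the expensive AnalyzerEngine construction, and the locking scheme is one possible design, not anything prescribed by Presidio:

```python
import itertools
import threading

class EnginePool:
    """Round-robin pool of engines, each serialized by its own lock."""

    def __init__(self, factory, size):
        # `factory` stands in for building an AnalyzerEngine (expensive, done once each)
        self._engines = [(factory(), threading.Lock()) for _ in range(size)]
        self._next = itertools.cycle(range(size))
        self._pick = threading.Lock()

    def analyze(self, text):
        # Pick the next engine under a short lock, then run under that engine's lock
        with self._pick:
            engine, lock = self._engines[next(self._next)]
        with lock:  # one request per engine at a time
            return engine(text)

# Toy engine: a callable standing in for AnalyzerEngine.analyze
pool = EnginePool(factory=lambda: (lambda t: t.upper()), size=4)
print(pool.analyze("pii"))  # PII
```

Because each engine holds its own model in memory, the pool size multiplies RAM usage, which matches the resource-growth caveat above.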

References

  1. https://microsoft.github.io/presidio/learn_presidio/

  2. https://spacy.io/models

  3. https://stanfordnlp.github.io/stanza/ner_models.html

  4. https://huggingface.co/models?pipeline_tag=token-classification
