Named Entity Recognition (NER)

Project Neo uses Named Entity Recognition (NER) to identify and extract structured entities from document text. NER runs as a dedicated microservice and processes files asynchronously after text extraction completes.

How It Works

  1. A file is uploaded to a share that has enable_ner_analysis turned on.
  2. Neo extracts text from the file (PDF, DOCX, etc.).
  3. The NER service analyses the extracted text using the share's configured schema.
  4. Detected entities, classifications, and structured extractions are stored in the database.
  5. Results are available through the API as soon as they are stored.

The NER engine uses zero-shot recognition -- it accepts a list of target entity labels and returns spans with confidence scores, so no task-specific training is needed.
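
To make the input and output of zero-shot recognition concrete, the sketch below filters a list of detected spans by requested label and confidence. The Span fields and the filter_spans helper are illustrative assumptions, not Neo's actual data model or the model library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One detected entity (hypothetical shape, for illustration only)."""
    text: str
    label: str
    start: int    # character offset where the entity begins
    end: int      # character offset just past the entity
    score: float  # model confidence in [0.0, 1.0]


def filter_spans(spans, labels, threshold):
    """Keep spans whose label was requested and whose score meets the threshold."""
    wanted = set(labels)
    return [s for s in spans if s.label in wanted and s.score >= threshold]


# Hand-written sample spans; a real run would receive these from the model.
spans = [
    Span("Ada Lovelace", "person", 0, 12, 0.93),
    Span("London", "location", 22, 28, 0.64),
    Span("1843", "date", 32, 36, 0.81),
]
kept = filter_spans(spans, ["person", "date", "location"], threshold=0.7)
print([s.text for s in kept])  # ['Ada Lovelace', '1843']
```

The "London" span is dropped only because its score falls below the threshold, which is why lowering the threshold is a common first step when results look sparse.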

NER Schemas

A schema defines which entity types to extract, which document classifications to apply, and what structured fields to pull from the text. Five pre-built schemas ship with Neo.

Default Schema

General-purpose extraction suitable for most document types.

Entity Type     Description
person          Names of people, individuals, or human beings
organization    Company names, institutions, agencies, or organizations
location        Geographic locations, cities, countries, addresses
date            Dates, time periods, or temporal references
money           Monetary amounts, prices, or financial values
email           Email addresses
phone           Phone numbers or contact numbers
url             Web URLs or links

Classifications: document_type (report, memo, email, contract, invoice, policy, manual, other), language (english, spanish, french, german, other)

Default confidence threshold: 0.7

Legal Schema

Optimized for contracts, agreements, and court filings.

Entity Type     Description
party           Legal parties, signatories, or contracting entities
person          Names of individuals mentioned in the document
organization    Company names, law firms, or institutions
date            Dates, deadlines, or time periods
money           Monetary amounts, fees, or financial terms
jurisdiction    Legal jurisdictions, courts, or governing law references
case_number     Case numbers, docket numbers, or reference numbers
law_reference   References to laws, statutes, or regulations

Classifications: document_type (contract, agreement, amendment, nda, mou, letter_of_intent, court_filing, legal_opinion, terms_of_service, privacy_policy, other), contract_status (draft, pending_signature, executed, expired, terminated)

Structured extraction: contract_terms (parties, effective_date, expiration_date, term_length, renewal, termination_notice, governing_law, total_value)

Default confidence threshold: 0.75

Financial Schema

Tailored for invoices, bank statements, and financial reports.

Entity Type     Description
company         Company names, corporations, or business entities
person          Names of individuals, executives, or account holders
money           Monetary amounts, prices, or financial values
percentage      Percentage values, rates, or ratios
date            Dates, fiscal periods, or time references
account_number  Bank account numbers or financial account identifiers
ticker          Stock ticker symbols
currency        Currency types or codes

Classifications: document_type (invoice, receipt, bank_statement, financial_report, tax_document, expense_report, purchase_order, quote, other), transaction_type (payment, refund, transfer, deposit, withdrawal, fee, other)

Structured extraction: transaction (amount, date, description, type, account, reference), invoice_details (invoice_number, vendor, customer, subtotal, tax, total, due_date)

Default confidence threshold: 0.8

Healthcare Schema

Designed for medical records, lab reports, and prescriptions.

Entity Type     Description
patient         Patient names or identifiers
provider        Healthcare provider names, doctors, or medical staff
organization    Hospitals, clinics, or healthcare facilities
medication      Drug names, medications, or pharmaceutical substances
dosage          Medication dosages, amounts, or frequencies
condition       Medical conditions, diagnoses, or symptoms
procedure       Medical procedures, treatments, or interventions
date            Dates, appointment times, or time references
lab_value       Laboratory test values or measurements

Classifications: document_type (medical_record, lab_report, prescription, discharge_summary, referral, insurance_claim, consent_form, other), urgency (routine, urgent, emergency)

Structured extraction: patient_info, prescription, visit_summary

Default confidence threshold: 0.8

HR Schema

Built for resumes, offer letters, and employee records.

Entity Type     Description
person          Names of individuals, candidates, or employees
organization    Company names, employers, or institutions
job_title       Job titles, positions, or roles
skill           Skills, competencies, or qualifications
education       Educational institutions, degrees, or certifications
date            Dates, employment periods, or time references
location        Work locations, offices, or addresses
salary          Salary amounts, compensation, or benefits
email           Email addresses
phone           Phone numbers

Classifications: document_type (resume, cover_letter, job_description, offer_letter, performance_review, employee_handbook, policy, other), experience_level (entry, mid, senior, executive)

Structured extraction: candidate_info, employment_history, education

Default confidence threshold: 0.7

Configuration

The NER service is configured through environment variables set on the ner container.

Variable                  Default                  Description
NER_MODEL_NAME            fastino/gliner2-base-v1  Hugging Face model identifier. The model is downloaded on first startup.
NER_CONFIDENCE_THRESHOLD  0.7                      Global minimum confidence score (0.0 to 1.0). Per-share thresholds override this.
NER_DEVICE                auto                     Compute device: auto, cuda, or cpu. auto selects CUDA when a GPU is detected.
NER_MAX_BATCH_SIZE        32                       Upper limit for batch sizing on GPU. Reduce if you encounter out-of-memory errors.
NER_CPU_BATCH_SIZE        16                       Fixed batch size when running on CPU.
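
Because a confidence threshold can be set globally, per schema, and per share, it helps to see one possible resolution order spelled out. The sketch below is an assumption, not Neo's implementation: it takes a per-share value first, then the schema default listed on this page, then the global value. The relative order of schema and global values is not documented here, and effective_threshold is a hypothetical helper.

```python
from typing import Optional

# Schema defaults as listed in the schema sections of this page.
SCHEMA_THRESHOLDS = {
    "default": 0.7,
    "legal": 0.75,
    "financial": 0.8,
    "healthcare": 0.8,
    "hr": 0.7,
}


def effective_threshold(
    global_threshold: float,
    schema: str = "default",
    share_threshold: Optional[float] = None,
) -> float:
    """Resolve a threshold: share override, then schema default, then global.

    Placing the schema default ahead of the global value is an assumption.
    """
    if share_threshold is not None:
        value = share_threshold
    else:
        value = SCHEMA_THRESHOLDS.get(schema, global_threshold)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be within 0.0 to 1.0, got {value}")
    return value


print(effective_threshold(0.7, "legal"))        # 0.75 (schema default)
print(effective_threshold(0.7, "legal", 0.65))  # 0.65 (share override wins)
print(effective_threshold(0.6, "unlisted"))     # 0.6  (falls back to global)
```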

Per-Share NER Settings

NER is enabled and configured per share through the share's rules. When creating or updating a share, include the NER fields in the rules object.

Rule Field                 Type    Default                                                  Description
enable_ner_analysis        bool    false                                                    Master switch. Set to true to run NER on files in this share.
ner_schema                 string  "default"                                                Which pre-built schema to use (default, legal, financial, healthcare, hr).
ner_entity_types           list    ["person", "organization", "location", "date", "money"]  Override the schema's entity types with a custom list.
ner_classifications        object  null                                                     Override classification labels.
ner_structured_extraction  object  null                                                     Override structured extraction fields.
ner_confidence_threshold   float   0.7                                                      Minimum confidence for this share (overrides global and schema defaults).

Example: Legal Schema

bash
curl -s -X POST "$NEO_URL/shares" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Legal Contracts",
    "path": "/mnt/contracts",
    "rules": {
      "enable_ner_analysis": true,
      "ner_schema": "legal",
      "ner_confidence_threshold": 0.75
    }
  }'

Example: Custom Entity Types (No Schema)

bash
curl -s -X POST "$NEO_URL/shares" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Research Papers",
    "path": "/mnt/research",
    "rules": {
      "enable_ner_analysis": true,
      "ner_entity_types": ["person", "organization", "date", "location", "chemical_compound", "gene_name"],
      "ner_confidence_threshold": 0.65
    }
  }'
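
When shares are created from a script rather than by hand, the rules object from the examples above can be assembled and checked before it is sent. The following is a minimal sketch: build_share_payload is a hypothetical helper, and the validation it performs is client-side convenience, not a description of what the API itself enforces.

```python
import json

# Schema names accepted by ner_schema, per the table above.
KNOWN_SCHEMAS = {"default", "legal", "financial", "healthcare", "hr"}


def build_share_payload(name, path, schema=None, entity_types=None, threshold=None):
    """Build the JSON body for POST /shares with NER enabled."""
    rules = {"enable_ner_analysis": True}
    if schema is not None:
        if schema not in KNOWN_SCHEMAS:
            raise ValueError(f"unknown schema: {schema}")
        rules["ner_schema"] = schema
    if entity_types:
        rules["ner_entity_types"] = list(entity_types)
    if threshold is not None:
        if not 0.0 <= threshold <= 1.0:
            raise ValueError("threshold must be within 0.0 to 1.0")
        rules["ner_confidence_threshold"] = threshold
    return {"name": name, "path": path, "rules": rules}


payload = build_share_payload(
    "Legal Contracts", "/mnt/contracts", schema="legal", threshold=0.75
)
print(json.dumps(payload, indent=2))
```

The printed JSON matches the body of the legal-schema request shown earlier and can be passed to curl with -d or to any HTTP client.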

API Endpoints

All NER endpoints are under /ner and require a valid bearer token.

List Schemas

bash
curl -s "$NEO_URL/ner/schemas" \
  -H "Authorization: Bearer $TOKEN" | jq .

Returns all registered schemas with their entity types and indicates whether each schema provides classifications and structured extraction.

Get a Specific Schema

bash
curl -s "$NEO_URL/ner/schemas/legal" \
  -H "Authorization: Bearer $TOKEN" | jq .

Get NER Results for a File

bash
curl -s "$NEO_URL/ner/files/{file_id}" \
  -H "Authorization: Bearer $TOKEN" | jq .

Returns entities, classifications, and structured extractions for a single file.

Get NER Results for a Share

bash
curl -s "$NEO_URL/ner/shares/{share_id}/results?page=1&page_size=50" \
  -H "Authorization: Bearer $TOKEN" | jq .

Returns paginated results across all files in a share. Use the entity_type query parameter to filter:

bash
curl -s "$NEO_URL/ner/shares/{share_id}/results?entity_type=person" \
  -H "Authorization: Bearer $TOKEN" | jq .
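
To collect every result from a paginated endpoint, a client typically requests page 1, 2, and so on until a page comes back short. The sketch below shows that loop in isolation. The response format is not documented on this page, so the HTTP call is left to a caller-supplied fetch_page function (hypothetical) that returns the list of items for a given page and page_size; stopping on a short page is also an assumption.

```python
from typing import Callable, Dict, Iterator, List


def iter_results(
    fetch_page: Callable[[int, int], List[Dict]], page_size: int = 50
) -> Iterator[Dict]:
    """Yield every item, requesting pages until one has fewer than page_size items."""
    page = 1
    while True:
        items = fetch_page(page, page_size)
        yield from items
        if len(items) < page_size:
            break
        page += 1


# Stand-in for the real API call: serves 120 fake results from memory.
data = [{"id": i} for i in range(120)]


def fake_fetch(page: int, page_size: int) -> List[Dict]:
    start = (page - 1) * page_size
    return data[start:start + page_size]


collected = list(iter_results(fake_fetch, page_size=50))
print(len(collected))  # 120
```

If the real response reports a total count or a next-page marker, prefer that over the short-page heuristic.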

Global NER Statistics

bash
curl -s "$NEO_URL/ner/stats" \
  -H "Authorization: Bearer $TOKEN" | jq .

Returns total_entities, total_files_processed, and a breakdown by entity_types.

Per-Share NER Statistics

bash
curl -s "$NEO_URL/ner/shares/{share_id}/stats" \
  -H "Authorization: Bearer $TOKEN" | jq .

Search Entities

Search for entities by value across all shares or within a specific share.

bash
# Search globally
curl -s "$NEO_URL/ner/entities/search?q=NetApp" \
  -H "Authorization: Bearer $TOKEN" | jq .

# Filter by entity type and share
curl -s "$NEO_URL/ner/entities/search?q=NetApp&entity_type=organization&share_id={share_id}" \
  -H "Authorization: Bearer $TOKEN" | jq .
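
Search terms that contain spaces, ampersands, or other reserved characters must be URL-encoded before they are placed in the query string. A minimal sketch using Python's standard library follows; search_url is a hypothetical helper and neo.example.com is a placeholder host.

```python
from urllib.parse import urlencode


def search_url(base_url, q, entity_type=None, share_id=None):
    """Build an entity search URL, encoding the query and optional filters."""
    params = {"q": q}
    if entity_type:
        params["entity_type"] = entity_type
    if share_id:
        params["share_id"] = share_id
    return f"{base_url.rstrip('/')}/ner/entities/search?{urlencode(params)}"


print(search_url("https://neo.example.com", "Acme & Sons", entity_type="organization"))
# https://neo.example.com/ner/entities/search?q=Acme+%26+Sons&entity_type=organization
```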

Aggregate Entities

Get aggregated counts of entity values, useful for dashboards and analytics.

bash
# All entities
curl -s "$NEO_URL/ner/entities/aggregate" \
  -H "Authorization: Bearer $TOKEN" | jq .

# Filter by entity type and share, and limit the number of results
curl -s "$NEO_URL/ner/entities/aggregate?entity_type=person&share_id={share_id}&limit=20" \
  -H "Authorization: Bearer $TOKEN" | jq .
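
Conceptually, aggregation groups identical entity values and counts how often each occurs. The sketch below reproduces that idea locally with collections.Counter. It is an illustration only: the server-side logic and response format are not documented here, and aggregate_entities is a hypothetical helper.

```python
from collections import Counter


def aggregate_entities(entities, entity_type=None, limit=20):
    """Count entity values, optionally for one type, most frequent first."""
    counts = Counter(
        value
        for etype, value in entities
        if entity_type is None or etype == entity_type
    )
    return counts.most_common(limit)


# Sample (entity_type, value) pairs, written by hand for the example.
entities = [
    ("organization", "NetApp"),
    ("person", "Ada Lovelace"),
    ("organization", "NetApp"),
    ("organization", "Acme"),
    ("person", "Ada Lovelace"),
    ("organization", "NetApp"),
]
print(aggregate_entities(entities, entity_type="organization"))
# [('NetApp', 3), ('Acme', 1)]
```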

Get NER Settings

bash
curl -s "$NEO_URL/ner/settings" \
  -H "Authorization: Bearer $TOKEN" | jq .

Returns the current global NER configuration: enabled, model, batch_size, confidence_threshold, device.

Update NER Settings

bash
curl -s -X PUT "$NEO_URL/ner/settings" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "confidence_threshold": 0.8,
    "device": "cuda"
  }' | jq .

Valid device values: auto, cuda, cpu. When the device is changed, the API forwards the change to the NER service so the model is moved immediately.

Trigger Reanalysis

Queue all files in a share for NER reprocessing. By default, only files without existing results are processed. Pass force=true to reanalyze everything.

bash
# Analyze files that are missing NER results
curl -s -X POST "$NEO_URL/ner/shares/{share_id}/reanalyze" \
  -H "Authorization: Bearer $TOKEN" | jq .

# Force reanalysis of all files
curl -s -X POST "$NEO_URL/ner/shares/{share_id}/reanalyze?force=true" \
  -H "Authorization: Bearer $TOKEN" | jq .

Delete NER Results

bash
# Delete results for a single file
curl -s -X DELETE "$NEO_URL/ner/files/{file_id}" \
  -H "Authorization: Bearer $TOKEN" | jq .

# Delete all results for a share
curl -s -X DELETE "$NEO_URL/ner/shares/{share_id}/results" \
  -H "Authorization: Bearer $TOKEN" | jq .

Check Pending Files

bash
curl -s "$NEO_URL/ner/pending?share_id={share_id}&limit=50" \
  -H "Authorization: Bearer $TOKEN" | jq .

GPU Acceleration

The NER service supports GPU acceleration through NVIDIA CUDA and AMD ROCm. GPU variants are built as separate container images (netapp-neo-ner-cuda and netapp-neo-ner-rocm).

Device Selection

Set NER_DEVICE=auto (the default) to let the engine detect available hardware. It checks for CUDA availability via PyTorch and falls back to CPU if no GPU is found.
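
The selection logic described above can be sketched as a small pure function. This is an assumption about the behaviour, not the engine's source: resolve_device and detect_cuda are hypothetical names, and the only library call used is PyTorch's torch.cuda.is_available().

```python
def detect_cuda() -> bool:
    """Return True when PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map an NER_DEVICE value to the device that will actually be used."""
    requested = requested.lower()
    if requested not in {"auto", "cuda", "cpu"}:
        raise ValueError(f"NER_DEVICE must be auto, cuda, or cpu, got {requested!r}")
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    # An explicit choice is returned unchanged (assumption: no silent override).
    return requested


print(resolve_device("auto", cuda_available=False))  # cpu
print(resolve_device("auto", cuda_available=True))   # cuda
```

In a real service the second argument would come from detect_cuda(); it is passed explicitly here so the logic can be exercised on machines without a GPU.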

Text Chunking

For documents longer than the model's context window, the engine automatically splits text into manageable chunks with overlap between adjacent chunks. Entity spans detected across chunk boundaries are deduplicated in post-processing.
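
The overlap-and-deduplicate approach can be sketched as follows. This is a minimal illustration rather than the engine's code: it assumes character-based windows, and the chunk_size and overlap parameters, both function names, and the sample detections are hypothetical.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into (offset, chunk) windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks, start = [], 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


def dedupe_spans(spans):
    """Merge repeated (start, end, label) detections, keeping the highest score."""
    best = {}
    for start, end, label, score in spans:
        key = (start, end, label)
        if key not in best or score > best[key][3]:
            best[key] = (start, end, label, score)
    return sorted(best.values())


text = "Ada Lovelace met Charles Babbage in London."
chunks = chunk_text(text, chunk_size=32, overlap=20)
print([offset for offset, _ in chunks])  # [0, 12]

# Hypothetical per-chunk detections with chunk-local offsets, keyed by chunk offset.
local = {
    0: [(0, 12, "person", 0.95), (17, 32, "person", 0.88)],
    12: [(5, 20, "person", 0.93), (24, 30, "location", 0.80)],
}
# Shift local offsets back to document offsets before deduplicating.
global_spans = [
    (offset + s, offset + e, label, score)
    for offset, found in local.items()
    for s, e, label, score in found
]
merged = dedupe_spans(global_spans)
print([(text[s:e], label, score) for s, e, label, score in merged])
# [('Ada Lovelace', 'person', 0.95), ('Charles Babbage', 'person', 0.93), ('London', 'location', 0.8)]
```

"Charles Babbage" falls inside both windows, so it is detected twice and collapsed into one span with the higher score.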

Adaptive Batch Sizing

On GPU, the engine automatically adapts batch sizes to the available VRAM. If out-of-memory errors occur, the batch size is reduced and processing continues. If GPU memory issues persist, the engine falls back to CPU automatically and will attempt to return to GPU after a cooldown period.
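
A simplified model of this behaviour is shown below. It is a sketch under stated assumptions, not the engine's implementation: the batch size is halved on each out-of-memory error, MemoryError stands in for a CUDA OOM, the limit of three consecutive failures is an illustrative default, and the cooldown-based return to GPU is omitted. The names process_with_backoff and run_batch are hypothetical.

```python
def process_with_backoff(items, run_batch, max_batch_size=32,
                         cpu_batch_size=16, max_consecutive_ooms=3):
    """Process items in batches, shrinking the batch on OOM and falling back to CPU.

    run_batch(batch, device) returns one result per item or raises MemoryError.
    Returns (results, device), where device is the one in use at the end.
    """
    results, device = [], "cuda"
    batch_size, consecutive_ooms, i = max_batch_size, 0, 0
    while i < len(items):
        batch = items[i:i + batch_size]
        try:
            results.extend(run_batch(batch, device))
        except MemoryError:
            if device == "cpu":
                raise  # no further fallback in this sketch
            consecutive_ooms += 1
            if consecutive_ooms >= max_consecutive_ooms:
                device, batch_size = "cpu", cpu_batch_size
            else:
                batch_size = max(1, batch_size // 2)
            continue  # retry the same items with the new settings
        consecutive_ooms = 0
        i += len(batch)
    return results, device


def small_gpu(batch, device):
    """Fake model: the GPU only fits 8 items at a time."""
    if device == "cuda" and len(batch) > 8:
        raise MemoryError("simulated CUDA out of memory")
    return [item.upper() for item in batch]


def broken_gpu(batch, device):
    """Fake model: every GPU batch fails, CPU always works."""
    if device == "cuda":
        raise MemoryError("simulated CUDA out of memory")
    return [item.upper() for item in batch]


docs = [f"doc{i}" for i in range(20)]
results_a, device_a = process_with_backoff(docs, small_gpu)
results_b, device_b = process_with_backoff(docs, broken_gpu)
print(len(results_a), device_a)  # 20 cuda
print(len(results_b), device_b)  # 20 cpu
```

In the first run the batch size settles at 8 after two failures and processing stays on the GPU; in the second, three consecutive failures trigger the CPU fallback.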

VRAM Requirements

A GPU with 4 GB of VRAM is sufficient for small batch sizes. 8 GB or more is recommended for production workloads with larger batch sizes.

Troubleshooting

Model Download Fails on First Startup

The NER service downloads the GLiNER2 model from Hugging Face on first launch. If the container has no internet access, pre-download the model and mount it into the container:

bash
# On a machine with internet access
python3 -c "from gliner2 import GLiNER2; GLiNER2.from_pretrained('fastino/gliner2-base-v1')"

# The model is cached in ~/.cache/huggingface/hub/
# Mount that directory into the container
docker run -v ~/.cache/huggingface:/root/.cache/huggingface ...

Alternatively, set NER_MODEL_NAME to a local path where the model weights are mounted.

CUDA Out-of-Memory (OOM) Errors

Symptoms: log messages containing CUDA out of memory or RuntimeError: CUDA error.

Actions:

  • Reduce NER_CUDA_MAX_TEXT_LENGTH to send smaller chunks (try 4000).
  • Reduce NER_MAX_BATCH_SIZE to 1.
  • Increase NER_MAX_CONSECUTIVE_OOMS if you want the engine to tolerate more OOMs before falling back.
  • If the GPU has limited VRAM (less than 4 GB), set NER_DEVICE=cpu to avoid OOM entirely.

The engine automatically falls back to CPU after repeated OOMs and will attempt to return to CUDA after a cooldown period.

NER Results Are Empty

Check that:

  1. The share has enable_ner_analysis: true in its rules.
  2. The file has completed text extraction (status = completed).
  3. The confidence threshold is not set too high -- try lowering ner_confidence_threshold to 0.5 temporarily.
  4. The NER service is running and reachable from the worker service (check GET /ner/status).

Reprocessing After Schema Change

If you change a share's NER schema or entity types, existing results are not automatically updated. Use the reanalyze endpoint with force=true to reprocess all files with the new configuration:

bash
curl -s -X POST "$NEO_URL/ner/shares/{share_id}/reanalyze?force=true" \
  -H "Authorization: Bearer $TOKEN"