
Entity Extractor Agent

Extract named entities (people, places, organizations) from text for structured memory.

Overview

The Entity Extractor Agent identifies and extracts named entities from text, populating the entity memory type with structured information. It enables agents to build knowledge graphs and track relationships between entities across conversations.

When to Use

  • Extract entities from conversation messages (ON_WRITE trigger)
  • Extract entities from AI responses (POST_LLM trigger)
  • Parse documents for entities (on-demand)
  • Build CRM-style knowledge graphs
  • Enable entity-centric queries and recommendations

Dual-Mode Operation

Mode 1: Regex-Based Heuristics (No LLM)

  • Extracts capitalized words (potential proper nouns)
  • Finds @mentions
  • Extracts email addresses
  • Simple heuristics for classification
  • Output: {"people": ["@john"], "organizations": ["Acme Corp"], "locations": ["NYC"]}

Mode 2: LLM-Powered NER

  • Uses LangChain with structured JSON output
  • Accurate named entity recognition
  • Semantic classification
  • Handles ambiguous cases
  • Output: {"people": ["John Smith"], "organizations": ["Acme Corporation"], "locations": ["New York City"]}

API Methods

extract_entities

Extract entities from text.

async def extract_entities(
    text: str
) -> dict[str, list[str]]

Parameters:

  • text: The text to extract entities from

Returns: Dictionary with keys people, organizations, locations (each a list of strings, max 10 per type)

Example:

from memharness import MemoryHarness
from memharness.agents import EntityExtractorAgent
from langchain.chat_models import init_chat_model

async with MemoryHarness("sqlite:///memory.db") as harness:
    # Heuristic mode
    agent_basic = EntityExtractorAgent(harness)
    entities = await agent_basic.extract_entities(
        "Dr. Chen works at MIT in Cambridge"
    )
    # Output: {"people": [], "organizations": ["Mit"], "locations": ["Chen"]}

    # LLM mode (accurate)
    llm = init_chat_model("gpt-4o-mini")
    agent_smart = EntityExtractorAgent(harness, llm=llm)
    entities = await agent_smart.extract_entities(
        "Dr. Chen works at MIT in Cambridge"
    )
    # Output: {"people": ["Dr. Chen"], "organizations": ["MIT"], "locations": ["Cambridge"]}

run

Execute the entity extractor agent (standard agent interface).

async def run(
    text: str,
    **kwargs
) -> dict[str, Any]

Parameters:

  • text: The text to extract entities from
  • **kwargs: Additional arguments (ignored)

Returns: Dictionary with entities and total_extracted keys

Example:

result = await agent.run(text="John works at Acme Corp in NYC")
# Returns: {
#     "entities": {
#         "people": ["John"],
#         "organizations": ["Acme Corp"],
#         "locations": ["NYC"]
#     },
#     "total_extracted": 3
# }
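The total_extracted count is simply the sum of the per-type list lengths. A minimal sketch of the return-shape packaging (an assumption about run()'s internals; the summarize helper is hypothetical, not part of memharness):

```python
from typing import Any

def summarize(entities: dict[str, list[str]]) -> dict[str, Any]:
    # Package extraction results in the run() return shape
    return {
        "entities": entities,
        "total_extracted": sum(len(v) for v in entities.values()),
    }

print(summarize({"people": ["John"], "organizations": ["Acme Corp"], "locations": ["NYC"]}))
# total_extracted is 3: one entity per type
```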

Implementation Details

Heuristic Mode

The heuristic mode uses regex patterns:

import re

def _heuristic_extraction(self, text: str) -> dict[str, list[str]]:
    # Extract capitalized words (potential proper nouns)
    capitalized = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)

    # Extract @mentions
    mentions = re.findall(r"@([a-zA-Z0-9_]+)", text)

    # Extract email addresses (potential people)
    emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text)
    email_names = [email.split("@")[0].replace(".", " ").title() for email in emails]

    # Combine and deduplicate
    people = list(set(mentions + email_names))
    potential_entities = list(set(capitalized))

    # Simple heuristic: short single words likely locations, multi-word likely orgs
    locations = [e for e in potential_entities if len(e.split()) == 1 and len(e) <= 5]
    organizations = [e for e in potential_entities if len(e.split()) > 1]

    return {
        "people": people[:10],  # Limit to 10 per type
        "organizations": organizations[:10],
        "locations": locations[:10],
    }

Advantages:

  • Instant execution
  • No LLM costs
  • Works offline
  • Good for structured text (@mentions, emails)

Limitations:

  • Low accuracy on ambiguous cases
  • Cannot distinguish entity types reliably
  • Misses non-capitalized entities
  • No semantic understanding
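The last two limitations are easy to see with the capitalized-word pattern itself; a self-contained sketch (the capitalized_spans helper is illustrative, not part of the agent):

```python
import re

def capitalized_spans(text: str) -> list[str]:
    # Same pattern as the heuristic mode: runs of capitalized words
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)

# Lowercase names and all-caps acronyms are both missed entirely
print(capitalized_spans("john met the iphone team at IBM"))  # -> []

# Well-cased proper nouns are found, but with no type information
print(capitalized_spans("Acme Corp in Boston"))  # -> ["Acme Corp", "Boston"]
```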

LLM Mode

The LLM mode uses structured output:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Create prompt
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a named entity recognition system. Extract people, "
     "organizations, and locations from the text. Return JSON with "
     'keys: "people", "organizations", "locations" (each a list of strings).'),
    ("user", "{text}")
])

# Build chain with JSON output parser
parser = JsonOutputParser()
chain = prompt | self.llm | parser

# Extract entities
result = await chain.ainvoke({"text": text})
entities = {
    "people": result.get("people", [])[:10],
    "organizations": result.get("organizations", [])[:10],
    "locations": result.get("locations", [])[:10],
}

Advantages:

  • High accuracy
  • Semantic understanding
  • Handles ambiguous cases
  • Proper entity classification

Limitations:

  • Requires LLM API access
  • Incurs API costs
  • Slower than heuristic mode
  • May hallucinate entities
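A cheap guard against hallucinated entities (an illustrative post-filter, not part of the agent) is to keep only names that appear verbatim in the source text:

```python
def drop_unsupported(entities: dict[str, list[str]], text: str) -> dict[str, list[str]]:
    """Keep only entities whose surface form appears in the source text.

    Crude hallucination guard: exact case-insensitive substring match,
    so legitimate paraphrases ("NYC" for "New York") are dropped too.
    """
    lowered = text.lower()
    return {
        etype: [e for e in names if e.lower() in lowered]
        for etype, names in entities.items()
    }

result = {"people": ["Dr. Chen"], "organizations": ["MIT", "Harvard"], "locations": []}
print(drop_unsupported(result, "Dr. Chen works at MIT in Cambridge"))
# "Harvard" is dropped: it never appears in the text
```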

Fallback Strategy

The LLM mode automatically falls back to heuristics on errors:

try:
    result = await chain.ainvoke({"text": text})
    return {
        "people": result.get("people", [])[:10],
        "organizations": result.get("organizations", [])[:10],
        "locations": result.get("locations", [])[:10],
    }
except Exception:
    # Fall back to heuristic extraction on any error
    return self._heuristic_extraction(text)

Integration Patterns

1. ON_WRITE Trigger (Automatic Extraction)

Extract entities from every message:

from memharness import MemoryHarness
from memharness.agents import EntityExtractorAgent

# Map plural result keys to singular entity types
# (note: "people".rstrip('s') would NOT yield "person")
SINGULAR = {"people": "person", "organizations": "organization", "locations": "location"}

async def add_message_with_entities(thread_id: str, role: str, content: str):
    """Add message and automatically extract entities."""
    # Add to conversational memory
    await harness.add_conversational(thread_id, role, content)

    # Extract entities
    agent = EntityExtractorAgent(harness, llm=llm)
    entities = await agent.extract_entities(content)

    # Store extracted entities
    for entity_type, entity_list in entities.items():
        for entity_name in entity_list:
            await harness.add_entity(
                name=entity_name,
                entity_type=SINGULAR[entity_type],  # "people" → "person"
                description=f"Mentioned in conversation: {content[:100]}"
            )

2. POST_LLM Trigger (Extract from AI Responses)

Extract entities from agent responses:

SINGULAR = {"people": "person", "organizations": "organization", "locations": "location"}

async def agent_loop():
    """Main agent loop with entity extraction."""
    while True:
        user_input = input("User: ")

        # Get AI response
        response = await llm.ainvoke(user_input)

        # Extract entities from the response
        agent = EntityExtractorAgent(harness, llm=llm)
        entities = await agent.extract_entities(response.content)

        # Store entities
        for entity_type, entity_list in entities.items():
            for entity_name in entity_list:
                await harness.add_entity(
                    name=entity_name,
                    entity_type=SINGULAR[entity_type],
                    description="Mentioned in agent response"
                )

        print(f"AI: {response.content}")

3. Batch Extraction (On-Demand)

Extract entities from multiple documents:

SINGULAR = {"people": "person", "organizations": "organization", "locations": "location"}

async def batch_extract_entities(documents: list[str]):
    """Extract entities from a batch of documents."""
    agent = EntityExtractorAgent(harness, llm=llm)

    all_entities = {"people": set(), "organizations": set(), "locations": set()}

    for doc in documents:
        entities = await agent.extract_entities(doc)
        all_entities["people"].update(entities["people"])
        all_entities["organizations"].update(entities["organizations"])
        all_entities["locations"].update(entities["locations"])

    # Store unique entities
    for entity_type, entity_set in all_entities.items():
        for entity_name in entity_set:
            await harness.add_entity(
                name=entity_name,
                entity_type=SINGULAR[entity_type],
                description="Extracted from document corpus"
            )

    return {k: list(v) for k, v in all_entities.items()}
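The loop above awaits one document at a time. In LLM mode each call is network-bound, so a gather-based variant can overlap them; this sketch stubs out the extractor (extract_stub is a stand-in for agent.extract_entities, not a real API):

```python
import asyncio

async def extract_stub(doc: str) -> dict[str, list[str]]:
    # Stand-in for agent.extract_entities(doc)
    await asyncio.sleep(0)
    return {"people": [], "organizations": [doc.split()[0]], "locations": []}

async def batch_extract_concurrent(documents: list[str]) -> dict[str, list[str]]:
    # Run all extractions concurrently instead of sequentially
    results = await asyncio.gather(*(extract_stub(d) for d in documents))
    merged = {"people": set(), "organizations": set(), "locations": set()}
    for entities in results:
        for etype in merged:
            merged[etype].update(entities[etype])
    return {k: sorted(v) for k, v in merged.items()}

print(asyncio.run(batch_extract_concurrent(["Acme builds widgets", "Globex ships them"])))
```

With a real LLM backend, consider bounding concurrency (e.g. an asyncio.Semaphore) to stay under API rate limits.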

4. Tool-Called (Inside Loop)

Expose as a LangChain tool:

from langchain_core.tools import tool

@tool
async def extract_entities_tool(text: str) -> dict:
    """Extract named entities from text."""
    agent = EntityExtractorAgent(harness, llm=llm)
    entities = await agent.extract_entities(text)
    return entities

# Agent can call this tool
agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[extract_entities_tool, ...],
)

Configuration

YAML Configuration

agents:
  entity_extractor:
    enabled: true
    llm: gpt-4o-mini

    # Trigger on every write
    trigger: on_write

    # Entity types to extract
    entity_types:
      - person
      - organization
      - location
      - concept
      - system

    # Automatic storage
    auto_store: true

Python Configuration

from memharness.agents import EntityExtractorAgent
from langchain.chat_models import init_chat_model

# Basic initialization
agent = EntityExtractorAgent(harness)

# With LLM for accurate NER
llm = init_chat_model("gpt-4o-mini")
agent = EntityExtractorAgent(harness, llm=llm)

# Extract entities
entities = await agent.extract_entities("Dr. Chen works at MIT")

Best Practices

1. Use LLM Mode for Production

# Heuristic mode is inaccurate — use for prototyping only
agent_heuristic = EntityExtractorAgent(harness) # Low accuracy

# LLM mode for production
llm = init_chat_model("gpt-4o-mini") # Fast + cheap
agent_production = EntityExtractorAgent(harness, llm=llm)

2. Deduplicate Entities

async def add_entity_safe(name: str, entity_type: str, description: str):
    """Add entity only if it doesn't already exist."""
    # Search for existing entity
    existing = await harness.search_entity(name, entity_type=entity_type, k=1)

    if existing and existing[0].metadata.get("entity_name") == name:
        # Entity already exists; reuse it
        print(f"Entity '{name}' already exists")
        return existing[0].id
    else:
        # Add new entity
        return await harness.add_entity(name, entity_type, description)
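Exact-name matching still misses near-duplicates such as "Dr. Chen" vs "Chen". A simple normalization key (illustrative; not provided by memharness) tightens deduplication:

```python
import re

HONORIFICS = {"dr", "mr", "mrs", "ms", "prof"}

def entity_key(name: str) -> str:
    """Normalize an entity name into a deduplication key."""
    # Strip punctuation, lowercase, and drop common honorifics
    words = re.sub(r"[.,]", " ", name).lower().split()
    words = [w for w in words if w not in HONORIFICS]
    return " ".join(words)

print(entity_key("Dr. Chen"))    # -> "chen"
print(entity_key("Acme Corp."))  # -> "acme corp"
```

Comparing entity_key(name) values instead of raw names lets "Dr. Chen" and "Chen" collapse into one entity.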

3. Enrich Entities Over Time

async def enrich_entity(entity_name: str, new_info: str):
    """Append information to an existing entity."""
    # Find entity
    results = await harness.search_entity(entity_name, k=1)
    if not results:
        return

    entity = results[0]

    # Append new information
    updated_description = f"{entity.content}\n{new_info}"

    # Update (implementation-specific)
    await harness.update_entity(
        entity_id=entity.id,
        description=updated_description
    )

4. Track Entity Relationships

async def extract_and_link_entities(text: str, thread_id: str):
    """Extract entities and track their relationships."""
    agent = EntityExtractorAgent(harness, llm=llm)
    entities = await agent.extract_entities(text)

    # Link each person to each co-mentioned organization
    for person in entities["people"]:
        for org in entities["organizations"]:
            await harness.add_entity(
                name=person,
                entity_type="person",
                description=f"Mentioned in thread {thread_id}",
                relationships=[{"target": org, "type": "associated_with"}]
            )

    # Link each organization to each co-mentioned location
    for org in entities["organizations"]:
        for location in entities["locations"]:
            await harness.add_entity(
                name=org,
                entity_type="organization",
                description=f"Mentioned in thread {thread_id}",
                relationships=[{"target": location, "type": "located_in"}]
            )

5. Use Cheap Models

# Entity extraction is simple — use cheap models
llm = init_chat_model("gpt-4o-mini") # Claude Haiku or GPT-4o-mini
agent = EntityExtractorAgent(harness, llm=llm)

# Don't use expensive models for NER
# llm = init_chat_model("gpt-4o") # ❌ Overkill and expensive

Extended Entity Types

Beyond the default three types, you can extract custom entity types:

# Extend the system prompt for custom entities
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Extract entities from text. Return JSON with keys: "
     '"people", "organizations", "locations", "products", "technologies", "events".'),
    ("user", "{text}")
])

chain = prompt | llm | JsonOutputParser()
result = await chain.ainvoke({"text": "Apple released iPhone 15 at WWDC 2023"})
# Output: {
#     "people": [],
#     "organizations": ["Apple"],
#     "locations": [],
#     "products": ["iPhone 15"],
#     "technologies": [],
#     "events": ["WWDC 2023"]
# }

Next Steps