
Entity Extractor Agent

Extract named entities (people, places, organizations) from text for structured memory.

Overview

The Entity Extractor Agent identifies and extracts named entities from text, populating the entity memory type with structured information. It enables agents to build knowledge graphs and track relationships between entities across conversations.

When to Use

  • Extract entities from conversation messages (ON_WRITE trigger)
  • Extract entities from AI responses (POST_LLM trigger)
  • Parse documents for entities (on-demand)
  • Build CRM-style knowledge graphs
  • Enable entity-centric queries and recommendations

Dual-Mode Operation

Mode 1: Regex-Based Heuristics (No LLM)

  • Extracts capitalized words (potential proper nouns)
  • Finds @mentions
  • Extracts email addresses
  • Simple heuristics for classification
  • Output: {"people": ["@john"], "organizations": ["Acme Corp"], "locations": ["NYC"]}

Mode 2: LLM-Powered NER

  • Uses LangChain with structured JSON output
  • Accurate named entity recognition
  • Semantic classification
  • Handles ambiguous cases
  • Output: {"people": ["John Smith"], "organizations": ["Acme Corporation"], "locations": ["New York City"]}

API Methods

extract_entities

Extract entities from text.

async def extract_entities(
    text: str
) -> dict[str, list[str]]

Parameters:

  • text: The text to extract entities from

Returns: Dictionary with keys people, organizations, locations (each a list of strings, max 10 per type)

Example:

from memharness import MemoryHarness
from memharness.agents import EntityExtractorAgent
from langchain.chat_models import init_chat_model

async with MemoryHarness("sqlite:///memory.db") as harness:
    # Heuristic mode
    agent_basic = EntityExtractorAgent(harness)
    entities = await agent_basic.extract_entities(
        "Dr. Chen works at MIT in Cambridge"
    )
    # Output: {"people": [], "organizations": ["Mit"], "locations": ["Chen"]}

    # LLM mode (accurate)
    llm = init_chat_model("gpt-4o-mini")
    agent_smart = EntityExtractorAgent(harness, llm=llm)
    entities = await agent_smart.extract_entities(
        "Dr. Chen works at MIT in Cambridge"
    )
    # Output: {"people": ["Dr. Chen"], "organizations": ["MIT"], "locations": ["Cambridge"]}

run

Execute the entity extractor agent (standard agent interface).

async def run(
    text: str,
    **kwargs
) -> dict[str, Any]

Parameters:

  • text: The text to extract entities from
  • **kwargs: Additional arguments (ignored)

Returns: Dictionary with entities and total_extracted keys

Example:

result = await agent.run(text="John works at Acme Corp in NYC")
# Returns: {
#     "entities": {
#         "people": ["John"],
#         "organizations": ["Acme Corp"],
#         "locations": ["NYC"]
#     },
#     "total_extracted": 3
# }
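The total_extracted count is simply the sum of the per-type list lengths. A minimal sketch of the return-shape packaging (an assumption about run()'s internals; the summarize helper is hypothetical, not part of memharness):

```python
from typing import Any

def summarize(entities: dict[str, list[str]]) -> dict[str, Any]:
    # Package extraction results in the run() return shape
    return {
        "entities": entities,
        "total_extracted": sum(len(v) for v in entities.values()),
    }

print(summarize({"people": ["John"], "organizations": ["Acme Corp"], "locations": ["NYC"]}))
# total_extracted is 3: one entity per type
```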

Implementation Details

Heuristic Mode

The heuristic mode uses regex patterns:

import re

def _heuristic_extraction(self, text: str) -> dict[str, list[str]]:
    # Extract capitalized words (potential proper nouns)
    capitalized = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)

    # Extract @mentions
    mentions = re.findall(r"@([a-zA-Z0-9_]+)", text)

    # Extract email addresses (potential people)
    emails = re.findall(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", text)
    email_names = [email.split("@")[0].replace(".", " ").title() for email in emails]

    # Combine and deduplicate
    people = list(set(mentions + email_names))
    potential_entities = list(set(capitalized))

    # Simple heuristic: short single words likely locations, multi-word likely orgs
    locations = [e for e in potential_entities if len(e.split()) == 1 and len(e) <= 5]
    organizations = [e for e in potential_entities if len(e.split()) > 1]

    return {
        "people": people[:10],  # Limit to 10 per type
        "organizations": organizations[:10],
        "locations": locations[:10],
    }

Advantages:

  • Instant execution
  • No LLM costs
  • Works offline
  • Good for structured text (@mentions, emails)

Limitations:

  • Low accuracy on ambiguous cases
  • Cannot distinguish entity types reliably
  • Misses non-capitalized entities
  • No semantic understanding
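The last two limitations are easy to see with the capitalized-word pattern itself; a self-contained sketch (the capitalized_spans helper is illustrative, not part of the agent):

```python
import re

def capitalized_spans(text: str) -> list[str]:
    # Same pattern as the heuristic mode: runs of capitalized words
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)

# Lowercase names and all-caps acronyms are both missed entirely
print(capitalized_spans("john met the iphone team at IBM"))  # -> []

# Well-cased proper nouns are found, but with no type information
print(capitalized_spans("Acme Corp in Boston"))  # -> ["Acme Corp", "Boston"]
```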

LLM Mode

The LLM mode uses structured output:

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Create prompt
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a named entity recognition system. Extract people, "
     "organizations, and locations from the text. Return JSON with "
     'keys: "people", "organizations", "locations" (each a list of strings).'),
    ("user", "{text}")
])

# Build chain with JSON output parser
parser = JsonOutputParser()
chain = prompt | self.llm | parser

# Extract entities
result = await chain.ainvoke({"text": text})
entities = {
    "people": result.get("people", [])[:10],
    "organizations": result.get("organizations", [])[:10],
    "locations": result.get("locations", [])[:10],
}

Advantages:

  • High accuracy
  • Semantic understanding
  • Handles ambiguous cases
  • Proper entity classification

Limitations:

  • Requires LLM API access
  • Incurs API costs
  • Slower than heuristic mode
  • May hallucinate entities
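A cheap guard against hallucinated entities (an illustrative post-filter, not part of the agent) is to keep only names that appear verbatim in the source text:

```python
def drop_unsupported(entities: dict[str, list[str]], text: str) -> dict[str, list[str]]:
    """Keep only entities whose surface form appears in the source text.

    Crude hallucination guard: exact case-insensitive substring match,
    so legitimate paraphrases ("NYC" for "New York") are dropped too.
    """
    lowered = text.lower()
    return {
        etype: [e for e in names if e.lower() in lowered]
        for etype, names in entities.items()
    }

result = {"people": ["Dr. Chen"], "organizations": ["MIT", "Harvard"], "locations": []}
print(drop_unsupported(result, "Dr. Chen works at MIT in Cambridge"))
# "Harvard" is dropped: it never appears in the text
```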

Fallback Strategy

The LLM mode automatically falls back to heuristics on errors:

try:
    result = await chain.ainvoke({"text": text})
    return {
        "people": result.get("people", [])[:10],
        "organizations": result.get("organizations", [])[:10],
        "locations": result.get("locations", [])[:10],
    }
except Exception:
    # Fall back to heuristic extraction on any error
    return self._heuristic_extraction(text)

Integration Patterns

1. ON_WRITE Trigger (Automatic Extraction)

Extract entities from every message:

from memharness import MemoryHarness
from memharness.agents import EntityExtractorAgent

# Map plural result keys to singular entity types
# (note: "people".rstrip('s') would NOT yield "person")
SINGULAR = {"people": "person", "organizations": "organization", "locations": "location"}

async def add_message_with_entities(thread_id: str, role: str, content: str):
    """Add message and automatically extract entities."""
    # Add to conversational memory
    await harness.add_conversational(thread_id, role, content)

    # Extract entities
    agent = EntityExtractorAgent(harness, llm=llm)
    entities = await agent.extract_entities(content)

    # Store extracted entities
    for entity_type, entity_list in entities.items():
        for entity_name in entity_list:
            await harness.add_entity(
                name=entity_name,
                entity_type=SINGULAR[entity_type],  # "people" → "person"
                description=f"Mentioned in conversation: {content[:100]}"
            )

2. POST_LLM Trigger (Extract from AI Responses)

Extract entities from agent responses:

SINGULAR = {"people": "person", "organizations": "organization", "locations": "location"}

async def agent_loop():
    """Main agent loop with entity extraction."""
    while True:
        user_input = input("User: ")

        # Get AI response
        response = await llm.ainvoke(user_input)

        # Extract entities from the response
        agent = EntityExtractorAgent(harness, llm=llm)
        entities = await agent.extract_entities(response.content)

        # Store entities
        for entity_type, entity_list in entities.items():
            for entity_name in entity_list:
                await harness.add_entity(
                    name=entity_name,
                    entity_type=SINGULAR[entity_type],
                    description="Mentioned in agent response"
                )

        print(f"AI: {response.content}")

3. Batch Extraction (On-Demand)

Extract entities from multiple documents:

SINGULAR = {"people": "person", "organizations": "organization", "locations": "location"}

async def batch_extract_entities(documents: list[str]):
    """Extract entities from a batch of documents."""
    agent = EntityExtractorAgent(harness, llm=llm)

    all_entities = {"people": set(), "organizations": set(), "locations": set()}

    for doc in documents:
        entities = await agent.extract_entities(doc)
        all_entities["people"].update(entities["people"])
        all_entities["organizations"].update(entities["organizations"])
        all_entities["locations"].update(entities["locations"])

    # Store unique entities
    for entity_type, entity_set in all_entities.items():
        for entity_name in entity_set:
            await harness.add_entity(
                name=entity_name,
                entity_type=SINGULAR[entity_type],
                description="Extracted from document corpus"
            )

    return {k: list(v) for k, v in all_entities.items()}
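The loop above awaits one document at a time. In LLM mode each call is network-bound, so a gather-based variant can overlap them; this sketch stubs out the extractor (extract_stub is a stand-in for agent.extract_entities, not a real API):

```python
import asyncio

async def extract_stub(doc: str) -> dict[str, list[str]]:
    # Stand-in for agent.extract_entities(doc)
    await asyncio.sleep(0)
    return {"people": [], "organizations": [doc.split()[0]], "locations": []}

async def batch_extract_concurrent(documents: list[str]) -> dict[str, list[str]]:
    # Run all extractions concurrently instead of sequentially
    results = await asyncio.gather(*(extract_stub(d) for d in documents))
    merged = {"people": set(), "organizations": set(), "locations": set()}
    for entities in results:
        for etype in merged:
            merged[etype].update(entities[etype])
    return {k: sorted(v) for k, v in merged.items()}

print(asyncio.run(batch_extract_concurrent(["Acme builds widgets", "Globex ships them"])))
```

With a real LLM backend, consider bounding concurrency (e.g. an asyncio.Semaphore) to stay under API rate limits.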

4. Tool-Called (Inside Loop)

Expose as a LangChain tool:

from langchain_core.tools import tool

@tool
async def extract_entities_tool(text: str) -> dict:
    """Extract named entities from text."""
    agent = EntityExtractorAgent(harness, llm=llm)
    entities = await agent.extract_entities(text)
    return entities

# Agent can call this tool
agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[extract_entities_tool, ...],
)

Configuration

YAML Configuration

agents:
  entity_extractor:
    enabled: true
    llm: gpt-4o-mini

    # Trigger on every write
    trigger: on_write

    # Entity types to extract
    entity_types:
      - person
      - organization
      - location
      - concept
      - system

    # Automatic storage
    auto_store: true

Python Configuration

from memharness.agents import EntityExtractorAgent
from langchain.chat_models import init_chat_model

# Basic initialization
agent = EntityExtractorAgent(harness)

# With LLM for accurate NER
llm = init_chat_model("gpt-4o-mini")
agent = EntityExtractorAgent(harness, llm=llm)

# Extract entities
entities = await agent.extract_entities("Dr. Chen works at MIT")

Best Practices

1. Use LLM Mode for Production

# Heuristic mode is inaccurate — use for prototyping only
agent_heuristic = EntityExtractorAgent(harness) # Low accuracy

# LLM mode for production
llm = init_chat_model("gpt-4o-mini") # Fast + cheap
agent_production = EntityExtractorAgent(harness, llm=llm)

2. Deduplicate Entities

async def add_entity_safe(name: str, entity_type: str, description: str):
    """Add entity only if it doesn't already exist."""
    # Search for existing entity
    existing = await harness.search_entity(name, entity_type=entity_type, k=1)

    if existing and existing[0].metadata.get("entity_name") == name:
        # Entity already exists; reuse it
        print(f"Entity '{name}' already exists")
        return existing[0].id
    else:
        # Add new entity
        return await harness.add_entity(name, entity_type, description)
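Exact-name matching still misses near-duplicates such as "Dr. Chen" vs "Chen". A simple normalization key (illustrative; not provided by memharness) tightens deduplication:

```python
import re

HONORIFICS = {"dr", "mr", "mrs", "ms", "prof"}

def entity_key(name: str) -> str:
    """Normalize an entity name into a deduplication key."""
    # Strip punctuation, lowercase, and drop common honorifics
    words = re.sub(r"[.,]", " ", name).lower().split()
    words = [w for w in words if w not in HONORIFICS]
    return " ".join(words)

print(entity_key("Dr. Chen"))    # -> "chen"
print(entity_key("Acme Corp."))  # -> "acme corp"
```

Comparing entity_key(name) values instead of raw names lets "Dr. Chen" and "Chen" collapse into one entity.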

3. Enrich Entities Over Time

async def enrich_entity(entity_name: str, new_info: str):
    """Append information to an existing entity."""
    # Find entity
    results = await harness.search_entity(entity_name, k=1)
    if not results:
        return

    entity = results[0]

    # Append new information
    updated_description = f"{entity.content}\n{new_info}"

    # Update (implementation-specific)
    await harness.update_entity(
        entity_id=entity.id,
        description=updated_description
    )

4. Track Entity Relationships

async def extract_and_link_entities(text: str, thread_id: str):
    """Extract entities and track their relationships."""
    agent = EntityExtractorAgent(harness, llm=llm)
    entities = await agent.extract_entities(text)

    # Link each person to each co-mentioned organization
    for person in entities["people"]:
        for org in entities["organizations"]:
            await harness.add_entity(
                name=person,
                entity_type="person",
                description=f"Mentioned in thread {thread_id}",
                relationships=[{"target": org, "type": "associated_with"}]
            )

    # Link each organization to each co-mentioned location
    for org in entities["organizations"]:
        for location in entities["locations"]:
            await harness.add_entity(
                name=org,
                entity_type="organization",
                description=f"Mentioned in thread {thread_id}",
                relationships=[{"target": location, "type": "located_in"}]
            )

5. Use Cheap Models

# Entity extraction is simple — use cheap models
llm = init_chat_model("gpt-4o-mini") # Claude Haiku or GPT-4o-mini
agent = EntityExtractorAgent(harness, llm=llm)

# Don't use expensive models for NER
# llm = init_chat_model("gpt-4o") # ❌ Overkill and expensive

Extended Entity Types

Beyond the default three types, you can extract custom entity types:

# Extend the system prompt for custom entities
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "Extract entities from text. Return JSON with keys: "
     '"people", "organizations", "locations", "products", "technologies", "events".'),
    ("user", "{text}")
])

chain = prompt | llm | JsonOutputParser()
result = await chain.ainvoke({"text": "Apple released iPhone 15 at WWDC 2023"})
# Output: {
#     "people": [],
#     "organizations": ["Apple"],
#     "locations": [],
#     "products": ["iPhone 15"],
#     "technologies": [],
#     "events": ["WWDC 2023"]
# }

Next Steps