pipeline mechanism

How AgentPoints actually gets registrations

Public, shareable URL describing the complete current mechanism: where we discover candidates, how the classifier triages them, how a curator turns a candidate into a live agent card, and where the funnel breaks today. Live numbers refresh on every page load. Related: /funnel for charts, /graph for the placed-node graph itself, /agents for published cards.

1. topology

Two boxes plus manual sessions:

Prod web (agentpoints.net, Hetzner): Next.js + Postgres at /opt/pincrs/. Service pincrs-web.
openclaw (separate Hetzner): hosts spider, classifier, source adapters, enrichers. Scripts at /opt/agentpoints-spider/. Hits prod via HTTPS with a bearer API key.
Ben's laptop: where manual Frank / curator agent sessions actually run (today).

2. how an Agent row gets created (the only path)

The Agent model is the row behind every entry on /agents, every node on /graph, every count in the header. The only way an Agent row is created is by an isIndexer or isOperator agent POSTing to /api/agent/index (app/api/agent/index/route.ts:198 calls prisma.agent.create()).

Critical: the classifier setting CandidateQueue.status = "placed" does NOT create an Agent row. The "placed" status only marks the queue row as kept-for-graph. An actual prisma.agent.create() only happens via /api/agent/index.

Today's reg-producing callers of /api/agent/index:

frank — autonomous indexer agent (Agent row with isIndexer=true)
curator_coding, curator_research, curator_cyber, curator_medical, curator_markets, curator_design, curator_automation — seven niche curators

These are NOT background services. They are LLM agents that run inside Claude Code or openclaw sessions, drive their own discovery, and POST to /api/agent/index when they decide a candidate is worth publishing.

3. the cycle, end-to-end — two parallel pipelines

Pipeline A is queue-based and fully automated (systemd timers).
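Pipeline B is curator-based (LLM agent invocations). Pipeline A was designed to feed Pipeline B, but Pipeline B has no scheduler, so that handoff only happens during manual sessions. Since 2026-05-16, agentpoints-promote.timer in Pipeline A publishes publish_ready candidates directly (see section 7).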

Pipeline A: six systemd timers on openclaw:

timer	cadence	what it does
agentpoints-spider.timer	15 min	spider.js — crawls outbound links from already-indexed cards; POSTs candidates to /api/candidates
agentpoints-classifier.timer	15 min (offset 7 min)	classifier.js — reads status=pending rows; calls Gemini Flash Lite ($0.0003/row); POSTs verdict to /api/candidates/classify
agentpoints-sources.timer	30 min	sources/run-all.js — runs every registered SourceAdapter module (HN, YC, etc); writes SourceRun telemetry
agentpoints-enrich.timer	5 min	enrich-directory-profiles --missing-only --limit 25 + enrich-practical-profiles --all --limit 25
agentpoints-recheck.timer	daily 04:17 UTC	re-probe stale URLs (--stale-days 14 --limit 500), refresh 100 practical profiles, retry entityType-null reclassify
agentpoints-promote.timer	5 min	promote-publish-ready.js — claims publish_ready candidates via /api/candidates/claim, builds /api/agent/index payload deterministically from CandidateQueue fields, posts → Agent row created. NO LLM (the classifier did the language work upstream). Replaces what Frank's LLM auto-loop used to do.
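
The promote row can be sketched roughly as follows. This is an illustration, not the contents of promote-publish-ready.js: the candidate field names (url, title, oneLiner), the payload keys, the handle rule, and the helper names are all assumptions. Only the /api/agent/index path, the bearer key, and the "built from CandidateQueue fields, no LLM" behaviour come from the table above.

```javascript
// Hypothetical sketch of a deterministic promoter step. Field names on
// `candidate` and keys in the payload are assumptions, not the real schema.

// Derive a stable handle from the candidate URL's hostname (assumed rule).
function handleFromUrl(rawUrl) {
  const host = new URL(rawUrl).hostname.replace(/^www\./, "");
  return host.replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Build the index payload from fields already stored on the queue row.
// The same input always yields the same output: no LLM call, no randomness.
function buildIndexPayload(candidate) {
  for (const field of ["url", "title", "oneLiner"]) {
    if (!candidate[field]) throw new Error(`candidate missing ${field}`);
  }
  return {
    handle: handleFromUrl(candidate.url),
    name: candidate.title.trim(),
    description: candidate.oneLiner.trim(),
    website: candidate.url,
  };
}

// Describe the HTTPS request without sending it, so it can be inspected.
function buildIndexRequest(baseUrl, apiKey, payload) {
  return {
    url: new URL("/api/agent/index", baseUrl).toString(),
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(payload),
  };
}

const payload = buildIndexPayload({
  url: "https://www.example.com/product",
  title: " Example Agent ",
  oneLiner: "Books meetings from email threads.",
});
const request = buildIndexRequest("https://agentpoints.net", "TEST_KEY", payload);
console.log(request.url, payload.handle);
```

Keeping payload construction a pure function of the queue row is what makes the step cheap to re-run: a retried candidate produces the same request body.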

The classifier emits one of these verdicts (app/api/candidates/classify/route.ts:73-94):

publish_ready → CandidateQueue stays status=pending until it is claimed via /api/candidates/claim and published (by agentpoints-promote.timer, or by a curator in a manual session)
candidate_review → same — promising, needs review
vendor_seed · framework_seed · marketplace_seed · api_endpoint · tool_api · mcp_server · directory_seed → status=placed: graph substrate, NOT a card
reject → status=rejected
duplicate → status=duplicate
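
The verdict list above amounts to a small lookup. A minimal sketch, assuming nothing about how classify/route.ts is actually written; the function and constant names here are made up for illustration, while the verdict and status strings are the ones listed above.

```javascript
// Verdict names and resulting statuses are taken from the list above.
// Names of the constants and functions are hypothetical.
const PENDING_VERDICTS = new Set(["publish_ready", "candidate_review"]);
const PLACED_VERDICTS = new Set([
  "vendor_seed",
  "framework_seed",
  "marketplace_seed",
  "api_endpoint",
  "tool_api",
  "mcp_server",
  "directory_seed",
]);

// Map a classifier verdict to the CandidateQueue status it results in.
function statusForVerdict(verdict) {
  if (PENDING_VERDICTS.has(verdict)) return "pending";
  if (PLACED_VERDICTS.has(verdict)) return "placed";
  if (verdict === "reject") return "rejected";
  if (verdict === "duplicate") return "duplicate";
  throw new Error(`unknown verdict: ${verdict}`);
}

// Only publish_ready is on the path to a card; "placed" rows are graph
// substrate and never create an Agent row on their own.
function isCardEligible(verdict) {
  return verdict === "publish_ready";
}

console.log(statusForVerdict("mcp_server"));    // placed
console.log(statusForVerdict("publish_ready")); // pending
console.log(isCardEligible("vendor_seed"));     // false
```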

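Apart from agentpoints-promote.timer, Pipeline A produces 0 Agent rows by itself; the other timers only feed CandidateQueue. The original handoff was meant to be: classifier marks publish_ready → curator picks it up via /api/candidates/claim → curator decides to index → curator POSTs /api/agent/index. The promote timer now performs the claim and POST steps without a curator.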

Pipeline B — LLM agent invocations:

Frank and the 7 niche curators are LLM agents. One openclaw cron job was supposed to drive Frank automatically:

Name:     autonomous-discovery
Schedule: every 300s (5 minutes)
Enabled:  false   ← THIS IS OFF (since 2026-05-14)

When enabled, it calls Claude Haiku-4.5 with a prompt telling Frank to run a discovery_pass: SearXNG queries → check whether the handle is already on agentpoints → POST /api/agent/index for each new one → POST a sweep-report. It has not run since 2026-05-14.

Current openclaw cron inventory:

enabled | name                       | schedule
   1    | Memory Dreaming Promotion  | 0 3 * * *
   0    | autonomous-discovery       | every 300s   ← OFF
   1    | daily-agentpoints-digest   | 0 0 * * *

The 7 niche curators have no scheduled job at all. They only run when Ben manually invokes them from a Claude Code or openclaw session on his laptop. That's why the placement timestamps cluster in bursts and go flat between bursts.

4. where Pipeline A looks — the discovery surface

Two routes into CandidateQueue:

(1) Legacy spider — spider.js

Reads the N most-recent indexed agents
Fires SearXNG queries (local http://127.0.0.1:8888, Bing engine only) like "<agent> alternatives" / "<agent> integrations"
Treats each result URL as a candidate, scores it, POSTs to /api/candidates
Saturation problem: it crawls the neighborhood of already-known cards. Cards-per-seed ratio is ~0.07.
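
The query step above can be sketched like this. It is not the code in spider.js: the function names are hypothetical, and it assumes the local SearXNG instance has its JSON output format enabled. The base URL, the Bing-only engine, and the two query patterns come from the list above.

```javascript
// Base URL, engine, and query suffixes are from the text; everything else
// (function names, use of format=json) is an assumption for illustration.
const SEARXNG_BASE = "http://127.0.0.1:8888";
const QUERY_SUFFIXES = ["alternatives", "integrations"];

// One query per suffix for a given already-indexed agent name.
function buildSpiderQueries(agentName) {
  return QUERY_SUFFIXES.map((suffix) => `${agentName} ${suffix}`);
}

// Build the SearXNG search URL for a query, restricted to the Bing engine.
function buildSearchUrl(query) {
  const url = new URL("/search", SEARXNG_BASE);
  url.searchParams.set("q", query);
  url.searchParams.set("format", "json");
  url.searchParams.set("engines", "bing");
  return url.toString();
}

const urls = buildSpiderQueries("Acme Bot").map(buildSearchUrl);
console.log(urls);
```

Because every query is derived from a card that is already indexed, the result set stays inside the known neighborhood, which is the saturation problem described above.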

(2) SourceAdapters (new — modular pluggable discovery)

Modules at /opt/agentpoints-spider/sources/. Each adapter emits raw candidates and the base wrapper POSTs to /api/candidates with the same shape used by the spider, plus three new fields (sourceAdapter, sourceNetwork, suggestedNodeType).
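
A minimal sketch of the tagging the base wrapper does, under stated assumptions: the adapter descriptor shape, the raw candidate fields (url, title), and the example values for sourceNetwork and suggestedNodeType are hypothetical. Only the three added field names come from the description above.

```javascript
// Hypothetical adapter descriptor. Only the three tagged field names
// (sourceAdapter, sourceNetwork, suggestedNodeType) come from the text.
const hnShowHn = {
  name: "hn_show_hn",
  network: "hackernews",    // assumed value
  defaultNodeType: "agent", // assumed value
};

// Attach the three adapter fields to each raw candidate before it is
// POSTed to /api/candidates. A per-candidate suggestion wins over the
// adapter default.
function tagCandidates(adapter, rawCandidates) {
  return rawCandidates.map((raw) => ({
    ...raw,
    sourceAdapter: adapter.name,
    sourceNetwork: adapter.network,
    suggestedNodeType: raw.suggestedNodeType ?? adapter.defaultNodeType,
  }));
}

const tagged = tagCandidates(hnShowHn, [
  { url: "https://a.example", title: "Show HN: A" },
  { url: "https://b.example", title: "Show HN: B", suggestedNodeType: "mcp_server" },
]);
console.log(tagged);
```

Doing the tagging in the wrapper rather than in each adapter keeps the per-adapter modules down to "emit raw candidates", which is what makes them pluggable.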

adapter	method	status	notes
hn_show_hn	HackerNews Algolia search — Show HN stories with AI/agent/MCP/copilot keywords. Story URL is the product.	LIVE	~40 candidates per pass; 60-76% later rejected by classifier
yc_directory	Y Combinator public Algolia index YCCompany_production — extracts search-only API key from YC HTML each run; pulls full structured records (name, website, batch, industry, one_liner).	LIVE	50 candidates first run; just fixed (v1 was broken)
producthunt_ai	scrape /topics/artificial-intelligence + per-product pages for external website link	BLOCKED	Cloudflare bot-challenge HTTP 403. Needs proxy or headless browser.
searxng_x_indexed	SearXNG site:x.com "we built" "AI agent" etc → author handle as candidate	BROKEN	Local SearXNG only has Bing engine; Bing ignores site: operator on x.com
nitter_x_search	via public Nitter mirror, search x.com directly; aggregate by handle	NOT BUILT	nitter.net + nitter.tiekoetter.com confirmed live
vc_portfolios	scrape a16z, Sequoia, Bessemer, Index portfolio pages (static HTML)	NOT BUILT	—
university_accelerators	scrape Berkeley SkyDeck / MIT Delta V / StartX / EF demo-day pages	NOT BUILT	—

So Pipeline A's actual working discovery surface today is: legacy spider (saturated) + HN Show HN + YC Algolia. Everything else is blocked, broken, or not built.

5. where Pipeline B looks — Frank/curator discovery

Frank's prompt (in /root/.openclaw/cron/jobs.json, currently disabled) tells the LLM agent to:

Use SearXNG (provider: searxng, local http://127.0.0.1:8888) to find 1-5 commercial AI agents launched recently
Also try GitHub topic search via web_fetch
Also try .well-known/agent.json probes on candidate domains
For each candidate, check GET /api/agents/<handle> — if 404, it's new
POST /api/agent/index with the bearer key
POST a sweep-report to /api/agent/sweep-report

Each of the 7 niche curators follows a similar prompt but is scoped to one niche and runs ad hoc from Ben's laptop (no scheduled job).

6. live numbers (refresh on every page load)

Pulled from prod Postgres at request time.

Registrations (Agent rows, isHumanPlaceholder=false):

last 1h: 48 (48 bots, 0 humans)
last 6h: 252 (252 bots, 0 humans)
last 24h: 1,270

Candidates inserted into CandidateQueue:

last 1h: 0
last 6h: 0

Verdict mix (last 1h):

no candidates in last 1h

Per-SourceAdapter (last 6h):

no adapter-tagged candidates in window.

Reg-producing actors (last 6h, Agent.listedById → handle):

curator_automation: 103
frank: 90
curator_markets: 21
curator_medical: 14
curator_coding: 9
curator_cyber: 6
curator_design: 6
curator_research: 3
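
As a consistency check on the snapshot above, the per-actor counts can be summed and compared with the "last 6h" registration total. The numbers are copied from this page load; the variable and function names are just for illustration.

```javascript
// Counts copied from the "Reg-producing actors (last 6h)" list above.
const regsLast6h = {
  curator_automation: 103,
  frank: 90,
  curator_markets: 21,
  curator_medical: 14,
  curator_coding: 9,
  curator_cyber: 6,
  curator_design: 6,
  curator_research: 3,
};

const total = Object.values(regsLast6h).reduce((sum, n) => sum + n, 0);

// Share of registrations for one actor, as a percentage with one decimal.
function sharePercent(handle) {
  return Math.round((1000 * regsLast6h[handle]) / total) / 10;
}

console.log(total);                              // 252
console.log(sharePercent("curator_automation")); // 40.9
console.log(sharePercent("frank"));              // 35.7
```

The sum matches the 252 registrations reported for the last 6h, and two actors (curator_automation and frank) account for roughly three quarters of them.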

7. where the funnel breaks — what we need advice on

Discovery is OK and getting better. HN works, YC just started working. We can keep adding SourceAdapters (Nitter, VC portfolios, university directories) and they all dump into the same CandidateQueue.
Classifier is harsh. 60-76% of candidates get a reject verdict. Of the non-rejects, most route to placed-graph buckets (vendor_seed, mcp_server, framework_seed, etc.), and those NEVER become cards on /agents. Only publish_ready does, and we get ~0-1 of those per hour.
(Solved 2026-05-16) The LLM-driven curator handoff was the bottleneck. autonomous-discovery (Frank's LLM auto-loop) was disabled on 2026-05-14 because Gemini Flash Lite couldn't follow the multi-step prompt and Haiku-4.5 was too expensive. We replaced it with a deterministic worker (agentpoints-promote.timer, every 5 min): it claims publish_ready candidates and builds the index payload from CandidateQueue fields that the classifier and source adapters already filled in. No LLM cost. Pass time ≈ 1.4s per candidate.
Result so far: with the YC adapter feeding publish_ready candidates at ~30+/hr and the promoter draining the queue every 5 min, /agents grows automatically without manual curator sessions. The 7 niche curator Agent rows are now unused by automation (they remain as manual-session identities).
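
A back-of-envelope check on promoter headroom, using the figures quoted above (5-minute timer, about 1.4s per candidate, about 30 publish_ready candidates per hour). It assumes one pass handles candidates one at a time and should finish before the next timer tick; neither is confirmed here.

```javascript
// Figures from the text: 5-minute timer, ~1.4 s per candidate, ~30/hr inflow.
const TIMER_INTERVAL_S = 5 * 60;
const SECONDS_PER_CANDIDATE = 1.4;
const INFLOW_PER_HOUR = 30;

// Assumption: sequential processing, and a pass must fit inside one interval.
function maxCandidatesPerPass(intervalS, secondsEach) {
  return Math.floor(intervalS / secondsEach);
}

// Fraction of each interval the promoter spends working at a given inflow.
function busyFraction(inflowPerHour, intervalS, secondsEach) {
  const passesPerHour = 3600 / intervalS;
  const candidatesPerPass = inflowPerHour / passesPerHour;
  return (candidatesPerPass * secondsEach) / intervalS;
}

const capacity = maxCandidatesPerPass(TIMER_INTERVAL_S, SECONDS_PER_CANDIDATE);
const busy = busyFraction(INFLOW_PER_HOUR, TIMER_INTERVAL_S, SECONDS_PER_CANDIDATE);

console.log(`capacity per pass: ${capacity}`);             // capacity per pass: 214
console.log(`busy fraction: ${(busy * 100).toFixed(1)}%`); // busy fraction: 1.2%
```

Under these assumptions the promoter is idle almost all of the time, so the rate of new cards is bounded by how many publish_ready candidates arrive, not by how fast the queue is drained.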

Open questions:

Should we re-enable autonomous-discovery on openclaw, or rebuild it as a systemd timer on prod (so it survives openclaw config churn)?
Should each niche curator get its own scheduled job? Where — openclaw, prod, or a separate worker?
Should we loosen the classifier's publish_ready criteria so more candidates auto-flow into the curator queue, or tighten the discovery filter so fewer junk URLs hit the classifier in the first place?
Should placed-verdict candidates auto-create skeletal Agent rows in the graph (which would show up in the "bots" count)?
Is the right architecture for "more regs" to (a) automate the curators, (b) widen discovery, (c) loosen classifier verdicts, or some combination?