fix: is_nullword("-") now correctly returns True#173

Open
leonhandreke wants to merge 1 commit into main from leonhandreke/fix-nullword-punctuation

@leonhandreke
Contributor

What

is_nullword("-") was returning False even though "-" is listed in NULLWORDS. Same for "---", "?", and any other punctuation-only entry.

Why

When normalize=False, is_nullword was still loading the lookup set by running every NULLWORDS entry through the default normalize_text normalizer, which calls category_replace with SLUG_CATEGORIES. That maps all punctuation to whitespace, so once whitespace is squashed "-" becomes "" and is silently dropped from the set. The un-normalized form "-" is then looked up in a set that no longer contains it, so the lookup always misses.
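
The failure mode described above can be reproduced with a small stand-alone sketch. Everything here is a hypothetical stand-in: `slug_like_normalizer` only approximates what a punctuation-to-whitespace normalizer does, and the `NULLWORDS` sample is illustrative, not rigour's actual list or implementation.

```python
import unicodedata
from typing import Callable, Optional, Set

# Hypothetical sample; the real NULLWORDS list lives in rigour.
NULLWORDS = ["-", "---", "?", "n/a", "unknown"]


def slug_like_normalizer(text: str) -> Optional[str]:
    """Stand-in normalizer: casefold, map punctuation, symbols, separators
    and control characters to spaces, then squash whitespace.
    Returns None when nothing is left."""
    chars = []
    for ch in text.casefold():
        major = unicodedata.category(ch)[0]
        chars.append(" " if major in "PSZC" else ch)
    squashed = " ".join("".join(chars).split())
    return squashed or None


def load_set(normalizer: Callable[[str], Optional[str]]) -> Set[str]:
    """Build the lookup set, dropping entries that normalize to nothing."""
    forms = set()
    for word in NULLWORDS:
        norm = normalizer(word)
        if norm is not None:
            forms.add(norm)
    return forms


lookup = load_set(slug_like_normalizer)
print(sorted(lookup))   # the punctuation-only entries are gone
print("-" in lookup)    # the raw form can never match
```

Under these assumptions, every punctuation-only entry normalizes to nothing and never reaches the set, which is why a raw `"-"` cannot be found.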

Nobody noticed in the real world because the deprecated rigour.names.is_nullword passthrough always explicitly passes normalizer=normalize_name, and normalize_name doesn't normalize away punctuation — so both the form and the set survived consistently. The mismatch only bites with the newer rigour.text.is_nullword when using the default normalize_text.

The existing tests were actually asserting the broken behavior (assert not is_nullword("---")), though I'm not quite sure what reasoning was behind that.

Fix

Distinguish the two paths in is_nullword: when normalize=False (the caller says the form is already in its final state), load the lookup set with noop_normalizer so non-normalized entries survive intact. When normalize=True, both the form and the set go through the same normalizer as before.
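
A minimal sketch of the two paths, assuming a simplified signature. The names `is_nullword_sketch`, `slug_like`, `noop` and `load_set`, and the sample `NULLWORDS`, are hypothetical stand-ins for illustration; the actual diff in rigour may be structured differently.

```python
import unicodedata
from typing import Callable, Optional, Set

Normalizer = Callable[[str], Optional[str]]

# Hypothetical sample; the real NULLWORDS list lives in rigour.
NULLWORDS = ["-", "---", "?", "unknown"]


def slug_like(text: str) -> Optional[str]:
    """Stand-in for the default normalizer: punctuation becomes whitespace."""
    chars = [
        " " if unicodedata.category(c)[0] in "PSZC" else c
        for c in text.casefold()
    ]
    return " ".join("".join(chars).split()) or None


def noop(text: str) -> Optional[str]:
    """Stand-in for noop_normalizer: return the text unchanged."""
    return text


def load_set(normalizer: Normalizer) -> Set[str]:
    """Normalize every entry and drop those that normalize to nothing."""
    normalized = (normalizer(word) for word in NULLWORDS)
    return {n for n in normalized if n is not None}


def is_nullword_sketch(
    form: str, normalize: bool = False, normalizer: Normalizer = slug_like
) -> bool:
    if normalize:
        # Form and word list go through the same normalizer.
        norm = normalizer(form)
        return norm is not None and norm in load_set(normalizer)
    # The caller says the form is final: compare against raw entries.
    return form in load_set(noop)


print(is_nullword_sketch("-"))
print(is_nullword_sketch("Unknown", normalize=True))
```

The point of the design is that the form and the lookup set are always prepared the same way: both normalized, or both left untouched.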

Follow-up worth considering

The normalize=True path still uses normalize_text as its default, which means is_nullword(" - ", normalize=True) would still return False — the slug normalizer strips the - before the lookup. A lax normalizer variant that only casefolds and squashes whitespace (without category_replace) would fix this and make a natural default for is_nullword specifically. Let me know if I should do that follow-up.
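
For illustration, such a lax variant might look like the sketch below. `lax_normalizer`, `is_nullword_lax` and the sample `NULLWORDS` are hypothetical names and data, not existing rigour APIs; the sketch only casefolds and squashes whitespace, as proposed above.

```python
from typing import Optional

# Hypothetical sample; the real NULLWORDS list lives in rigour.
NULLWORDS = ["-", "---", "?", "unknown"]


def lax_normalizer(text: str) -> Optional[str]:
    """Casefold and squash whitespace only; punctuation is left alone."""
    squashed = " ".join(text.casefold().split())
    return squashed or None


# Lookup set built with the same lax normalizer that is applied to the form.
LAX_SET = {n for n in map(lax_normalizer, NULLWORDS) if n is not None}


def is_nullword_lax(form: str) -> bool:
    norm = lax_normalizer(form)
    return norm is not None and norm in LAX_SET


print(is_nullword_lax(" - "))
print(is_nullword_lax("  UNKNOWN "))
```

With this variant, padded or differently cased input still matches, while punctuation-only entries survive normalization.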

🤖 Generated with Claude Code

When `normalize=False`, `is_nullword` was loading the nullwords set with
the slug normalizer, which strips all punctuation — so entries like
`-`, `---`, `?` were never inserted and could never match.

The fix distinguishes the two lookup paths: when `normalize=True`, both
the form and the wordlist go through the caller-supplied normalizer as
before; when `normalize=False`, the wordlist is loaded with `noop_normalizer`
so non-normalized entries survive intact. This makes the canonical case
`is_nullword("-")` return True as expected.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@leonhandreke leonhandreke changed the title fix: is_nullword("-") now correctly returns True fix: is_nullword("-") now correctly returns True Apr 14, 2026
@leonhandreke leonhandreke requested a review from pudo April 14, 2026 13:03