
Get SOC classification utils working for SA 576/soc single step prompt #15

Merged
dstewartons merged 56 commits into main from SA-576/SOC-single-step-prompt on Feb 23, 2026

Conversation

@dstewartons
Contributor

@dstewartons dstewartons commented Feb 11, 2026

📌 Pull Request Template

Please complete all sections

✨ Summary

This PR updates soc-classification-utils such that the SOC RAG chain (sa_rag_soc_code) matches the proven SIC implementation, works with LangChain 0.3 / Gemini 2.5, and can be used reliably by survey-assist-api to deliver single-step SOC classification end-to-end.

The changes in this PR were made to support the incoming ONSdigital/survey-assist-api#50, which relies on this repo as a dependency.

📜 Changes Introduced

  • LLM pipeline alignment (feat): Make sa_rag_soc_code an async LCEL chain (prompt | llm + ainvoke), normalise responses via .content + PydanticOutputParser, add FIX_PARSING_PROMPT with retry-on-parse-fail, and restore LLMChain so the legacy one-shot classify path still works.
  • Embedding and deps (chore): Align embedding.py with SIC by using langchain-core / langchain-chroma / langchain-community, and update pyproject.toml / poetry.lock so SOC utils resolve cleanly alongside the Survey Assist stack and Gemini 2.5.
  • Tests and coverage (test): Add tests/test_embedding.py, tests/test_llm.py, and tests/conftest.py, introduce .coveragerc, and wire Makefile targets (embed-tests, utils-tests, all-tests) with a temporary 65% coverage threshold for all-tests.
  • Linting and typing (chore): Fix mypy issues around PydanticOutputParser and LangChain types, clean up examples (e.g. llm_embedding_example.py), and ensure make check-python passes without extra ignores.
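The async LCEL flow with retry-on-parse-fail described above can be sketched in plain Python. This is a stdlib-only illustration, not the real implementation: parse_response stands in for PydanticOutputParser, the fake LLM and prompt text are invented for the demo, and the FIX_PARSING_PROMPT string here is not the actual prompt used in the repo.

```python
import asyncio
import json

# Illustrative stand-in for the FIX_PARSING_PROMPT used on a parse failure
# (the real prompt text lives in soc-classification-utils).
FIX_PARSING_PROMPT = "The previous output was not valid JSON. Return only valid JSON: {raw}"

def parse_response(raw: str) -> dict:
    """Stand-in for PydanticOutputParser: require a JSON object with a soc_code."""
    data = json.loads(raw)
    if "soc_code" not in data:
        raise ValueError("missing soc_code")
    return data

async def classify(llm, prompt: str) -> dict:
    """Mirror the sa_rag_soc_code pattern: ainvoke, parse, one retry on failure."""
    response = await llm(prompt)          # chain.ainvoke(...) in the real code
    try:
        return parse_response(response)   # parses response.content in the real code
    except ValueError:                    # json.JSONDecodeError subclasses ValueError
        # One retry with a fix-parsing prompt, matching the SIC approach.
        fixed = await llm(FIX_PARSING_PROMPT.format(raw=response))
        return parse_response(fixed)

async def demo() -> dict:
    # Fake LLM: first reply is unparseable, the retry returns valid JSON.
    calls = iter(["not json", '{"soc_code": "2314"}'])
    async def flaky_llm(prompt: str) -> str:
        return next(calls)
    return await classify(flaky_llm, "Classify: primary school teacher")

print(asyncio.run(demo()))  # first attempt fails to parse, the single retry succeeds
```

The key behaviours mirrored here are the single retry (not a loop) and catching ValueError for parse errors, both of which the PR aligns with SIC.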

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

1. Install dependencies

poetry install --no-root

2. Static analysis and security checks

make check-python

This runs isort, Black, Ruff, mypy, pylint, and Bandit over the SOC utils codebase.

3. Run SOC utils tests with coverage

make all-tests

This runs pytest across the project with coverage (currently --cov-fail-under=65 for all-tests; higher thresholds are planned once additional tests are added).
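For reference, the Makefile wiring described above is roughly of this shape. The target names and the temporary 65% threshold come from this PR; the exact recipes are an assumption.

```make
# Assumed recipes; target names and the 65% threshold are from this PR.
embed-tests:
	poetry run pytest tests/test_embedding.py

utils-tests:
	poetry run pytest tests/test_llm.py

all-tests:
	poetry run pytest --cov --cov-fail-under=65
```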

4. Smoke integration test (in survey-assist-api)

Repos and branches

  • Clone these repos as sibling directories:
    • survey-assist-api — branch SA-576/SOC-single-step-prompt. Ensure pyproject.toml uses the local path dependency ../soc-classification-utils so this branch is the SOC utils dependency.
    • soc-classification-utils — branch SA-576/SOC-single-step-prompt (this repo).
    • soc-classification-vector-store — use a branch that includes the SOC loading fixes (passes config-style tuples to embed_index, stdlib logging aligned with SIC). See the related soc-classification-vector-store PR for the changes required for the vector store to start correctly with this utils API.
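The "config-style tuples" mentioned above are (package, relative_path) pairs resolved via importlib.resources. A minimal sketch of that loading pattern follows; the function name load_text_from_config matches this PR's data-access layer, but the body and the example tuple are assumptions.

```python
from importlib import resources

def load_text_from_config(resource_ref: tuple[str, str]) -> str:
    """Resolve a (package, relative_path) tuple to text via importlib.resources."""
    package, path = resource_ref
    return resources.files(package).joinpath(path).read_text(encoding="utf-8")

# A stdlib package is used here purely so the example runs; the real call
# sites pass tuples referencing the SOC utils' own data package.
text = load_text_from_config(("json", "__init__.py"))
print(text[:40])
```

Passing tuples rather than path strings keeps resource resolution inside the data-access layer, which is the parity-with-SIC point made in the commit messages below.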

Start the SOC vector store

From the soc-classification-vector-store repo:

poetry install --no-root
make run-vector-store

Start the API

From survey-assist-api:

poetry install --no-root
export SOC_VECTOR_STORE=http://localhost:8089
make run-api

Call the classify endpoint

curl -X POST http://localhost:8080/v1/survey-assist/classify \
  -H "Content-Type: application/json" \
  -d '{
    "llm": "gemini",
    "type": "soc",
    "job_title": "Primary school teacher",
    "job_description": "Teaching maths and English to children aged 5-11",
    "org_description": "Primary school"
  }'

Expect a response with classified: true and a SOC code (e.g. 2314 for primary school teacher).

- sa_rag_soc_code: add optional short_list (from vector store); use LCEL chain and response.content
- ChatVertexAI: location=europe-west1, model_kwargs thinking_budget=0 (mirror SIC, fix empty content)
- ChatOpenAI: model_kwargs max_tokens (mirror SIC)
- Remove debug print(model_name)
…nity

- langchain ^0.3.0, langchain-community ^0.3.0, langchain-google-vertexai, langchain-openai
- numpy, chromadb, sentence-transformers, torch versions aligned for survey-assist-api resolution
- Wrap call in async def run_sa_rag() and asyncio.run(run_sa_rag())
- sa_rag_soc_code is async (SIC alignment)
- sa_rag_soc_code async; use await chain.ainvoke, str(response.content)
- Add parse-failure retry with FIX_PARSING_PROMPT (one attempt, same as SIC)
- Match SIC logging: LLM response debug, parse error/warning messages
- Add pylint disable for broad Exception in embed_index try/except
- Add docstring for run_sa_rag()
- Rename unused returns to _short_list, _prompt
- Use lazy % logging instead of f-strings for logger calls
- Add type ignore for ChatOpenAI api_key arg
- except ValueError for parse errors (match SIC)
- Pylint disables: protected-access (_index_size), unused-argument (include_all), too-many-branches/statements (sa_rag_soc_code)
- search_index(query=...) with type ignores on embed fallback path
- Shorten/wrap long lines for line-length
- Add # type: ignore # Suspect langchain ver bug on pydantic_object=
  for SurveyAssistSocResponse and RagResponse parsers
- Use langchain_chroma.Chroma and langchain_core.documents.Document
  (replace langchain_community / langchain.docstore)
- search_index(query: str) required; callers pass query=job_title or ''
- Relies on langchain-core and langchain-chroma in pyproject for types
- Pin versions to match SIC (0.3.68, 0.2.4)
- Enables mypy-clean embedding.py without type ignores
- Add ClassificationLLM tests: setup, model init, pass-through LLM, default/model name, sa_rag_soc_code, and error cases
- Use fixture classification_llm_with_soc_sa_rag_soc (async) and mock_vertex_ai, mirror SIC structure and naming
- EmbeddingHandler tests: embed from file, search (single/multi), handler initialisation by model name
- Fixture embedding_handler, docstrings with Args/Asserts
@dstewartons dstewartons force-pushed the SA-576/SOC-single-step-prompt branch from a442c8b to fdc15b4 Compare February 11, 2026 15:24
@dstewartons dstewartons changed the title Sa 576/soc single step prompt Get SOC classification utils working for SA 576/soc single step prompt Feb 12, 2026
Contributor

@gibbardsteve gibbardsteve left a comment


A couple of comments/queries. Functionally it looks to be working ok.

I ran testing, which looks good:

  • No SOC vector store: error returned appropriately
  • SOC vector store configured:
    • classified: true, well-formatted response returned in the API
    • classified: false, well-formatted response returned in the API

make all-tests is successful.

Comment thread pyproject.toml Outdated
sentence-transformers = "4.1.0"
transformers = "4.48.0"
chromadb = "^0.4.22"
chromadb = "1.0.12"
Contributor


Adding a dependency for posthog = ">=2.4.0,<6.0.0" will silence the logs the vector store telemetry reports:

Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given

(see vanna-ai/vanna#917 (comment))

Contributor Author

@dstewartons dstewartons Feb 17, 2026


Yes, I wasn't sure, as Posthog was never added to this repo. It seems to be an analytics dependency rather than part of the core classify/RAG path, so I didn't include it in the scope of the task. I've added it in now.

Comment thread pyproject.toml Outdated
#langchain-community = "^0.3.20"
langchain = "0.1.0"
langchain = "^0.3.0"
langchain-core = "0.3.68"
Contributor


We will want to bring these dependencies in line with the recent updates to sic-classification-utils; I can do that as part of my ticket if you prefer?

Contributor Author


No problem, let me update these to match.

soc_codes = self._prompt_candidate_list(
short_list, code_digits=code_digits, candidates_limit=candidates_limit
)
else:
Contributor


I don't think we want this code, I think this was legacy behaviour when the llm was responsible for calling the vector store. I think the SIC code just raises an error.

Contributor Author

@dstewartons dstewartons Feb 17, 2026


I was erring on the side of caution; the idea was to keep "some" old behaviour so that anything that didn't pass a short list would still work and nothing would break. I'm removing the else branch so the RAG path does the same as the SIC llm.py.

logger.exception(err)
logger.warning("Error from LLMChain, exit early")
logger.warning("Error from chain, exit early")
validated_answer_sa = SurveyAssistSocResponse(
Contributor


I don't know if we should be using SocResponse here? It looks like the SIC code does not use the SurveyAssistSicResponse structure.

Contributor Author

@dstewartons dstewartons Feb 17, 2026


We’re actually using SurveyAssistSocResponse here. I didn’t add this, it was already there.

If we wanted to mirror SIC we’d return SocResponse on error and the API would need to handle both types. I was trying to avoid refactoring too much current code to avoid breaking anything. So things like the error-path response type are as they were. Both the success and error paths have always returned SurveyAssistSocResponse. SIC only has one response type (SicResponse), so it uses that for everything. SOC has two: SocResponse (used in e.g. get_soc_code) and SurveyAssistSocResponse for the Survey Assist flow.

Contributor Author


See below, thanks for clarifying, I'll do some changes.

reasoning = (
f"ERROR parse_error=<{parse_error2}>, response=<{response.content}>"
)
validated_answer_sa = SurveyAssistSocResponse(
Contributor


As above with the response structure: align with SIC and remove SurveyAssistSocResponse if it is not required.

Contributor Author

@dstewartons dstewartons Feb 17, 2026


Ok, this will need much refactoring in other areas too, which has knock-on effects for changes elsewhere; working on this. Bear in mind there will be some SOC-only differences, such as:

  • expand_search_terms in sa_rag_soc_code – SOC-specific RAG behaviour for API compatibility - no SIC equivalent needed.
  • code_digits=4 (SOC) / 5 (SIC) – Classification standard: SOC is 4-digit, SIC is 5-digit.
  • get_soc_code signature (extra SOC fields) – Different inputs (e.g. level of education, manage_others) reflect the different questionnaire/API - not something to unify with SIC.
  • titles_limit versus activities_limit – Same kind of “limit” concept, different names/domains (job titles and activities) - intentional.
  • load_soc_hierarchy and get_soc_meta in soc_data_access – The SOC library exposes hierarchy and meta, SIC only has index + structure and builds hierarchy in embedding. So the extra data-access helpers are from the SOC library shape, not a misalignment with SIC.

Contributor


Is there a reason this file is called toy_index? Maybe testing_index would be better?

Contributor Author

@dstewartons dstewartons Feb 17, 2026


It's named that way for consistency with sic-classification-utils.

Co-authored-by: Steve Gibbard <steve.gibbard@ons.gov.uk>
…mbedding_handler)

- One-shot: use LCEL (prompt | llm) and await chain.ainvoke(), parse response.content; remove LLMChain.
- get_soc_code: make async; use f-strings and same parser comment as SIC.
- openai_api_key: use Optional[SecretStr], pass to ChatOpenAI without type ignore.
- Remove embedding_handler from __init__, self.embed, and _load_embedding_handler.
- Add file-level pylint disable (logging-not-lazy, logging-fstring-interpolation, too-many-lines).
- sa_rag_soc_code: same logger.error/logger.warning style (f-strings, error=, response_content=); remove extra pylint disable; PydanticOutputParser comment as SIC.
- _prompt_candidate_list: f-string logger.debug and logger.warning, use include_all in body instead of unused-argument disable.
- Expect VertexAIEmbeddings to be called with model= not model_name= so test matches embedding.py after alignment with sic-classification-utils
- Drop embedding_handler from all ClassificationLLM() calls to match removal from __init__
- Use SecretStr(openai_api_key) in parametrised test so openai_api_key type matches SIC
Use LCEL for one-shot SOC classification and make get_soc_code async, parsing from response.content instead of LLMChain and response['text'].
Switch SOC LLM logging to survey_assist_utils.get_logger with an init logger.info line matching SIC, and align debug and error logging to the same f-string style and error/response_content fields.
Keep openai_api_key as Optional[SecretStr] in the signature but align the docstring to match SIC, and remove embedding_handler from the constructor for parity with SIC.
Tighten up candidate list logging and behaviour in _prompt_candidate_list and sa_rag_soc_code to mirror the equivalent SIC implementation.
Update LOCATION to europe-west2 to match SIC and pass openai_api_key directly in the parametrised LLM model test instead of wrapping it in SecretStr so the SOC tests follow the same pattern as SIC.
- Add CustomVertexAIEmbeddings with SEMANTIC_SIMILARITY for Vertex models
- Use CustomVertexAIEmbeddings for textembedding-/text-embedding- model names
- Add MAX_BATCH_SIZE (5400) and batch add_documents in embed_index
- Add collection_metadata={"hnsw:space": "l2"} to Chroma in _create_vector_store
- Add os.path checks and try/except logging in _create_vector_store when db_dir set
- Add logger.info/debug for embedding model, vector store creation, and config updates
- Fix search_index_multi: join query slice (not str(slice)) and use list[str] signature
- Update test_embedding to expect and patch CustomVertexAIEmbeddings for Vertex models
- Use survey_assist_utils logger and log LLM initialisation details
- Set ChatVertexAI location to europe-west2 and align openai_api_key docstring type
- Tidy PydanticOutputParser setup and error logging to satisfy pylint line-length rules
- Expect CustomVertexAIEmbeddings for Vertex model names
- Patch CustomVertexAIEmbeddings instead of VertexAIEmbeddings in test_embedding
…x ids

- Add four_digit_code and two_digit_code to document metadata (file and index paths)
- Set embedding_config['soc_condensed'] from config lookups only (not soc_index_file)
- Add embed_index parameter logging and 'Dropping existing vector store' log
- Fix index-path document ids to be unique per row (code + title)
- Use slice-only for truncated codes to match SIC and satisfy ruff PLR2004
- Support embedding from SOC hierarchy via load_hierarchy and all_leaf_text()
  when soc is None (load from config files) or when a pre-built SOC is passed
- Keep existing file_object path for flat index (line-oriented stream)
- Add imports: load_soc_structure, load_hierarchy, SocDB; add type hints for docs/ids
- Matches SIC dual-path behaviour: index or hierarchy

Change granted by @lukeroantreeONS
- Move file-object doc building into _docs_ids_from_file_object()
- Move hierarchy doc building into _docs_ids_from_hierarchy()
- Shorten _docs_ids_from_hierarchy docstring for pylint line length
- Behaviour unchanged, lint passes without suppression
- Remove importlib.resources (as_file, files) and tuple resolution from embedding.py
- Use load_soc_hierarchy from soc_data_access for hierarchy loading
- Drop direct use of occupational_classification load_soc_index/load_soc_structure/SocDB
- Align with SIC: single data-access layer owns config tuple resolution
- Remove importlib.resources and tuple resolution from llm.py
- Use get_soc_meta() for __init__ and load_soc_hierarchy() when self.soc is None
- Drop direct use of occupational_classification load_soc_index/load_soc_structure/SocDB/SocMeta
- Align with SIC: data-access layer owns config tuple resolution
- Use load_text_from_config from soc_data_access (accepts config tuple)
- Remove importlib.resources and tuple resolution from prompt.py
- Align with SIC: data-access layer owns config tuple resolution
- Add config_model with LLMConfig, LookupsConfig, FullConfig
- Enables get_config() to return typed config and package-relative tuple lookups
- Matches SIC config shape for consistent patterns across classification utils
- Add module that accepts (package, path) tuples and resolves via importlib.resources
- load_soc_index, load_soc_structure, load_soc_hierarchy, get_soc_meta wrap SOC library
- load_text_from_config(resource_ref) takes tuple like SIC, no path string at call sites
- Single data-access layer for SOC, embedding, llm, prompt delegate here
- embedding.py: Remove importlib.resources and tuple resolution; use load_soc_hierarchy from soc_data_access; docstring for config-style tuple overrides only
- llm.py: Remove importlib.resources and tuple resolution; use get_soc_meta() and load_soc_hierarchy() from soc_data_access; docstring for full index (mirror SIC)
- Align with SIC: single data-access layer owns config tuple resolution
- Replace load_text_from_config(soc_condensed) with load_soc_index(soc_index)
- Remove importlib.resources and tuple resolution; data-access handles config tuple
- One-shot prompt now uses full index like SIC, not condensed text
- Add module: load_soc_index, load_soc_structure, load_soc_hierarchy, get_soc_meta, load_text_from_config
- Accept config-style (package, path) tuples only; resolve via importlib.resources; call SOC library
- Tuple-only public API for strict parity with SIC data-access
- Delete common_utils.py: it only defined load_text_from_config(file_path: str)
- That path-string API was the old way to load config; nothing uses it after switching prompt to soc_data_access
- Removing it leaves tuple-only loading via soc_data_access, matching SIC (no path-string option)
- Split soc_structure path string across two lines (was 101 chars, max 100)
- Import SOC from occupational_classification.hierarchy.soc_hierarchy
- Change self.soc from Optional[pd.DataFrame] to Optional[SOC]
- load_soc_hierarchy returns SOC, fixes assignment and index errors
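One of the commits above introduces MAX_BATCH_SIZE (5400) and batches add_documents calls in embed_index. The batching itself is a simple chunking loop, sketched here against a toy store: the value 5400 comes from the commit message, while the store class is a stand-in for Chroma, not its real API.

```python
MAX_BATCH_SIZE = 5400  # per the commit message; large batches are split before indexing

class ToyStore:
    """Stand-in for the vector store's add_documents interface."""
    def __init__(self):
        self.batches = []
    def add_documents(self, documents, ids):
        self.batches.append((list(documents), list(ids)))

def add_in_batches(store, documents, ids, batch_size=MAX_BATCH_SIZE):
    """Add documents in chunks of batch_size, mirroring the embed_index change."""
    for start in range(0, len(documents), batch_size):
        store.add_documents(
            documents=documents[start:start + batch_size],
            ids=ids[start:start + batch_size],
        )

store = ToyStore()
docs = [f"doc-{i}" for i in range(12000)]
ids = [f"id-{i}" for i in range(12000)]
add_in_batches(store, docs, ids)
print([len(b[0]) for b in store.batches])  # -> [5400, 5400, 1200]
```

Keeping ids aligned with documents in each slice matters here, since the same commits also make index-path document ids unique per row (code + title).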
Contributor

@gibbardsteve gibbardsteve left a comment


Looks good
Pulled all associated branches.
Ran SOC tests for classified - true and classified - false. Looks good.

@dstewartons dstewartons merged commit fe57041 into main Feb 23, 2026
5 checks passed
@dstewartons dstewartons deleted the SA-576/SOC-single-step-prompt branch February 23, 2026 16:43