Get SOC classification utils working for SA 576/soc single step prompt #15
dstewartons merged 56 commits into main from
Conversation
- sa_rag_soc_code: add optional short_list (from vector store) use LCEL chain, response.content - ChatVertexAI: location=europe-west1, model_kwargs thinking_budget=0 (mirror SIC, fix empty content) - ChatOpenAI: model_kwargs max_tokens (mirror SIC) - Remove debug print(model_name)
…nity - langchain ^0.3.0, langchain-community ^0.3.0, langchain-google-vertexai, langchain-openai - numpy, chromadb, sentence-transformers, torch versions aligned for survey-assist-api resolution
- Wrap call in async def run_sa_rag() and asyncio.run(run_sa_rag()) - sa_rag_soc_code is async (SIC alignment)
- sa_rag_soc_code async; use await chain.ainvoke, str(response.content) - Add parse-failure retry with FIX_PARSING_PROMPT (one attempt, same as SIC) - Match SIC logging: LLM response debug, parse error/warning messages
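The async chain with a single parse-retry described in these commits can be sketched as follows. This is a minimal stand-in sketch: `FakeChain` and `FakeResponse` replace the real `prompt | llm` LCEL chain and its AIMessage, and `json.loads` stands in for the `PydanticOutputParser`; only the one-retry shape mirrors the behaviour described above.

```python
import asyncio
import json


class FakeResponse:
    """Stand-in for a LangChain AIMessage: the payload lives on .content."""

    def __init__(self, content: str) -> None:
        self.content = content


class FakeChain:
    """Stand-in for a `prompt | llm` LCEL chain with an async ainvoke()."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = iter(replies)

    async def ainvoke(self, _inputs: dict) -> FakeResponse:
        return FakeResponse(next(self._replies))


async def sa_rag_soc_code(chain: FakeChain) -> dict:
    """One parse attempt, then a single fix-up retry, else an error answer."""
    response = await chain.ainvoke({"job_title": "teacher"})
    try:
        return json.loads(str(response.content))
    except ValueError:
        # Retry once with the fix-parsing prompt (same one-attempt policy as SIC).
        retry = await chain.ainvoke({"llm_output": response.content})
        try:
            return json.loads(str(retry.content))
        except ValueError as parse_error2:
            return {
                "classified": False,
                "reasoning": f"ERROR parse_error=<{parse_error2}>",
            }


# First reply is malformed, the retry parses cleanly: the retry result wins.
result = asyncio.run(sa_rag_soc_code(FakeChain(["not json", '{"classified": true}'])))
print(result)
```

`json.JSONDecodeError` subclasses `ValueError`, which is why catching `ValueError` (matching the SIC code) covers the parse failure here.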
… same way as in sic classification utils
- Add pylint disable for broad Exception in embed_index try/except - Add docstring for run_sa_rag() - Rename unused returns to _short_list, _prompt
- Use lazy % logging instead of f-strings for logger calls - Add type ignore for ChatOpenAI api_key arg - except ValueError for parse errors (match SIC) - Pylint disables: protected-access (_index_size), unused-argument (include_all), too-many-branches/statements (sa_rag_soc_code) - search_index(query=...) with type ignores on embed fallback path - Shorten/wrap long lines for line-length
- Add # type: ignore # Suspect langchain ver bug on pydantic_object= for SurveyAssistSocResponse and RagResponse parsers
- Use langchain_chroma.Chroma and langchain_core.documents.Document (replace langchain_community / langchain.docstore) - search_index(query: str) required; callers pass query=job_title or '' - Relies on langchain-core and langchain-chroma in pyproject for types
- Pin versions to match SIC (0.3.68, 0.2.4) - Enables mypy-clean embedding.py without type ignores
- Add ClassificationLLM tests: setup, model init, pass-through LLM, default/model name, sa_rag_soc_code, and error cases - Use fixture classification_llm_with_soc_sa_rag_soc (async) and mock_vertex_ai, mirror SIC structure and naming
- EmbeddingHandler tests: embed from file, search (single/multi), handler initialisation by model name - Fixture embedding_handler, docstrings with Args/Asserts
a442c8b to fdc15b4
gibbardsteve
left a comment
A couple of comments/queries. Functionally it looks to be working ok.
I ran testing, which looks good:
- No SOC vector store: error returned appropriately
- SOC vector store configured:
  - classified true: well-formatted response returned in API
  - classified false: well-formatted response returned in API
- make all-tests is successful
```diff
 sentence-transformers = "4.1.0"
 transformers = "4.48.0"
-chromadb = "^0.4.22"
+chromadb = "1.0.12"
```
Adding a dependency for `posthog = ">=2.4.0,<6.0.0"` will silence the telemetry errors the vector store logs:
Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given
Yes, I wasn't sure, as Posthog was never added to this repo. It seems like an analytics dependency rather than part of the core classify/RAG path, so I didn't include it in the scope of the task. I've added it now.
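The dependency addition discussed here would look something like this in `pyproject.toml`. This is a sketch: the version range is the one quoted above, and placing it in Poetry's main dependency table is an assumption.

```toml
[tool.poetry.dependencies]
# Silences the Chroma telemetry error:
# "Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given"
posthog = ">=2.4.0,<6.0.0"
```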
```diff
 #langchain-community = "^0.3.20"
-langchain = "0.1.0"
+langchain = "^0.3.0"
 langchain-core = "0.3.68"
```
We will want to bring these dependencies in line with the recent updates to sic-classification-utils; I can do that as part of my ticket if you prefer?
No problem, let me update these to match.
```python
    soc_codes = self._prompt_candidate_list(
        short_list, code_digits=code_digits, candidates_limit=candidates_limit
    )
else:
```
I don't think we want this code, I think this was legacy behaviour when the llm was responsible for calling the vector store. I think the SIC code just raises an error.
I was erring on the side of caution: the idea was to keep "some" of the old behaviour so that anything that didn't pass a short list would still work, and nothing would break. I'm removing the else branch so the RAG path does the same as the SIC llm.py.
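With the else branch removed, the fail-fast shape looks roughly like the sketch below. The function name, error message, and the toy candidate-list helper are all illustrative, not the actual code; only the "no short list means raise, like SIC" behaviour reflects the change discussed above.

```python
def resolve_candidates(short_list, prompt_candidate_list, code_digits=4, candidates_limit=5):
    """Hypothetical fail-fast version: no legacy else branch, just raise."""
    if short_list is None:
        # Mirror the SIC behaviour: a vector-store search result is required.
        raise ValueError("No short_list provided; a vector-store search result is required")
    return prompt_candidate_list(
        short_list, code_digits=code_digits, candidates_limit=candidates_limit
    )


def toy_prompt_candidate_list(short_list, code_digits, candidates_limit):
    # Truncate each code to code_digits (SOC is 4-digit) and cap the list length.
    return [code[:code_digits] for code in short_list[:candidates_limit]]


codes = resolve_candidates(["23140", "23150"], toy_prompt_candidate_list)
print(codes)  # ['2314', '2315']

try:
    resolve_candidates(None, toy_prompt_candidate_list)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```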
```diff
 logger.exception(err)
-logger.warning("Error from LLMChain, exit early")
+logger.warning("Error from chain, exit early")
 validated_answer_sa = SurveyAssistSocResponse(
```
I don't know if we should be using SocResponse here? It looks like the SIC code does not use the SurveyAssistSicResponse structure.
We're actually using SurveyAssistSocResponse here; I didn't add it, it was already there.
If we wanted to mirror SIC we'd return SocResponse on error, and the API would need to handle both types. I was trying to avoid refactoring too much current code so as not to break anything, so things like the error-path response type are as they were. Both the success and error paths have always returned SurveyAssistSocResponse. SIC only has one response type (SicResponse), so it uses that for everything. SOC has two: SocResponse (used in e.g. get_soc_code) and SurveyAssistSocResponse for the Survey Assist flow.
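To make the two-response-type situation concrete, here is a rough sketch with plain dataclasses standing in for the pydantic models. The field names other than `classified` and `reasoning` (which appear in the diff above) are illustrative, not the real model definitions.

```python
from dataclasses import dataclass, field


@dataclass
class SocResponse:
    """Used by e.g. get_soc_code; the field name here is illustrative."""

    codings: list = field(default_factory=list)


@dataclass
class SurveyAssistSocResponse:
    """Used by the Survey Assist flow on both success and error paths."""

    classified: bool = False
    reasoning: str = ""


def error_answer(parse_error: str, content: str) -> SurveyAssistSocResponse:
    # Mirrors the error path above: classified stays False, reasoning carries details.
    return SurveyAssistSocResponse(
        classified=False,
        reasoning=f"ERROR parse_error=<{parse_error}>, response=<{content}>",
    )


answer = error_answer("Invalid JSON", "not json at all")
print(answer.classified, answer.reasoning)
```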
See below, thanks for clarifying, I'll do some changes.
```python
reasoning = (
    f"ERROR parse_error=<{parse_error2}>, response=<{response.content}>"
)
validated_answer_sa = SurveyAssistSocResponse(
```
As above with the response structure: align with SIC and remove SurveyAssistSocResponse if it is not required.
Ok, this will need much refactoring in other areas too, which has knock-on effects for changes elsewhere; working on this. Bear in mind there will be some SOC-only differences, such as:
- expand_search_terms in sa_rag_soc_code – SOC-specific RAG behaviour for API compatibility; no SIC equivalent needed.
- code_digits=4 (SOC) / 5 (SIC) – classification standard: SOC is 4-digit, SIC is 5-digit.
- get_soc_code signature (extra SOC fields) – different inputs (e.g. level of education, manage_others) reflect the different questionnaire/API; not something to unify with SIC.
- titles_limit versus activities_limit – the same kind of "limit" concept with different names/domains (job titles and activities); intentional.
- load_soc_hierarchy and get_soc_meta in soc_data_access – the SOC library exposes hierarchy and meta, while SIC only has index + structure and builds the hierarchy in embedding. The extra data-access helpers follow the SOC library shape; they are not a misalignment with SIC.
Is there a reason this file is called toy_index? Maybe testing_index would be better?
It's named that way for consistency with sic-classification-utils.
Co-authored-by: Steve Gibbard <steve.gibbard@ons.gov.uk>
…mbedding_handler)
- One-shot: use LCEL (prompt | llm) and await chain.ainvoke(), parse response.content; remove LLMChain.
- get_soc_code: make async; use f-strings and the same parser comment as SIC.
- openai_api_key: use Optional[SecretStr], pass to ChatOpenAI without a type ignore.
- Remove embedding_handler from __init__, self.embed, and _load_embedding_handler.
- Add file-level pylint disable (logging-not-lazy, logging-fstring-interpolation, too-many-lines).
- sa_rag_soc_code: same logger.error/logger.warning style (f-strings, error=, response_content=); remove extra pylint disable; PydanticOutputParser comment as SIC.
- _prompt_candidate_list: f-string logger.debug and logger.warning, use include_all in the body instead of an unused-argument disable.
- Expect VertexAIEmbeddings to be called with model= not model_name= so test matches embedding.py after alignment with sic-classification-utils
- Drop embedding_handler from all ClassificationLLM() calls to match removal from __init__ - Use SecretStr(openai_api_key) in parametrised test so openai_api_key type matches SIC
Use LCEL for one-shot SOC classification and make get_soc_code async, parsing from response.content instead of LLMChain and response['text']. Switch SOC LLM logging to survey_assist_utils.get_logger with an init logger.info line matching SIC, and align debug and error logging to the same f-string style and error/response_content fields. Keep openai_api_key as Optional[SecretStr] in the signature but align the docstring to match SIC, and remove embedding_handler from the constructor for parity with SIC. Tighten up candidate list logging and behaviour in _prompt_candidate_list and sa_rag_soc_code to mirror the equivalent SIC implementation.
Update LOCATION to europe-west2 to match SIC and pass openai_api_key directly in the parametrised LLM model test instead of wrapping it in SecretStr so the SOC tests follow the same pattern as SIC.
- Add CustomVertexAIEmbeddings with SEMANTIC_SIMILARITY for Vertex models
- Use CustomVertexAIEmbeddings for textembedding-/text-embedding- model names
- Add MAX_BATCH_SIZE (5400) and batch add_documents in embed_index
- Add collection_metadata={"hnsw:space": "l2"} to Chroma in _create_vector_store
- Add os.path checks and try/except logging in _create_vector_store when db_dir set
- Add logger.info/debug for embedding model, vector store creation, and config updates
- Fix search_index_multi: join query slice (not str(slice)) and use list[str] signature
- Update test_embedding to expect and patch CustomVertexAIEmbeddings for Vertex models
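The batched `add_documents` change listed above (MAX_BATCH_SIZE = 5400) can be sketched like this. A fake store stands in for the Chroma vector store; only the slicing loop reflects the change described, and the batching helper's name is illustrative.

```python
MAX_BATCH_SIZE = 5400  # limit from the commit message above


def batched(items: list, batch_size: int = MAX_BATCH_SIZE):
    """Yield successive slices so each add_documents call stays under the limit."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


class FakeVectorStore:
    """Stand-in for the Chroma store; records the size of each add_documents call."""

    def __init__(self) -> None:
        self.calls: list[int] = []

    def add_documents(self, docs: list) -> None:
        self.calls.append(len(docs))


store = FakeVectorStore()
docs = [f"doc-{i}" for i in range(12_000)]
for batch in batched(docs):
    store.add_documents(batch)
print(store.calls)  # [5400, 5400, 1200]
```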
- Use survey_assist_utils logger and log LLM initialisation details - Set ChatVertexAI location to europe-west2 and align openai_api_key docstring type - Tidy PydanticOutputParser setup and error logging to satisfy pylint line-length rules
- Expect CustomVertexAIEmbeddings for Vertex model names - Patch CustomVertexAIEmbeddings instead of VertexAIEmbeddings in test_embedding
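The `search_index_multi` fix noted above — joining the query slice rather than stringifying a slice — amounts to the sketch below. The function name, the limit of 3, and the space separator are assumptions for illustration.

```python
def combine_queries(query: list[str], limit: int = 3) -> str:
    """Join the first `limit` terms; str(query[:limit]) would embed brackets and quotes."""
    return " ".join(query[:limit])


print(combine_queries(["teacher", "primary school", "education"]))
# teacher primary school education
```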
…x ids
- Add four_digit_code and two_digit_code to document metadata (file and index paths)
- Set embedding_config['soc_condensed'] from config lookups only (not soc_index_file)
- Add embed_index parameter logging and a 'Dropping existing vector store' log
- Fix index-path document ids to be unique per row (code + title)
- Use slice-only for truncated codes to match SIC and satisfy ruff PLR2004
- Support embedding from SOC hierarchy via load_hierarchy and all_leaf_text() when soc is None (load from config files) or when a pre-built SOC is passed - Keep existing file_object path for flat index (line-oriented stream) - Add imports: load_soc_structure, load_hierarchy, SocDB; add type hints for docs/ids - Matches SIC dual-path behaviour: index or hierarchy Change granted by @lukeroantreeONS
- Move file-object doc building into _docs_ids_from_file_object() - Move hierarchy doc building into _docs_ids_from_hierarchy() - Shorten _docs_ids_from_hierarchy docstring for pylint line length - Behaviour unchanged, lint passes without suppression
- Remove importlib.resources (as_file, files) and tuple resolution from embedding.py - Use load_soc_hierarchy from soc_data_access for hierarchy loading - Drop direct use of occupational_classification load_soc_index/load_soc_structure/SocDB - Align with SIC: single data-access layer owns config tuple resolution
- Remove importlib.resources and tuple resolution from llm.py - Use get_soc_meta() for __init__ and load_soc_hierarchy() when self.soc is None - Drop direct use of occupational_classification load_soc_index/load_soc_structure/SocDB/SocMeta - Align with SIC: data-access layer owns config tuple resolution
- Use load_text_from_config from soc_data_access (accepts config tuple) - Remove importlib.resources and tuple resolution from prompt.py - Align with SIC: data-access layer owns config tuple resolution
- Add config_model with LLMConfig, LookupsConfig, FullConfig - Enables get_config() to return typed config and package-relative tuple lookups - Matches SIC config shape for consistent patterns across classification utils
- Add module that accepts (package, path) tuples and resolves via importlib.resources - load_soc_index, load_soc_structure, load_soc_hierarchy, get_soc_meta wrap SOC library - load_text_from_config(resource_ref) takes tuple like SIC, no path string at call sites - Single data-access layer for SOC, embedding, llm, prompt delegate here
- embedding.py: Remove importlib.resources and tuple resolution; use load_soc_hierarchy from soc_data_access; docstring for config-style tuple overrides only - llm.py: Remove importlib.resources and tuple resolution; use get_soc_meta() and load_soc_hierarchy() from soc_data_access; docstring for full index (mirror SIC) - Align with SIC: single data-access layer owns config tuple resolution
- Replace load_text_from_config(soc_condensed) with load_soc_index(soc_index) - Remove importlib.resources and tuple resolution; data-access handles config tuple - One-shot prompt now uses full index like SIC, not condensed text
- Add module: load_soc_index, load_soc_structure, load_soc_hierarchy, get_soc_meta, load_text_from_config - Accept config-style (package, path) tuples only; resolve via importlib.resources; call SOC library - Tuple-only public API for strict parity with SIC data-access
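A minimal sketch of the tuple-only loading API described above, assuming `load_text_from_config` takes a `(package, relative_path)` tuple and resolves it with `importlib.resources`. The demo reads from the stdlib `json` package rather than the SOC data package, and the signature here is an assumption rather than the module's actual one.

```python
from importlib.resources import files


def load_text_from_config(resource_ref: tuple[str, str]) -> str:
    """Resolve a config-style (package, relative_path) tuple to file text."""
    package, path = resource_ref
    return files(package).joinpath(path).read_text(encoding="utf-8")


# Demonstrate with a stdlib package instead of the SOC data package.
text = load_text_from_config(("json", "__init__.py"))
print(text[:40])
```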
- Delete common_utils.py: it only defined load_text_from_config(file_path: str) - That path-string API was the old way to load config; nothing uses it after switching prompt to soc_data_access - Removing it leaves tuple-only loading via soc_data_access, matching SIC (no path-string option)
- Split soc_structure path string across two lines (was 101 chars, max 100)
- Import SOC from occupational_classification.hierarchy.soc_hierarchy - Change self.soc from Optional[pd.DataFrame] to Optional[SOC] - load_soc_hierarchy returns SOC, fixes assignment and index errors
gibbardsteve
left a comment
Looks good
Pulled all associated branches.
Ran SOC tests for classified: true and classified: false. Looks good.
📌 Pull Request Template
✨ Summary
This PR updates `soc-classification-utils` such that the SOC RAG chain (`sa_rag_soc_code`) matches the proven SIC implementation, works with LangChain 0.3 / Gemini 2.5, and can be used reliably by `survey-assist-api` to deliver single-step SOC classification end-to-end.

The changes in this PR have been made to get incoming ONSdigital/survey-assist-api#50 working, which relies on this repo as a dependency.
📜 Changes Introduced
- Make `sa_rag_soc_code` an async LCEL chain (`prompt | llm` + `ainvoke`), normalise responses via `.content` + `PydanticOutputParser`, add `FIX_PARSING_PROMPT` with retry-on-parse-fail, and restore `LLMChain` so the legacy one-shot classify path still works.
- Align `embedding.py` with SIC by using `langchain-core`/`langchain-chroma`/`langchain-community`, and update `pyproject.toml`/`poetry.lock` so SOC utils resolve cleanly alongside the Survey Assist stack and Gemini 2.5.
- Add `tests/test_embedding.py`, `tests/test_llm.py`, and `tests/conftest.py`, introduce `.coveragerc`, and wire `Makefile` targets (`embed-tests`, `utils-tests`, `all-tests`) with a temporary 65% coverage threshold for `all-tests`.
- Type-annotate `PydanticOutputParser` and LangChain types, clean up examples (e.g. `llm_embedding_example.py`), and ensure `make check-python` passes without extra ignores.

✅ Checklist
(`terraform fmt` & `terraform validate`)

🔍 How to Test
1. Install dependencies
2. Static analysis and security checks
This runs isort, Black, Ruff, mypy, pylint, and Bandit over the SOC utils codebase.
3. Run SOC utils tests with coverage
This runs pytest across the project with coverage (currently `--cov-fail-under=65` for `all-tests`; higher thresholds are planned once additional tests are added).

4. Smoke integration test (in `survey-assist-api`)

Repos and branches:
- `survey-assist-api` — branch `SA-576/SOC-single-step-prompt`. Ensure `pyproject.toml` uses the local path dependency `../soc-classification-utils` so this branch is the SOC utils dependency.
- `soc-classification-utils` — branch `SA-576/SOC-single-step-prompt` (this repo).
- `soc-classification-vector-store` — use a branch that includes the SOC loading fixes (passes config-style tuples to `embed_index`, stdlib logging aligned with SIC). See the related soc-classification-vector-store PR for the changes required for the vector store to start correctly with this utils API.

Start the SOC vector store
From the `soc-classification-vector-store` repo:
From `survey-assist-api`:

```shell
poetry install --no-root
export SOC_VECTOR_STORE=http://localhost:8089
make run-api
```

Call the classify endpoint
Expect a response with `classified: true` and a SOC code (e.g. `2314` for primary school teacher).