[#490] Improve AI Evals#499
Merged
Merged
Conversation
…olExecutor inside the container
There was a problem hiding this comment.
Pull request overview
This WIP PR refactors the assistant endpoint into smaller service modules (assistant orchestration + tool loop + prompts) and adds an evaluation script plus unit tests to support ongoing AI evals work for #490.
Changes:
- Extracted assistant orchestration into
assistant_services.pyand tool-loop/retrieval logic intotool_services.py, updating the DRF view to callrun_assistant. - Added an
eval_assistant.pyscript (CSV output) and a small notebook for side-by-side response review. - Added focused unit tests for the new service modules and adjusted an existing uploadFile test import.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| server/api/views/uploadFile/test_title.py | Switches to absolute import for title to make the test importable under pytest. |
| server/api/views/assistant/views.py | Simplifies the APIView by delegating assistant execution to run_assistant. |
| server/api/views/assistant/tool_services.py | Introduces the search tool schema, tool mapping, and the reasoning/tool-call loop helpers. |
| server/api/views/assistant/assistant_services.py | Adds run_assistant orchestrator wiring OpenAI client + tools + loop. |
| server/api/views/assistant/assistant_prompts.py | Moves the long assistant instruction prompt into a dedicated module constant. |
| server/api/views/assistant/eval_assistant.py | Adds a terminal-run evaluation script that runs questions concurrently and writes CSV results. |
| server/api/views/assistant/review.ipynb | Adds a lightweight notebook to compare two eval result CSVs side-by-side. |
| server/api/views/assistant/test_tool_services.py | Adds unit tests for tool mapping, function-call dispatch, and the reasoning loop behavior. |
| server/api/views/assistant/test_assistant_services.py | Adds unit tests validating run_assistant request shaping and previous-response handling. |
| server/api/views/assistant/test_eval_assistant.py | Adds a unit test ensuring eval rows capture exceptions instead of aborting the batch. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
40
to
+42
| message = request.data.get("message", None) | ||
| previous_response_id = request.data.get("previous_response_id", None) | ||
|
|
||
| # Track total duration and cost metrics | ||
| start_time = time.time() | ||
| total_token_usage = {"input_tokens": 0, "output_tokens": 0} | ||
|
|
||
| if not previous_response_id: | ||
| response = client.responses.create( | ||
| input=[ | ||
| {"type": "message", "role": "user", "content": str(message)} | ||
| ], | ||
| **MODEL_DEFAULTS, | ||
| ) | ||
| else: | ||
| response = client.responses.create( | ||
| input=[ | ||
| {"type": "message", "role": "user", "content": str(message)} | ||
| ], | ||
| previous_response_id=str(previous_response_id), | ||
| **MODEL_DEFAULTS, | ||
| ) | ||
|
|
||
| # Accumulate token usage from initial response | ||
| if hasattr(response, "usage"): | ||
| total_token_usage["input_tokens"] += getattr( | ||
| response.usage, "input_tokens", 0 | ||
| ) | ||
| total_token_usage["output_tokens"] += getattr( | ||
| response.usage, "output_tokens", 0 | ||
| ) | ||
|
|
||
| # Open AI Cookbook: Handling Function Calls with Reasoning Models | ||
| # https://cookbook.openai.com/examples/reasoning_function_calls | ||
| while True: | ||
| # Mapping of the tool names we tell the model about and the functions that implement them | ||
| function_responses = invoke_functions_from_response( | ||
| response, tool_mapping={"search_documents": search_documents} | ||
| ) | ||
| if len(function_responses) == 0: # We're done reasoning | ||
| logger.info("Reasoning completed") | ||
| final_response_output_text = response.output_text | ||
| final_response_id = response.id | ||
| logger.info(f"Final response: {final_response_output_text}") | ||
| break | ||
| else: | ||
| logger.info("More reasoning required, continuing...") | ||
| response = client.responses.create( | ||
| input=function_responses, | ||
| previous_response_id=response.id, | ||
| **MODEL_DEFAULTS, | ||
| ) | ||
| # Accumulate token usage from reasoning iterations | ||
| if hasattr(response, "usage"): | ||
| total_token_usage["input_tokens"] += getattr( | ||
| response.usage, "input_tokens", 0 | ||
| ) | ||
| total_token_usage["output_tokens"] += getattr( | ||
| response.usage, "output_tokens", 0 | ||
| ) | ||
|
|
||
| # Calculate total duration and cost metrics | ||
| total_duration = time.time() - start_time | ||
| cost_metrics = calculate_cost_metrics( | ||
| total_token_usage, GPT_5_NANO_PRICING_DOLLARS_PER_MILLION_TOKENS | ||
| ) | ||
|
|
||
| # Log cost and duration metrics | ||
| logger.info( | ||
| f"Request completed: " | ||
| f"Duration: {total_duration:.2f}s, " | ||
| f"Input tokens: {total_token_usage['input_tokens']}, " | ||
| f"Output tokens: {total_token_usage['output_tokens']}, " | ||
| f"Total cost: ${cost_metrics['total_cost']:.6f}" | ||
|
|
Comment on lines
+11
to
+13
| # uv script (or plain Python) to generate results to CSV, run from the terminal | ||
| # Run from inside the container: docker compose exec backend python eval_assistant.py | ||
| # |
Comment on lines
+31
to
+34
| from api.views.assistant.assistant_services import run_assistant | ||
| # TODO: remove unused import or use INSTRUCTIONS to record an instructions_hash column | ||
| from api.views.assistant.assistant_prompts import INSTRUCTIONS | ||
|
|
Comment on lines
+120
to
+122
| """Extract all function calls from the response, look up the corresponding tool function(s) and execute them. | ||
| (This would be a good place to handle asynchroneous tool calls, or ones that take a while to execute.) | ||
| This returns a list of messages to be added to the conversation history. |
Comment on lines
+200
to
+204
| # Mapping of the tool names we tell the model about and the functions that implement them | ||
| function_responses = invoke_functions_from_response(response, tool_mapping) | ||
| if len(function_responses) == 0: # We're done reasoning | ||
| logger.info("Reasoning completed") | ||
| final_response_output_text = response.output_text |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR refactors the assistant endpoint into smaller service modules (assistant orchestration + tool loop + prompts) and adds an evaluation script plus unit tests
Changes:
Related Issue
This PR relates to #490
Manual Tests
Automated Tests
Ran the api/views/assistant/ test suite:
Run the full test suite:
Documentation
Reviewers
Notes