[#490] Improve AI Evals by sahilds1 · Pull Request #499 · CodeForPhilly/balancer-main

sahilds1 · 2026-04-17T16:14:11Z

Description

This PR refactors the assistant endpoint into smaller service modules (assistant orchestration + tool loop + prompts) and adds an evaluation script plus unit tests.

Changes:

Extracted assistant orchestration into assistant_services.py and tool-loop/retrieval logic into tool_services.py, updating the DRF view to call run_assistant.
Added an eval_assistant.py script (CSV output) and a small notebook for side-by-side response review.
Added focused unit tests for the new service modules and adjusted an existing uploadFile test import.
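
With the orchestration extracted, the view only has to pass `message` and `previous_response_id` through to `run_assistant`. A minimal sketch of how the request shaping might look, assuming a client that exposes `responses.create(...)` as the removed view code did. `build_request_kwargs` is a hypothetical helper name used for illustration, not necessarily what `assistant_services.py` contains:

```python
from typing import Any, Optional


def build_request_kwargs(
    message: Any,
    previous_response_id: Optional[str] = None,
    **model_defaults: Any,
) -> dict:
    """Shape keyword arguments for client.responses.create (hypothetical helper)."""
    kwargs = {
        "input": [{"type": "message", "role": "user", "content": str(message)}],
        **model_defaults,
    }
    # Only continue an existing conversation when an id was actually supplied.
    if previous_response_id:
        kwargs["previous_response_id"] = str(previous_response_id)
    return kwargs


if __name__ == "__main__":
    # "example-model" is a placeholder, not the model configured in the PR.
    print(build_request_kwargs("hello", model="example-model"))
```

Building the arguments in one place avoids duplicating the `responses.create` call across an `if`/`else`, and it gives unit tests a pure function to assert against without touching the network.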

Related Issue

This PR relates to #490

Manual Tests

Automated Tests

Ran the api/views/assistant/ test suite:

sahildshah•~/github/balancer-main(490-improve-ai-evals⚡)»   docker compose exec backend pytest api/views/assistant/                                                      [13:25:56]
================================================================================ test session starts ================================================================================
platform linux -- Python 3.11.4, pytest-9.0.2, pluggy-1.6.0
django: version: 4.2.3, settings: balancer_backend.settings (from ini)
rootdir: /usr/src/server
configfile: pytest.ini
plugins: django-4.12.0, anyio-4.12.1
collected 15 items

api/views/assistant/test_assistant_services.py ....                                                                                                                           [ 26%]
api/views/assistant/test_eval_assistant.py .                                                                                                                                  [ 33%]
api/views/assistant/test_tool_services.py ..........                                                                                                                          [100%]

================================================================================ 15 passed in 2.12s =================================================================================

Ran the full test suite:

sahildshah•~/github/balancer-main(490-improve-ai-evals⚡)» docker compose exec backend pytest                                                                             [13:33:09]
================================================================================ test session starts ================================================================================
platform linux -- Python 3.11.4, pytest-9.0.2, pluggy-1.6.0
django: version: 4.2.3, settings: balancer_backend.settings (from ini)
rootdir: /usr/src/server
configfile: pytest.ini
plugins: django-4.12.0, anyio-4.12.1
collected 44 items

api/services/test_embedding_services.py ...................                                                                                                                   [ 43%]
api/views/assistant/test_assistant_services.py ....                                                                                                                           [ 52%]
api/views/assistant/test_eval_assistant.py .                                                                                                                                  [ 54%]
api/views/assistant/test_tool_services.py ..........                                                                                                                          [ 77%]
api/views/uploadFile/test_title.py ..........                                                                                                                                 [100%]

================================================================================ 44 passed in 2.02s =================================================================================

Documentation

Reviewers

Notes


Copilot

Pull request overview

This WIP PR refactors the assistant endpoint into smaller service modules (assistant orchestration + tool loop + prompts) and adds an evaluation script plus unit tests to support ongoing AI evals work for #490.

Changes:

Extracted assistant orchestration into assistant_services.py and tool-loop/retrieval logic into tool_services.py, updating the DRF view to call run_assistant.
Added an eval_assistant.py script (CSV output) and a small notebook for side-by-side response review.
Added focused unit tests for the new service modules and adjusted an existing uploadFile test import.
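
The side-by-side review mentioned above could be approximated outside a notebook as follows. This is a sketch using only the standard library; the column names `question` and `response` are assumptions, since the CSV schema written by `eval_assistant.py` is not shown in this PR page:

```python
import csv
import io


def side_by_side(baseline_csv, candidate_csv, key="question", value="response"):
    """Pair rows from two eval CSVs by question (column names are assumptions)."""
    baseline = {row[key]: row[value] for row in csv.DictReader(baseline_csv)}
    candidate = {row[key]: row[value] for row in csv.DictReader(candidate_csv)}
    # Keep baseline order, then append questions that only the candidate run has.
    questions = list(baseline) + [q for q in candidate if q not in baseline]
    return [
        {key: q, "baseline": baseline.get(q, ""), "candidate": candidate.get(q, "")}
        for q in questions
    ]


if __name__ == "__main__":
    old_run = io.StringIO("question,response\nq1,old answer\n")
    new_run = io.StringIO("question,response\nq1,new answer\n")
    for row in side_by_side(old_run, new_run):
        print(row)
```

In practice the two arguments would be open file handles for the result CSVs of two eval runs.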

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| server/api/views/uploadFile/test_title.py | Switches to absolute import for `title` to make the test importable under pytest. |
| server/api/views/assistant/views.py | Simplifies the APIView by delegating assistant execution to `run_assistant`. |
| server/api/views/assistant/tool_services.py | Introduces the search tool schema, tool mapping, and the reasoning/tool-call loop helpers. |
| server/api/views/assistant/assistant_services.py | Adds `run_assistant` orchestrator wiring OpenAI client + tools + loop. |
| server/api/views/assistant/assistant_prompts.py | Moves the long assistant instruction prompt into a dedicated module constant. |
| server/api/views/assistant/eval_assistant.py | Adds a terminal-run evaluation script that runs questions concurrently and writes CSV results. |
| server/api/views/assistant/review.ipynb | Adds a lightweight notebook to compare two eval result CSVs side-by-side. |
| server/api/views/assistant/test_tool_services.py | Adds unit tests for tool mapping, function-call dispatch, and the reasoning loop behavior. |
| server/api/views/assistant/test_assistant_services.py | Adds unit tests validating `run_assistant` request shaping and previous-response handling. |
| server/api/views/assistant/test_eval_assistant.py | Adds a unit test ensuring eval rows capture exceptions instead of aborting the batch. |
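
The eval behavior summarized above (concurrent questions, CSV output, exceptions captured per row) can be sketched roughly like this. The commit history mentions `run_one` and `ThreadPoolExecutor`, but the signatures, the `ask` callable standing in for `run_assistant`, and the column names below are assumptions made for illustration:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def run_one(question, ask):
    """Run a single question and always return a row, even on failure."""
    try:
        return {"question": question, "response": ask(question), "error": ""}
    except Exception as exc:  # record the failure instead of aborting the batch
        return {"question": question, "response": "", "error": repr(exc)}


def run_batch(questions, ask, out, max_workers=4):
    """Run questions concurrently and write one CSV row per question."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so rows line up with `questions`.
        rows = list(pool.map(lambda q: run_one(q, ask), questions))
    writer = csv.DictWriter(out, fieldnames=["question", "response", "error"])
    writer.writeheader()
    writer.writerows(rows)
    return rows


if __name__ == "__main__":
    buffer = io.StringIO()
    run_batch(["first question", "second question"], str.upper, buffer)
    print(buffer.getvalue())
```

Catching the exception inside `run_one` rather than around the whole batch means one failed API call costs a single row, not the entire run.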


            message = request.data.get("message", None)
            previous_response_id = request.data.get("previous_response_id", None)
-
-            # Track total duration and cost metrics
-            start_time = time.time()
-            total_token_usage = {"input_tokens": 0, "output_tokens": 0}
-
-            if not previous_response_id:
-                response = client.responses.create(
-                    input=[
-                        {"type": "message", "role": "user", "content": str(message)}
-                    ],
-                    **MODEL_DEFAULTS,
-                )
-            else:
-                response = client.responses.create(
-                    input=[
-                        {"type": "message", "role": "user", "content": str(message)}
-                    ],
-                    previous_response_id=str(previous_response_id),
-                    **MODEL_DEFAULTS,
-                )
-
-            # Accumulate token usage from initial response
-            if hasattr(response, "usage"):
-                total_token_usage["input_tokens"] += getattr(
-                    response.usage, "input_tokens", 0
-                )
-                total_token_usage["output_tokens"] += getattr(
-                    response.usage, "output_tokens", 0
-                )
-
-            # Open AI Cookbook: Handling Function Calls with Reasoning Models
-            # https://cookbook.openai.com/examples/reasoning_function_calls
-            while True:
-                # Mapping of the tool names we tell the model about and the functions that implement them
-                function_responses = invoke_functions_from_response(
-                    response, tool_mapping={"search_documents": search_documents}
-                )
-                if len(function_responses) == 0:  # We're done reasoning
-                    logger.info("Reasoning completed")
-                    final_response_output_text = response.output_text
-                    final_response_id = response.id
-                    logger.info(f"Final response: {final_response_output_text}")
-                    break
-                else:
-                    logger.info("More reasoning required, continuing...")
-                    response = client.responses.create(
-                        input=function_responses,
-                        previous_response_id=response.id,
-                        **MODEL_DEFAULTS,
-                    )
-                    # Accumulate token usage from reasoning iterations
-                    if hasattr(response, "usage"):
-                        total_token_usage["input_tokens"] += getattr(
-                            response.usage, "input_tokens", 0
-                        )
-                        total_token_usage["output_tokens"] += getattr(
-                            response.usage, "output_tokens", 0
-                        )
-
-            # Calculate total duration and cost metrics
-            total_duration = time.time() - start_time
-            cost_metrics = calculate_cost_metrics(
-                total_token_usage, GPT_5_NANO_PRICING_DOLLARS_PER_MILLION_TOKENS
-            )
-
-            # Log cost and duration metrics
-            logger.info(
-                f"Request completed: "
-                f"Duration: {total_duration:.2f}s, "
-                f"Input tokens: {total_token_usage['input_tokens']}, "
-                f"Output tokens: {total_token_usage['output_tokens']}, "
-                f"Total cost: ${cost_metrics['total_cost']:.6f}"
+
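
The removed view code passed the accumulated token usage to `calculate_cost_metrics` and logged `cost_metrics['total_cost']`. The helper's body is not part of this diff, so the following is only a sketch of what such a function might do. The `input`/`output` pricing keys, the `input_cost`/`output_cost` result keys, and the prices themselves are assumptions, not the project's actual values:

```python
def calculate_cost_metrics(token_usage, pricing_per_million):
    """Convert token counts to dollars given per-million-token prices (sketch)."""
    input_cost = token_usage["input_tokens"] / 1_000_000 * pricing_per_million["input"]
    output_cost = token_usage["output_tokens"] / 1_000_000 * pricing_per_million["output"]
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


# Placeholder prices for illustration only, NOT real model pricing.
EXAMPLE_PRICING_DOLLARS_PER_MILLION_TOKENS = {"input": 1.0, "output": 2.0}

if __name__ == "__main__":
    usage = {"input_tokens": 500_000, "output_tokens": 250_000}
    metrics = calculate_cost_metrics(usage, EXAMPLE_PRICING_DOLLARS_PER_MILLION_TOKENS)
    print(f"Total cost: ${metrics['total_cost']:.6f}")
```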


+# uv script (or plain Python) to generate results to CSV, run from the terminal
+# Run from inside the container: docker compose exec backend python eval_assistant.py
+# 


+from api.views.assistant.assistant_services import run_assistant
+# TODO: remove unused import or use INSTRUCTIONS to record an instructions_hash column
+from api.views.assistant.assistant_prompts import INSTRUCTIONS
+


+    """Extract all function calls from the response, look up the corresponding tool function(s) and execute them.
+    (This would be a good place to handle asynchronous tool calls, or ones that take a while to execute.)
+    This returns a list of messages to be added to the conversation history.
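
The docstring above describes the dispatch step. A minimal sketch of that behavior, assuming Responses API style items (`type == "function_call"` with `name`, a JSON string in `arguments`, and `call_id`) and replies shaped as `function_call_output` messages. The error handling shown here is an assumption and may differ from what `tool_services.py` does:

```python
import json
from types import SimpleNamespace


def invoke_functions_from_response(response, tool_mapping):
    """Run every function call in `response.output` and collect the tool outputs."""
    messages = []
    for item in response.output:
        if getattr(item, "type", None) != "function_call":
            continue  # skip reasoning items and plain messages
        tool = tool_mapping.get(item.name)
        if tool is None:
            output = f"ERROR: no tool registered for {item.name}"
        else:
            try:
                output = tool(**json.loads(item.arguments))
            except Exception as exc:  # report the failure back to the model
                output = f"ERROR: {exc}"
        messages.append(
            {"type": "function_call_output", "call_id": item.call_id, "output": str(output)}
        )
    return messages


if __name__ == "__main__":
    # Stub response object; a real one would come from client.responses.create.
    call = SimpleNamespace(
        type="function_call", name="search_documents",
        arguments='{"query": "lithium"}', call_id="call_1",
    )
    tools = {"search_documents": lambda query: f"results for {query}"}
    print(invoke_functions_from_response(SimpleNamespace(output=[call]), tools))
```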


+        # Mapping of the tool names we tell the model about and the functions that implement them
+        function_responses = invoke_functions_from_response(response, tool_mapping)
+        if len(function_responses) == 0: # We're done reasoning
+            logger.info("Reasoning completed")
+            final_response_output_text = response.output_text
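
The loop in this hunk keeps calling the model until it stops requesting tools. A self-contained sketch of that control flow, including the token accumulation the old view performed. `run_tool_loop` and its `invoke` parameter are hypothetical names, and the client is assumed to expose `responses.create(...)` as in the removed code:

```python
from types import SimpleNamespace


def run_tool_loop(client, response, invoke, **model_defaults):
    """Call the model until it stops requesting tools (hypothetical helper)."""
    usage = {"input_tokens": 0, "output_tokens": 0}

    def add_usage(resp):
        # Tolerate responses without usage data by falling back to zero.
        resp_usage = getattr(resp, "usage", None)
        usage["input_tokens"] += getattr(resp_usage, "input_tokens", 0)
        usage["output_tokens"] += getattr(resp_usage, "output_tokens", 0)

    add_usage(response)
    while True:
        function_responses = invoke(response)
        if not function_responses:  # no tool calls left: the model is done
            return response.output_text, response.id, usage
        response = client.responses.create(
            input=function_responses,
            previous_response_id=response.id,
            **model_defaults,
        )
        add_usage(response)


if __name__ == "__main__":
    # A response with no pending tool calls returns immediately.
    done = SimpleNamespace(id="resp_1", output_text="answer", usage=None)
    print(run_tool_loop(client=None, response=done, invoke=lambda r: []))
```

A production version would likely also cap the number of iterations so a model that keeps requesting tools cannot loop indefinitely.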


Initial commit

e46017b

sahilds1 self-assigned this Apr 17, 2026

sahilds1 added 8 commits April 20, 2026 13:46

Scaffold implementation

2ad1024

Extract logic so it can be called without going through HTTP endpoint

2a9434a

Scaffold eval_assistant.py to run_assistant concurrently via ThreadPo…

31e22ff

…olExecutor inside the container

Add modifications to assistant tests

64da31c

Merge branch 'develop' into 490-improve-ai-evals

b32fa11

Make AI-eval scaffolding mergeable: valid notebook + run_one tests

fc7b723

Tighten the assistant test suite so it covers only logic we wrote

a3027dd

Fix pytest collection error in uploadFile test_title

84f64c4

sahilds1 requested a review from Copilot June 19, 2026 17:38

Copilot started reviewing on behalf of sahilds1 June 19, 2026 17:38 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Documentation-only follow-ups from the branch review

ecd9a72

sahilds1 changed the title ~~[WIP] [#490] Improve AI Evals~~ [#490] Improve AI Evals Jun 19, 2026

sahilds1 marked this pull request as ready for review June 19, 2026 18:13

sahilds1 merged commit 3ad5a1e into CodeForPhilly:develop Jun 19, 2026
1 check passed

sahilds1 merged 10 commits into CodeForPhilly:develop from sahilds1:490-improve-ai-evals

2 participants
