
[#490] Improve AI Evals #499

Merged: sahilds1 merged 10 commits into CodeForPhilly:develop from sahilds1:490-improve-ai-evals on Jun 19, 2026

@sahilds1 sahilds1 commented Apr 17, 2026

Collaborator

Description

This PR refactors the assistant endpoint into smaller service modules (assistant orchestration + tool loop + prompts) and adds an evaluation script plus unit tests.

Changes:

  • Extracted assistant orchestration into assistant_services.py and tool-loop/retrieval logic into tool_services.py, updating the DRF view to call run_assistant.
  • Added an eval_assistant.py script (CSV output) and a small notebook for side-by-side response review.
  • Added focused unit tests for the new service modules and adjusted an existing uploadFile test import.

Related Issue

This PR relates to #490

Manual Tests

[Screenshot: 2026-06-19 at 1:23:35 PM]

Automated Tests

Ran the api/views/assistant/ test suite:

sahildshah•~/github/balancer-main(490-improve-ai-evals⚡)» docker compose exec backend pytest api/views/assistant/    [13:25:56]
================================================================================ test session starts ================================================================================
platform linux -- Python 3.11.4, pytest-9.0.2, pluggy-1.6.0
django: version: 4.2.3, settings: balancer_backend.settings (from ini)
rootdir: /usr/src/server
configfile: pytest.ini
plugins: django-4.12.0, anyio-4.12.1
collected 15 items

api/views/assistant/test_assistant_services.py ....                                                                                                                           [ 26%]
api/views/assistant/test_eval_assistant.py .                                                                                                                                  [ 33%]
api/views/assistant/test_tool_services.py ..........                                                                                                                          [100%]

================================================================================ 15 passed in 2.12s =================================================================================

Ran the full test suite:

sahildshah•~/github/balancer-main(490-improve-ai-evals⚡)» docker compose exec backend pytest    [13:33:09]
================================================================================ test session starts ================================================================================
platform linux -- Python 3.11.4, pytest-9.0.2, pluggy-1.6.0
django: version: 4.2.3, settings: balancer_backend.settings (from ini)
rootdir: /usr/src/server
configfile: pytest.ini
plugins: django-4.12.0, anyio-4.12.1
collected 44 items

api/services/test_embedding_services.py ...................                                                                                                                   [ 43%]
api/views/assistant/test_assistant_services.py ....                                                                                                                           [ 52%]
api/views/assistant/test_eval_assistant.py .                                                                                                                                  [ 54%]
api/views/assistant/test_tool_services.py ..........                                                                                                                          [ 77%]
api/views/uploadFile/test_title.py ..........                                                                                                                                 [100%]

================================================================================ 44 passed in 2.02s =================================================================================

Documentation

Reviewers

Notes

@sahilds1 sahilds1 self-assigned this Apr 17, 2026

Copilot AI left a comment


Pull request overview

This WIP PR refactors the assistant endpoint into smaller service modules (assistant orchestration + tool loop + prompts) and adds an evaluation script plus unit tests to support ongoing AI evals work for #490.

Changes:

  • Extracted assistant orchestration into assistant_services.py and tool-loop/retrieval logic into tool_services.py, updating the DRF view to call run_assistant.
  • Added an eval_assistant.py script (CSV output) and a small notebook for side-by-side response review.
  • Added focused unit tests for the new service modules and adjusted an existing uploadFile test import.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
| --- | --- |
| server/api/views/uploadFile/test_title.py | Switches to absolute import for title to make the test importable under pytest. |
| server/api/views/assistant/views.py | Simplifies the APIView by delegating assistant execution to run_assistant. |
| server/api/views/assistant/tool_services.py | Introduces the search tool schema, tool mapping, and the reasoning/tool-call loop helpers. |
| server/api/views/assistant/assistant_services.py | Adds run_assistant orchestrator wiring OpenAI client + tools + loop. |
| server/api/views/assistant/assistant_prompts.py | Moves the long assistant instruction prompt into a dedicated module constant. |
| server/api/views/assistant/eval_assistant.py | Adds a terminal-run evaluation script that runs questions concurrently and writes CSV results. |
| server/api/views/assistant/review.ipynb | Adds a lightweight notebook to compare two eval result CSVs side-by-side. |
| server/api/views/assistant/test_tool_services.py | Adds unit tests for tool mapping, function-call dispatch, and the reasoning loop behavior. |
| server/api/views/assistant/test_assistant_services.py | Adds unit tests validating run_assistant request shaping and previous-response handling. |
| server/api/views/assistant/test_eval_assistant.py | Adds a unit test ensuring eval rows capture exceptions instead of aborting the batch. |
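
The eval script described in the table runs questions concurrently, writes CSV, and records failures as rows instead of aborting. A minimal sketch of that pattern follows, using only the standard library. The names `evaluate_question`, `run_batch`, `write_csv`, `fake_assistant`, and the column names are assumptions for illustration; the actual eval_assistant.py may be structured differently.

```python
import csv
import io
import time
from concurrent.futures import ThreadPoolExecutor

FIELDNAMES = ["question", "response", "error", "duration_s"]


def evaluate_question(question, run_fn):
    """Run one question and return a result row; errors become data, not crashes."""
    start = time.perf_counter()
    row = {"question": question, "response": "", "error": ""}
    try:
        row["response"] = run_fn(question)
    except Exception as exc:  # deliberately broad: one bad question must not abort the batch
        row["error"] = f"{type(exc).__name__}: {exc}"
    row["duration_s"] = round(time.perf_counter() - start, 3)
    return row


def run_batch(questions, run_fn, max_workers=4):
    """Evaluate questions concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda q: evaluate_question(q, run_fn), questions))


def write_csv(rows, stream):
    """Write result rows to any text stream (a file or an in-memory buffer)."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def fake_assistant(question):
    """Hypothetical stand-in for run_assistant so the sketch runs offline."""
    if "fail" in question:
        raise ValueError("simulated failure")
    return f"answer to {question}"


rows = run_batch(["q1", "please fail", "q3"], fake_assistant)
buffer = io.StringIO()
write_csv(rows, buffer)
```

Catching the exception inside the per-question function, rather than around the whole batch, is what keeps a single failure from discarding the other results.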


Comment on lines 40 to +42
message = request.data.get("message", None)
previous_response_id = request.data.get("previous_response_id", None)

# Track total duration and cost metrics
start_time = time.time()
total_token_usage = {"input_tokens": 0, "output_tokens": 0}

if not previous_response_id:
    response = client.responses.create(
        input=[
            {"type": "message", "role": "user", "content": str(message)}
        ],
        **MODEL_DEFAULTS,
    )
else:
    response = client.responses.create(
        input=[
            {"type": "message", "role": "user", "content": str(message)}
        ],
        previous_response_id=str(previous_response_id),
        **MODEL_DEFAULTS,
    )

# Accumulate token usage from initial response
if hasattr(response, "usage"):
    total_token_usage["input_tokens"] += getattr(
        response.usage, "input_tokens", 0
    )
    total_token_usage["output_tokens"] += getattr(
        response.usage, "output_tokens", 0
    )

# Open AI Cookbook: Handling Function Calls with Reasoning Models
# https://cookbook.openai.com/examples/reasoning_function_calls
while True:
    # Mapping of the tool names we tell the model about and the functions that implement them
    function_responses = invoke_functions_from_response(
        response, tool_mapping={"search_documents": search_documents}
    )
    if len(function_responses) == 0:  # We're done reasoning
        logger.info("Reasoning completed")
        final_response_output_text = response.output_text
        final_response_id = response.id
        logger.info(f"Final response: {final_response_output_text}")
        break
    else:
        logger.info("More reasoning required, continuing...")
        response = client.responses.create(
            input=function_responses,
            previous_response_id=response.id,
            **MODEL_DEFAULTS,
        )
        # Accumulate token usage from reasoning iterations
        if hasattr(response, "usage"):
            total_token_usage["input_tokens"] += getattr(
                response.usage, "input_tokens", 0
            )
            total_token_usage["output_tokens"] += getattr(
                response.usage, "output_tokens", 0
            )

# Calculate total duration and cost metrics
total_duration = time.time() - start_time
cost_metrics = calculate_cost_metrics(
    total_token_usage, GPT_5_NANO_PRICING_DOLLARS_PER_MILLION_TOKENS
)

# Log cost and duration metrics
logger.info(
    f"Request completed: "
    f"Duration: {total_duration:.2f}s, "
    f"Input tokens: {total_token_usage['input_tokens']}, "
    f"Output tokens: {total_token_usage['output_tokens']}, "
    f"Total cost: ${cost_metrics['total_cost']:.6f}"
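
The excerpt above repeats the same token-accumulation block after the initial response and after each loop iteration. One way to factor that out is sketched below. `add_usage`, `cost_from_usage`, and `EXAMPLE_PRICING` are hypothetical names, and the prices are placeholders rather than real model pricing; the project's own `calculate_cost_metrics` may compute its result differently.

```python
from types import SimpleNamespace


def add_usage(total, response):
    """Add a response's token counts to the running total, tolerating missing usage."""
    usage = getattr(response, "usage", None)
    if usage is not None:
        total["input_tokens"] += getattr(usage, "input_tokens", 0) or 0
        total["output_tokens"] += getattr(usage, "output_tokens", 0) or 0
    return total


def cost_from_usage(total, pricing_per_million):
    """Convert token totals to dollars given per-million-token prices."""
    input_cost = total["input_tokens"] / 1_000_000 * pricing_per_million["input"]
    output_cost = total["output_tokens"] / 1_000_000 * pricing_per_million["output"]
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": input_cost + output_cost,
    }


# Placeholder prices for illustration only, not real model pricing.
EXAMPLE_PRICING = {"input": 1.0, "output": 2.0}

total = {"input_tokens": 0, "output_tokens": 0}
# SimpleNamespace objects stand in for API responses so the sketch runs offline.
add_usage(total, SimpleNamespace(usage=SimpleNamespace(input_tokens=1_000, output_tokens=500)))
add_usage(total, SimpleNamespace(usage=None))  # a response without usage is ignored
add_usage(total, SimpleNamespace(usage=SimpleNamespace(input_tokens=2_000, output_tokens=500)))
metrics = cost_from_usage(total, EXAMPLE_PRICING)
```

With a helper like this, each `client.responses.create(...)` call can be followed by a single `add_usage(total_token_usage, response)` line.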

Comment on lines +11 to +13
# uv script (or plain Python) to generate results to CSV, run from the terminal
# Run from inside the container: docker compose exec backend python eval_assistant.py
#
Comment on lines +31 to +34
from api.views.assistant.assistant_services import run_assistant
# TODO: remove unused import or use INSTRUCTIONS to record an instructions_hash column
from api.views.assistant.assistant_prompts import INSTRUCTIONS
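
The TODO above mentions recording an instructions_hash column. A minimal sketch of one way to do that with `hashlib` follows. The helper name `instructions_hash`, the truncation length, and `EXAMPLE_INSTRUCTIONS` are assumptions; the PR does not show how (or whether) the hash was implemented.

```python
import hashlib


def instructions_hash(instructions, length=12):
    """Return a short, stable fingerprint of the prompt text."""
    digest = hashlib.sha256(instructions.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical prompt text standing in for the INSTRUCTIONS constant.
EXAMPLE_INSTRUCTIONS = "You are a helpful assistant."

# Each eval row could carry the fingerprint so results from different
# prompt versions are distinguishable when comparing CSVs.
row = {
    "question": "example question",
    "instructions_hash": instructions_hash(EXAMPLE_INSTRUCTIONS),
}
```

A truncated SHA-256 digest is enough to tell prompt versions apart in a results file while keeping the column readable.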

Comment on lines +120 to +122
"""Extract all function calls from the response, look up the corresponding tool function(s) and execute them.
(This would be a good place to handle asynchronous tool calls, or ones that take a while to execute.)
This returns a list of messages to be added to the conversation history.
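
For readers unfamiliar with the pattern this docstring describes, here is a hedged sketch of a function-call dispatcher. It assumes response items shaped like the OpenAI Responses API function-call items (`type`, `name`, `arguments` as a JSON string, `call_id`); the error handling and the stand-in objects are illustrative additions, not the code under review.

```python
import json
from types import SimpleNamespace


def invoke_functions_from_response(response, tool_mapping):
    """Run every function call in response.output and return the tool-output messages."""
    messages = []
    for item in response.output:
        if getattr(item, "type", None) != "function_call":
            continue  # skip reasoning items, text messages, etc.
        tool = tool_mapping.get(item.name)
        if tool is None:
            output = f"ERROR: no tool registered for {item.name!r}"
        else:
            try:
                output = tool(**json.loads(item.arguments))
            except Exception as exc:  # report the failure back to the model
                output = f"ERROR: {type(exc).__name__}: {exc}"
        messages.append(
            {"type": "function_call_output", "call_id": item.call_id, "output": str(output)}
        )
    return messages


def search_documents(query):
    """Hypothetical stand-in for the real retrieval tool."""
    return f"results for {query}"


# Stand-in response object so the sketch runs without network access.
fake_response = SimpleNamespace(
    output=[
        SimpleNamespace(type="reasoning"),
        SimpleNamespace(
            type="function_call",
            name="search_documents",
            arguments='{"query": "lithium"}',
            call_id="call_1",
        ),
        SimpleNamespace(type="function_call", name="unknown_tool", arguments="{}", call_id="call_2"),
    ]
)
messages = invoke_functions_from_response(fake_response, {"search_documents": search_documents})
```

Returning an error string as the tool output, instead of raising, lets the model see the failure and decide how to proceed.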
Comment on lines +200 to +204
# Mapping of the tool names we tell the model about and the functions that implement them
function_responses = invoke_functions_from_response(response, tool_mapping)
if len(function_responses) == 0:  # We're done reasoning
    logger.info("Reasoning completed")
    final_response_output_text = response.output_text
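
The loop these lines belong to can be exercised without network access by injecting the client and the dispatcher. The sketch below is an assumption about how such a helper could look, not the code in tool_services.py: `run_tool_loop`, `FakeResponses`, `fake_invoke`, and the `max_iterations` guard are all hypothetical.

```python
from types import SimpleNamespace


def run_tool_loop(client, response, invoke, max_iterations=10, **model_defaults):
    """Feed tool outputs back to the model until it stops requesting tools."""
    for _ in range(max_iterations):
        function_responses = invoke(response)
        if not function_responses:  # no tool calls left: reasoning is done
            return response.output_text, response.id
        response = client.responses.create(
            input=function_responses,
            previous_response_id=response.id,
            **model_defaults,
        )
    raise RuntimeError(f"tool loop did not finish within {max_iterations} iterations")


class FakeResponses:
    """Hypothetical stand-in for client.responses; returns queued responses in order."""

    def __init__(self, queued):
        self.queued = list(queued)
        self.calls = []

    def create(self, **kwargs):
        self.calls.append(kwargs)
        return self.queued.pop(0)


def fake_invoke(response):
    """Pretend every pending call succeeded; `pending` is an invented attribute."""
    return [
        {"type": "function_call_output", "call_id": call_id, "output": "ok"}
        for call_id in response.pending
    ]


first = SimpleNamespace(id="resp_1", output_text="", pending=["call_1"])
final = SimpleNamespace(id="resp_2", output_text="done", pending=[])
client = SimpleNamespace(responses=FakeResponses([final]))

text, response_id = run_tool_loop(client, first, fake_invoke)
```

Bounding the loop is a design choice worth considering over `while True`, since a model that keeps requesting tools would otherwise never return.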
@sahilds1 sahilds1 changed the title from "[WIP] [#490] Improve AI Evals" to "[#490] Improve AI Evals" Jun 19, 2026
@sahilds1 sahilds1 marked this pull request as ready for review June 19, 2026 18:13
@sahilds1 sahilds1 merged commit 3ad5a1e into CodeForPhilly:develop Jun 19, 2026
1 check passed
2 participants