fix: resolve all critical runtime errors and bugs from audit

- Add COMPLETIONS_API_KEY to config.py (env var + auto-generated fallback)
- Fix perplexity auto-search: upstream sends logprobs=true, parse_llama_stream_chunk extracts per-token logprobs, all_logprobs populated during streaming
- Fix all /api/models endpoints to target LLAMA_SERVER_BASE (port 8081) not OLLAMA_BASE
- Fix RAG embedding endpoint URL from port 11434 (Ollama) to 8081 (llama-server)
- Correct misleading error messages: 'inference server' not 'Ollama'
- Remove raw_results leak from SSE event stream in /api/search
- Fix weather query extractor: pattern-match instead of unconditional suffix append
- Escape FTS5 operator keywords (AND/OR/NOT/NEAR) in memory search
- Move auth.py BODY_LIMIT_DEFAULT_BYTES imports to module level
- Change RAG injection log level from warning to info
- Fix all 8 test files after modular refactor (rewire imports from correct modules)
- Update AGENTS.md and README.md to reflect v1.8.0 changes
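
For reviewers unfamiliar with the FTS5 item above: the memory-search change itself is not shown in this part of the diff, so the following is only a minimal sketch of one common approach, not the code in this commit. It assumes a hypothetical helper name, `escape_fts5_query`, and quotes every whitespace-separated token rather than only the keywords. In FTS5 query syntax a double-quoted string is a literal, and an embedded double quote is escaped by doubling it, so AND/OR/NOT/NEAR lose their operator meaning once quoted.

```python
def escape_fts5_query(raw: str) -> str:
    """Turn free-form user text into an FTS5 MATCH expression of literals.

    Hypothetical helper for illustration. Each token is wrapped in double
    quotes so FTS5 treats it as a string, never as an operator keyword.
    """
    quoted_terms = []
    for term in raw.split():
        # FTS5 escapes a literal double quote inside a string by doubling it.
        escaped = term.replace('"', '""')
        quoted_terms.append(f'"{escaped}"')
    # Space-separated strings are implicitly ANDed together by FTS5.
    return " ".join(quoted_terms)


if __name__ == "__main__":
    print(escape_fts5_query("cats AND dogs"))
```

The result would be passed as a bound parameter (for example `WHERE memory MATCH ?`, with a hypothetical table name) rather than interpolated into the SQL. Quoting every token is broader than the commit message describes, but it also neutralizes other query syntax such as `*` and `-` that unquoted user input could trigger.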
2026-06-27 15:10:32 -07:00
parent 41a8708c0d
commit 193829b7ff
20 changed files with 457 additions and 896 deletions
--- a/routers/completions.py
+++ b/routers/completions.py
@@ -178,7 +178,7 @@ async def _stream_chat(payload: dict, model: str, conv_id: str, request: Request
                async for line in resp.aiter_lines():
                    if not line.strip():
                        continue
-                    token, done, _ = parse_llama_stream_chunk(line)
+                    token, done, _, _ = parse_llama_stream_chunk(line)
                    if token:
                        full_response.append(token)
                        yield _build_openai_chunk(token, model, conv_id)
@@ -222,7 +222,7 @@ async def _blocking_chat(payload: dict, model: str, conv_id: str, request: Reque
                async for line in resp.aiter_lines():
                    if not line.strip():
                        continue
-                    token, done, _ = parse_llama_stream_chunk(line)
+                    token, done, _, _ = parse_llama_stream_chunk(line)
                    if token:
                        full_response.append(token)
                    if done: