Open Source Model Benchmark: Financial Function Calling Polygon.io Testset 20240425–1

Testset: 20240423–1 Financial API Function Calling
Test Plan Details
Use an LLM to generate a function call from the given context. The dataset contains question-and-answer pairs whose answers are clear, unique, and verifiable. Each question is asked multiple times, and correctness is calculated from the models' answers. The dataset and prompts may be modified to achieve better results.
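As a rough sketch of how this scoring could be computed (the trials structure and score_model helper below are illustrative, not the actual harness), exact string match against the expected endpoint is counted over every repeated ask:

from collections import defaultdict

def score_model(trials):
    # trials: hypothetical list of (question_id, expected_endpoint, model_answer)
    # tuples; each question_id appears multiple times because every question
    # is asked repeatedly.
    per_question = defaultdict(lambda: [0, 0])  # question_id -> [correct, total]
    for qid, expected, answer in trials:
        per_question[qid][1] += 1
        if answer.strip() == expected.strip():
            per_question[qid][0] += 1
    correct = sum(c for c, _ in per_question.values())
    total = sum(t for _, t in per_question.values())
    return {
        "per_question": {q: c / t for q, (c, t) in per_question.items()},
        "overall": correct / total if total else 0.0,
    }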
Test Set Details
Use LLMs to generate a Polygon.io function-call endpoint based on the prompted questions and instructions. The question set has been revised with detailed instructions about the date format and the meaning of multiplier and timespan. 20 questions in total.
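For illustration only (this record is made up, not one of the 20 questions), a question about aggregate bars maps multiplier, timespan, and the yyyy-mm-dd dates directly into the Polygon.io aggregates path:

# Hypothetical example record in the style of this test set.
example_record = {
    "question": "Get daily aggregate bars for AAPL from 2024-01-02 to 2024-01-31.",
    "instruction": "multiplier is an integer. timespan can be one of minute, "
                   "hour, day, week, month, year. from and to are dates in "
                   "the format yyyy-mm-dd.",
    "answer": "/v2/aggs/ticker/AAPL/range/1/day/2024-01-02/2024-01-31",
}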
Date Published 2024–05–07
Methodology
See our blog post to learn more about our methodology.
Key Findings
- qwen:32b beats yi:34b and command-r:35b
- llama3:8b q4, q6, q8, and fp16 quantization perform almost the same, q2 lower
- qwen:110b q4 and q2 beat large models command-r-plus:104b and dbrx:132b
- gemma:2b-v1.1 q8 beats fp16 and q4
- gemma:7b-v1.1 q8 and fp16 quantization perform almost the same, q4 lower, q2 lowest
- gemma:7b-v1.0 beats gemma:7b-v1.1
- mixtral:8x7b beats mixtral:8x22b
- wizardlm2:8x22b MoE beats mixtral:8x7b and 8x22b
- wizardlm2:8x22b MoE q8 quantization is not as good as q4
- the dolphin-llama3 fine-tune degrades performance
- llama3:70b quantization makes little difference, q2 lower
- the llama3:70b family beats other 70b models
- gemma:7b-v1.1, mistral:7b, openchat:7b, and llama3:8b beat other 7b and 8b models
- qwen:110b is better than 32b and other sizes; 32b beats 72b
- llama3:70b is the best performer, followed by wizardlm2:8x22b and qwen:110b
Datasets
Polygon.io API Function Calling Q&A Dataset
Contains questions and answers for calling Polygon.io API functions. Answers are simple and can be verified programmatically. Generated by ChatGPT.
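Because each answer is a plain endpoint path, checking a model reply only needs JSON parsing and a string comparison. A minimal sketch, assuming the model follows the prompt and returns {"answer": "..."} (the is_correct helper is illustrative):

import json

def is_correct(model_output: str, expected_endpoint: str) -> bool:
    # Treat anything that is not valid {"answer": "..."} JSON as incorrect.
    try:
        answer = json.loads(model_output).get("answer", "")
    except (json.JSONDecodeError, AttributeError):
        return False
    # Compare endpoint paths, ignoring surrounding whitespace and a trailing slash.
    return answer.strip().rstrip("/") == expected_endpoint.strip().rstrip("/")

For example, is_correct('{"answer": "/v2/somefunction"}', "/v2/somefunction") returns True, while a reply wrapped in extra prose returns False.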
Prompt Template
You are a financial application developer. The question below asks about a Polygon.io API endpoint. Answer the question based on the instruction provided. If you don't know the answer, just answer that you don't know.
Question: %question%
Instruction: %instruction%
multiplier is an integer. timespan can be one of minute, hour, day, week, month, year. from and to are dates in the format yyyy-mm-dd. Respond only in JSON format as {"answer": "some endpoint"} without any other text. No explanation is required.
No host name (http://hostname) is required. Answer like {"answer": "/v2/somefunction"}
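The model tags in this benchmark (for example llama3:8b, qwen:32b) follow Ollama naming, so one plausible way to fill the template and query a locally served model is Ollama's /api/generate endpoint; this is a sketch under that assumption, not the exact harness used for these results:

import requests

PROMPT_TEMPLATE = (
    "You are a financial application developer. The question below asks about "
    "a Polygon.io API endpoint. Answer the question based on the instruction "
    "provided. If you don't know the answer, just answer that you don't know.\n"
    "Question: {question}\n"
    "Instruction: {instruction}\n"
)

def ask(model: str, question: str, instruction: str,
        host: str = "http://localhost:11434") -> str:
    # Fill the template and request a single non-streamed completion.
    resp = requests.post(
        f"{host}/api/generate",
        json={
            "model": model,  # e.g. "llama3:70b" or "qwen:32b"
            "prompt": PROMPT_TEMPLATE.format(question=question, instruction=instruction),
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

The raw completion can then be passed to a checker like the one sketched under the dataset section above.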
Results by Models
Test Results
Displaying the first 100 rows. The full data can be downloaded from our GitHub repository.