When improving AI-assisted chart creation in LivChart, we do not evaluate model selection only by general chat quality. The real question for us is more specific: how consistently can a model turn a Turkish analytics request into the right chart family, the right metrics, the right filters, and a working preview?
For the latest model trials, we ran 12 different models through the same Chart Wizard health-check suite. The test set includes 32 scenarios across bar, line, area, pie, scatter, bubble, heatmap, funnel, waterfall, treemap, sunburst, radar, candlestick, map, histogram, box, violin, gauge, and combo chart families. The suite also covers more than first-pass chart generation: filter preservation, chart rebuild flows, and edit-mode behavior are checked as well.
Test scope: 12 AI models, 32 Chart Wizard scenarios, Turkish user prompts, and real preview/render validation.
Evaluation axis: correct chart family, correct metric spec, filter consistency, renderable preview, and edit-mode stability.
Short Answer
The strongest overall result in this test round came from gemma4:31b-cloud. It completed 30 out of 32 scenarios successfully and produced the lowest number of failures. glm-5.1:cloud and kimi-k2.6:cloud followed closely with 29/32 successful scenarios. On speed, gpt-oss:120b-cloud was very strong, but it made more chart-family mistakes, so it ranked behind the accuracy leaders overall.
Our practical takeaway is simple: in an analytics product like LivChart, model choice cannot be based on response speed alone. Metric spec consistency, avoiding unexpected filter changes, and choosing the right chart family are just as important as latency.
The top line is clear
- Accuracy leader: gemma4:31b-cloud, with 30/32 successful scenarios.
- Balanced candidate: glm-5.1:cloud, with 29/32 success and 3.17 sec average unit duration.
- Speed candidate: gpt-oss:120b-cloud, with 1.85 sec average unit duration, but 5 failures.
Success view

gemma4:31b-cloud 30/32 ██████████████████████████████
glm-5.1:cloud 29/32 █████████████████████████████
kimi-k2.6:cloud 29/32 █████████████████████████████
deepseek-v4-flash 28/32 ████████████████████████████
gpt-oss:120b-cloud 27/32 ███████████████████████████
gemini-3-flash-preview 24/32 ████████████████████████
ministral-3:14b-cloud 9/32 █████████
How We Tested
The results are based on the latest Chart Wizard full-test reports from April 29-30, 2026. Each model was tested on the same sales fixture data, using Turkish prompts and the same 32-scenario health-check suite. For each scenario, we checked whether the model output could be converted into a metric spec, whether the preview rendered, whether the expected chart family was selected, whether filters were preserved, and whether edit/rebuild flows behaved correctly.
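To make the per-scenario checks concrete, here is a minimal sketch of how one scenario result could be recorded and scored. The type and field names are illustrative assumptions, not LivChart's actual report schema.

```typescript
// Illustrative sketch only: field names do not come from LivChart's real report schema.
interface ScenarioResult {
  scenarioId: string;          // e.g. a funnel or heatmap scenario id
  model: string;               // e.g. "gemma4:31b-cloud"
  specParsed: boolean;         // model output converted into a structural metric spec
  previewRendered: boolean;    // preview engine produced renderable data
  chartFamilyCorrect: boolean; // expected family (bar, funnel, heatmap, ...) was selected
  filtersPreserved: boolean;   // no filters added or dropped relative to the request
  editFlowStable: boolean;     // rebuild/edit-mode flows behaved as expected
  durationSec: number;         // unit duration used for the latency averages
}

// A scenario counts as successful only when every check passes.
function isSuccess(r: ScenarioResult): boolean {
  return (
    r.specParsed &&
    r.previewRendered &&
    r.chartFamilyCorrect &&
    r.filtersPreserved &&
    r.editFlowStable
  );
}
```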
This is not a general AI benchmark. It is a product-level comparison for LivChart's real chart-generation workflow. In our context, the "best model" is not the most popular model in general; it is the model that requires the least manual correction inside a BI product.
Important distinction: This is not a general LLM leaderboard. It is a product-focused test for chart generation, preview, and edit workflows inside LivChart Chart Wizard.
Model Comparison Summary
Group 1: Closest to production use
- gemma4:31b-cloud: 30/32 successful. 2 failures. Total duration 379 sec, average unit duration 5.27 sec. The strongest accuracy result.
- glm-5.1:cloud: 29/32 successful. 3 failures. Total duration 228 sec, average unit duration 3.17 sec. A strong balance of accuracy and speed.
- kimi-k2.6:cloud: 29/32 successful. 3 failures. Total duration 538 sec, average unit duration 7.48 sec. Good accuracy, but higher latency.
Group 2: Good models that still need guards
- deepseek-v4-flash:cloud: 28/32 successful. 4 failures. Total duration 376 sec, average unit duration 5.23 sec. Promising, but consistency should be monitored.
- minimax-m2.7:cloud: 28/32 successful. 4 failures. Total duration 434 sec, average unit duration 6.03 sec. Generally good, but struggled with heatmap/funnel decisions.
- nemotron-3-super:cloud: 28/32 successful. 4 failures. Total duration 156 sec, average unit duration 2.16 sec. Fast and usable, with some chart-family drift.
- qwen3.5:cloud: 28/32 successful. 4 failures. Total duration 516 sec, average unit duration 7.16 sec. Medium-to-good accuracy with higher latency.
- gpt-oss:120b-cloud: 27/32 successful. 5 failures. Total duration 133 sec, average unit duration 1.85 sec. Very fast, but made chart-family mistakes in scatter, heatmap, and funnel decisions.
Group 3: Weaker results in this round
- deepseek-v4-pro:cloud: 26/32 successful. 6 failures. Total duration 671 sec, average unit duration 9.32 sec. In this test, it was both slower and less accurate than the flash variant.
- gemini-3-flash-preview:cloud: 24/32 successful. 8 failures. Total duration 241 sec, average unit duration 3.35 sec. Reasonable speed, but too many metric spec and preview render failures.
- nemotron3:33b: 21/32 successful. 11 failures. Total duration 46 sec, average unit duration 0.64 sec. Very fast, but not reliable enough for the product flow because of metric spec gaps.
- ministral-3:14b-cloud: 9/32 successful. 23 failures. Total duration 141 sec, average unit duration 1.96 sec. The clearest issue was unexpected filter generation.
Speed and accuracy together
| Model | Success | Avg unit duration | Short comment |
|---|---|---|---|
| gemma4:31b-cloud | 30/32 | 5.27 sec | Best accuracy |
| glm-5.1:cloud | 29/32 | 3.17 sec | Best balanced profile |
| gpt-oss:120b-cloud | 27/32 | 1.85 sec | Very fast, needs more guards |
| nemotron-3-super:cloud | 28/32 | 2.16 sec | Fast, some family drift |
| deepseek-v4-pro:cloud | 26/32 | 9.32 sec | Behind the flash variant in this round |
What the Failure Patterns Tell Us
When we reviewed the failed scenarios one by one, the failures were not random. The first major category was preview_render. In these cases, the model produced a spec, but the preview engine could not generate renderable data for the requested chart. Radar and candlestick scenarios triggered this type of issue most often.
The second major category was chart_family. These are cases where the model understood the broad intent but selected the wrong chart family. For example, some models chose bar or waterfall when the expected flow visualization was funnel. Similarly, in country/productline intensity scenarios where heatmap was expected, some models selected bar charts.
The third and most disruptive category was metric_spec_missing. In these cases, the structural spec header required by Chart Wizard was missing, so the preview workflow could not start. For LivChart, this is not only a quality issue; it is a user-experience interruption.
Ministral showed a different pattern: it added unexpected filters in many scenarios. This is especially risky in analytics products. A chart may still render, but the data scope may silently change without the user noticing.
Most critical failure type: Unexpected filter generation. Even if the chart renders, the user may reach the wrong conclusion because the data scope has changed.
Main failure types
- preview_render: A spec exists, but the preview cannot produce renderable data.
- chart_family: The model selects the wrong chart family, such as bar or waterfall instead of funnel.
- metric_spec_missing: The structural metric spec is missing, so the flow cannot begin.
- unexpected_filters: The model adds filters the user did not request and changes the data scope.
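These categories map naturally onto the same per-scenario checks. Below is a hedged sketch of a classifier that reuses the illustrative ScenarioResult shape from the testing section; the labels and ordering are assumptions for clarity, not the real report format.

```typescript
// The four failure categories observed in this round (illustrative labels).
type FailureType =
  | "metric_spec_missing"
  | "preview_render"
  | "chart_family"
  | "unexpected_filters";

// Classify a failed scenario into the first matching category.
// Order matters: a missing metric spec blocks everything downstream,
// so it is checked before render, family, and filter issues.
function classifyFailure(r: ScenarioResult): FailureType | null {
  if (!r.specParsed) return "metric_spec_missing";
  if (!r.previewRendered) return "preview_render";
  if (!r.chartFamilyCorrect) return "chart_family";
  if (!r.filtersPreserved) return "unexpected_filters";
  return null; // passed these checks; edit-flow issues would need their own bucket
}
```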
The Practical Decision for LivChart
The first decision from this test set is that model selection should be split into two categories. The first is the default production model, where accuracy and spec consistency matter most. The second is a fast alternative model, where latency can be prioritized, but chart-family guards and automatic correction need to be more active.
From this perspective, gemma4:31b-cloud stands out as the accuracy leader. glm-5.1:cloud is a strong candidate because of its more balanced speed/accuracy profile. gpt-oss:120b-cloud is valuable for flows where fast response matters, but using it as the default model without chart-family guards would be risky.
LivChart's approach is not to depend blindly on one model. In AI-assisted BI products, the more robust architecture is to validate model output through metric spec checks, preview rendering, filter preservation, and chart-family guards before showing the result to the user. When a model makes a mistake, LivChart can trigger automatic correction or a safe fallback instead of showing a broken chart.
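As an illustration of that validate-then-correct idea, here is a minimal sketch of a guard pipeline. The types and function names (ChartCandidate, runGuards, and so on) are placeholders for whatever LivChart exposes internally, not a real API.

```typescript
// Placeholder types: not LivChart's internal API.
interface ChartCandidate {
  metricSpec: Record<string, unknown> | null; // structural spec parsed from the model output
  chartFamily: string;                        // e.g. "funnel", "heatmap"
  filters: string[];                          // filters present in the generated spec
}

interface RequestContext {
  expectedFamily?: string;    // family implied by the user's request, if known
  requestedFilters: string[]; // filters the user actually asked for
}

type GuardResult =
  | { kind: "accepted" }
  | { kind: "needs_correction"; reason: string }
  | { kind: "fallback"; reason: string };

// Spec-level guards run before the result is shown; preview rendering itself
// is validated afterwards, once data has been fetched for the chart.
function runGuards(candidate: ChartCandidate | null, ctx: RequestContext): GuardResult {
  // 1. Metric spec check: without a structural spec the preview flow cannot start.
  if (!candidate || candidate.metricSpec == null) {
    return { kind: "fallback", reason: "metric_spec_missing" };
  }
  // 2. Chart-family guard: a wrong family can often be corrected automatically.
  if (ctx.expectedFamily && candidate.chartFamily !== ctx.expectedFamily) {
    return { kind: "needs_correction", reason: "chart_family" };
  }
  // 3. Filter preservation: a silently changed data scope is the riskiest failure.
  const unexpected = candidate.filters.filter(f => !ctx.requestedFilters.includes(f));
  if (unexpected.length > 0) {
    return { kind: "needs_correction", reason: "unexpected_filters" };
  }
  return { kind: "accepted" };
}
```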
Recommended usage matrix
- Default production candidate: gemma4:31b-cloud.
- Balanced alternative: glm-5.1:cloud.
- Speed-focused alternative: gpt-oss:120b-cloud, with chart-family guards enabled.
- Models to keep watching: deepseek-v4-flash:cloud, nemotron-3-super:cloud, minimax-m2.7:cloud.
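Expressed as configuration, that matrix might look like the sketch below. Every key, value, and function name here is a hypothetical illustration, not LivChart's actual settings.

```typescript
// Hypothetical routing config for the Chart Wizard; names are illustrative only.
const chartWizardModels = {
  default: {
    model: "gemma4:31b-cloud",   // accuracy leader in this round (30/32)
    guards: ["metric_spec", "filters"],
  },
  balanced: {
    model: "glm-5.1:cloud",      // 29/32 at a 3.17 sec average unit duration
    guards: ["metric_spec", "filters"],
  },
  fast: {
    model: "gpt-oss:120b-cloud", // 1.85 sec average, only with chart-family guards
    guards: ["metric_spec", "filters", "chart_family"],
  },
} as const;

// Latency-sensitive flows can opt into the fast profile; everything else uses the default.
function pickProfile(latencySensitive: boolean) {
  return latencySensitive ? chartWizardModels.fast : chartWizardModels.default;
}
```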
Next Step
In the next test round, we will put more weight on radar, candlestick, heatmap, and funnel scenarios. These chart families require the model not only to select the right columns, but also to read the analytical intent correctly and follow LivChart's metric spec contract precisely.
This test round made one thing clear: AI quality in BI products is not measured only by fluent text. The model must connect the right data to the right chart family with the right filters and a renderable structural spec. That is why LivChart evaluates model performance inside the actual product workflow, including working previews and edit scenarios.