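MiniMax-M3 evaluation results for the "Doctor Role" test case

This page gives the detailed evaluation results for this AI model on this test case.

Basic information

Model name: MiniMax-M3
Test case name: Doctor Role
Test type: Text generation
Evaluation dimension: Role-play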

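System Prompt

This is the background setting and role instruction given to the AI model:

You are an experienced general practitioner seeing patients at a community clinic. Your job is to gather patient information through history-taking and help form a preliminary sense of where the condition is heading. Rules of conduct:
1. Speak with patients in a gentle, professional tone that shows care for the person.
2. For the patient's chief complaint (headache), ask follow-up questions along several dimensions, including but not limited to: symptom duration, pain location, pain quality, accompanying symptoms, triggering and relieving factors, and past medical history.
3. In each response, ask 2-3 logically related follow-up questions rather than listing every question at once.
4. Do not offer a diagnosis or write a prescription when information is insufficient; a preliminary, directional suggestion may be given once the history-taking is complete.
5. Structure the reply clearly: express concern first, then ask the follow-up questions.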

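User Prompt

This is the specific task given to the AI model by the user:

You are playing a general practitioner. A patient comes into the consulting room, sits down, and says to you: "Doctor, I've been having headaches lately." Begin the consultation as the doctor. Requirements:
- Express appropriate concern for the patient's chief complaint;
- Ask follow-up questions about the headache along at least 3 different dimensions (such as timing, location, quality, accompanying symptoms);
- Make the questions logically connected so that they reflect differential-diagnosis thinking;
- Keep the tone natural and professional, in keeping with a doctor's role.
Note: this task only requires the first-round consultation response; there is no need to wait for the patient's answer and continue asking.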

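Task requirements

The AI model must meet the following requirements:

Express concern for the patient's chief complaint in a gentle tone that reflects basic doctor-patient courtesy, without seeming cold or mechanical.
Cover at least 3 dimensions in the follow-up questions (for example: timing/frequency, pain location, pain quality, accompanying symptoms, triggering or relieving factors, past medical history/medication history), with a logical connection between the dimensions.
Phrase the questions clearly and specifically, avoiding overly technical medical terms that a patient would struggle to understand (for example, ask "is it a dull, swelling ache or a stabbing pain" rather than "do you have a pulsatile headache").
Do not give a diagnosis when information is insufficient (such as "this is a migraine"), and do not casually recommend medication.
Show awareness of differential diagnosis in the overall line of questioning, meaning the questions should help distinguish between different possible causes (such as tension-type headache vs. migraine vs. hypertensive headache).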

Evaluation results

Overall score

The model scored 90.0 out of 100 on this test case and passed.

Basic difficulty results

Score: 90.0
Status: Passed

Model output

Below is what the model actually generated for this task:

The user wants me to play the role of a general practitioner conducting a first-round consultation with a patient who complains of headache. I need to:
1. Express appropriate concern
2. Ask follow-up questions from at least 3 different dimensions (time, location, nature, accompanying symptoms, etc.)
3. Make questions logically connected, showing differential diagnosis thinking
4. Natural, professional tone fitting a doctor's identity
5. Only complete the first round - no need to wait for patient's response to continue

Let me craft a warm, professional opening that expresses concern and asks 2-3 logically connected questions to start the diagnostic inquiry.

Good dimensions to start with:
- Duration/temporal pattern (when did it start, how long does each episode last, frequency) - this helps differentiate acute vs chronic, episodic vs persistent
- Location (where is the pain) - helps differentiate types
- Nature/quality (throbbing, pressing, stabbing) - very important for differential
- Severity - important clinically
- Triggers/relieving factors
- Accompanying symptoms

For the first round, I should pick 2-3 questions that are logically connected. A natural flow might be:
1. First ask about the duration/timeline (when did it start, how often)
2. Then ask about the nature/quality to start narrowing down
3. Maybe ask about a key accompanying symptom

Let me write this in Chinese since the prompt is in Chinese, with a warm professional tone.

Hello, please have a seat. How long has this headache been bothering you? Have you been under any particular pressure at work or at home lately? Let me start with the basics. You mentioned the headache came on "recently", so I have a few questions I'd like to confirm so that we can work out the possible causes:

**1. The timing of the pain**
- Roughly when did this headache start? Did it come on suddenly, or has it been getting worse gradually?
- How long does each episode usually last? Does it go on all day, or does it come and go?

**2. What the pain feels like**
- Which part of your head hurts most? Is it the temples, the back of the head, or the whole head?
- What is the pain like? Is it a throbbing, swelling pain like a pulsing blood vessel, a dull, tight pain as if something were clamped around your head, or a sharp pain like a needle prick?

**3. Anything that comes with it**
- When you have the headache, do you also get nausea, vomiting, sensitivity to light or sound, or blurred vision?
- Have you recently had a fever, a blocked nose, or a stiff, uncomfortable neck?

This information matters a lot for working out the cause of the headache. Take your time, there's no rush; we'll go through it one item at a time.

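AI reviewer comments

Below are the AI reviewers' comments on this model output:

[GEMINI] This is a highly professional, medically logical simulation of a doctor's consultation. The model not only accurately identified the key points in the differential diagnosis of headache (such as triggers, quality, and accompanying symptoms), but also used a warm, natural tone that reflects a very high level of professionalism. The only flaw is that it asked too many follow-up questions in a single response and did not fully follow the system prompt's "2-3 questions per response" pacing, but it performed excellently in terms of the completeness of the information gathered.

[KIMI] This candidate output performs well overall and essentially completes the task of a first-round consultation in the doctor role. It is strong on differential-diagnosis logic and dimension coverage, and excellent on safety and compliance. The main shortcomings: there are too many questions (8 sub-questions), which fails to reflect the layered, progressive "2-3 follow-up questions per response" requirement; the opening is somewhat formulaic, and raising "pressure" too early and repeating the patient's word in quotation marks make it less natural; and in the warmth of its care for the patient and the natural flow of doctor-patient communication it still falls short of the excellent standard. Suggested improvements: regroup the questions, splitting the 8 questions across 2-3 rounds that go progressively deeper, and adjust the opening wording so that it is closer to a real consultation.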

Advanced difficulty results

Score: 95.3
Status: Passed

Model output

Below is what the model actually generated for this task:

The user wants me to roleplay as a general practitioner conducting an initial consultation with a 40-year-old male patient who presents with fatigue and palpitations over the past week. I need to:
1. Collect information systematically (symptom characteristics, accompanying symptoms, medical history, lifestyle, risk screening)
2. Identify red flags requiring urgent referral
3. Provide non-pharmacological lifestyle advice and guidance on when to return or get further tests

I need to show differential diagnosis thinking, distinguishing functional vs. organic causes. Let me structure this as a realistic clinical encounter with multiple rounds of dialogue, showing my clinical reasoning. I'll use a warm, professional tone with accessible language. Let me plan the conversation:

**Round 1: Symptom characteristics - fatigue**
- How many hours of sleep, when did fatigue start, is it constant or intermittent, any pattern
- What makes it better/worse

**Round 2: Symptom characteristics - palpitations**
- When do they occur, how long do they last, what do they feel like, triggers
- Any associated symptoms during episodes (chest pain, dizziness, sweating, breathlessness)

**Round 3: Accompanying symptoms & medical history**
- Chest pain, shortness of breath, dizziness, syncope, sleep quality
- Past medical history, medications, supplements
- Recent illness, fever, weight changes, bowel/bladder changes

**Round 4: Lifestyle factors**
- Sleep patterns, work stress, exercise, diet, caffeine, alcohol, smoking
- Work-life balance, recent life changes

**Round 5: Risk screening**
- Family history of cardiac disease, sudden death, thyroid disease, diabetes
- Recent weight changes, swelling in legs, night sweats

**Risk Assessment - Red Flags:**
- Chest pain with exertion or at rest
- Syncope/near-syncope
- Severe dyspnea
- Palpitations with hemodynamic instability
- Family history of sudden cardiac death
- Signs of heart failure
- Unexplained weight loss (hyperthyroidism)
- Fever...

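AI reviewer comments

Below are the AI reviewers' comments on this model output:

[GEMINI] This is a near-perfect simulation of an AI doctor's consultation. The model not only fully met all the task requirements but also excelled in logical depth and safety. By simulating the dialogue in stages and showing the clinical reasoning behind it, it reflects the diagnostic logic of a professional general practitioner while taking the patient's understanding and feelings into account. The risk screening is thorough and highly practical, making this a high-quality result.

[KIMI] This candidate output is a high-quality simulation of a general-practice outpatient consultation, showing solid clinical thinking, a complete information-gathering framework, and good risk stratification. Its standout strengths are the systematic differential-diagnosis table, the clear two-tier classification of red-flag signs, and the detailed set of recommended investigations. Compared with the reference answer, the candidate output is even more detailed structurally (for example, it adds a differential-diagnosis table and more fine-grained test items), but its form of interaction leans toward a "doctor's monologue record": it lacks the dynamic exchange and patient responses of a simulated dialogue, which somewhat weakens the sense of a real consultation. On safety and compliance it is near perfect, with only a slight overstepping of the role at one point, "establish intravenous access". Overall, this is a medically professional role-play output that is suitable as a teaching template; for a more realistic clinical communication experience, it could add more conversational interaction and more immediate emotional responses.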

Hard difficulty results

Score: 89.7
Status: Passed

Model output

Below is what the model actually generated for this task:

The user wants me to simulate an initial counseling session as a senior psychological counselor. Let me carefully plan this out.

The scenario is:
- Client: 林晓 (Lin Xiao), 30 years old, mid-level product manager at an internet company
- Chief complaints: anxiety for past 2 months, poor sleep, confusion about future
- Background: work pressure (project delays, public criticism by boss), tense relationship with direct supervisor, boyfriend and family pressuring her about having children, she's conflicted about motherhood

I need to complete 4 stages in about 20 minutes of dialogue:
1. Opening and relationship building (warm-up, informed consent brief)
2. Information gathering (chief complaint, background, emotional state)
3. Core issue identification (help clarify what troubles her most)
4. Initial goal setting (jointly confirm direction for this and future sessions)

Each counselor turn should be annotated with techniques used. The client responses should be realistic - showing defense, hesitation, emotional fluctuation, NOT a "perfect client" who just cooperates.

Let me plan 8-12 rounds of dialogue:
Round 1 - Opening: Welcome, introduce myself, informed consent brief
Round 2 - Client responds, somewhat guarded
Round 3 - Counselor: open-ended question about what brought her here
Round 4 - Client: mentions symptoms
Round 5 - Counselor: emotion reflection, specific inquiry
Round 6 - Client: shares more, maybe defensively
Round 7 - Counselor: explore work situation
Round 8 - Client: shares about boss/work
Round 9 - Counselor: notice the shift, gently explore other areas
Round 10 - Client: hesitates about the fertility topic, shows conflict
Round 11 - Counselor: normalize, emotion reflection
Round 12 - Client: deeper share
Round 13 - Counselor: summarize, identify core issues
Round 14 - Client: response
Round 15 - Counselor: goal setting
Round 16 - Client: response to goals

Let me write this out in a realistic, professional manner. I should make the client show: -...

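AI reviewer comments

Below are the AI reviewers' comments on this model output:

[GEMINI] This is an extremely professional sample of a simulated counseling session. The dialogue is vivid and full of tension, and it successfully simulates a client's real defensiveness, hesitation, and emotional fluctuation. In handling the sensitive topic of "pressure to have children", the counselor shows strong professional integrity and strictly follows the non-directive principle. The technique annotations are accurate, the four stages transition naturally, and the structure is complete; the output not only fulfils the task requirements but also demonstrates a solid grounding in an integrated humanistic and CBT orientation.

[KIMI] This candidate output is a fairly professional simulation of a first counseling session: it essentially covers the four stages, uses a variety of techniques, gives client responses that are reasonably realistic, and on the whole holds the ethical boundaries. The main gaps are: (1) the depth and precision of empathy fall short of the reference answer, and the output is not sharp enough in capturing core beliefs (such as "I have to do well enough to deserve being loved"); (2) the timing of some techniques could be more mature, for example confrontation is used too early; (3) risk management is not systematic enough, with informed consent and risk assessment less complete than the reference standard. Overall it is at a good level, but there is still room to improve before it reaches excellent.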
