In-House Legal Bench: Evaluating AI Assistants for In-House Legal Work

We are excited to announce the In-House Legal Bench, a new comprehensive benchmark designed to evaluate AI performance on legal tasks that reflect the day-to-day work of in-house lawyers.

The benchmark tests whether AI tools, including GC AI, produce correct, well-structured legal analysis and work product by grading their responses against attorney-developed answer keys. We use the results to drive product improvements for GC AI and to show how it compares to general-purpose AI assistants.

For this benchmark, we evaluated GC AI against three leading AI assistants: ChatGPT (GPT-5.5), Claude (Opus 4.7), and Gemini (3.1 Pro). GC AI outperformed all three across every legal task category, leading the closest competitor, ChatGPT, by 7 percentage points overall.

Our Approach

Focusing on tasks reflecting real in-house legal work, designed by experienced practitioners, in end-user applications

The prompts we used in this benchmark represent the tasks and issues that in-house lawyers are expected to handle and oversee. In-house lawyers - counsel who work for companies as salaried employees, as opposed to lawyers who are employed by law firms and typically bill by the hour - serve as primary advisors to the company and often juggle many tasks, including conducting legal research, drafting legal documents, advising on risks associated with new products or services, and managing regulatory compliance programs. They value advice that is direct, actionable, and business-focused because their goals are the company's business goals. That perspective - of the law and legal strategy as an input to company strategy - shapes the questions that in-house lawyers pose and the nature of their tasks and workflows.

Designing and reviewing evaluations that reflect these daily realities takes skill and experience. Our team of six R&D Attorneys, with a combined 80+ years of professional experience at leading companies and law firms, developed the materials used in this benchmark. Every task and answer key was created and vetted by this team, then implemented in close partnership with GC AI's Applied AI engineering team.

Tasks and Prompts

Designing tasks and prompts

We developed 100 unique in-house legal tasks for this benchmark, which fall into the following 10 categories: 

| Category | Example Task |
| --- | --- |
| Drafting | Drafting a jurisdiction-compliant return-to-office policy for employees transitioning back from remote work |
| Summarizing Documents | Producing an executive briefing of a recent trial opinion, distilling key findings and implications from the ruling |
| Contract Analysis | Explaining the scope and restrictions of an IP license agreement in plain English |
| Legal Research | Describing SEC Schedule 13D beneficial ownership reporting requirements and recent amendments |
| Legal Strategy | Assessing CPSC reporting obligations and recall options for a smart home device |
| Risk Assessment | Identifying risks in an arbitration clause with no IP carve-outs in a supply agreement |
| Comparison/Benchmarking | Comparing supplier codes of conduct across several direct competitors |
| Extracting Information/Data | Extracting executive compensation data from a company’s proxy statement into a table |
| Regulatory Tracking | Mapping federal consumer protection rules and analogous state laws into a compliance chart |
| Checklists | Producing a GDPR compliance checklist for a SaaS company |

Based on the nature of the legal tasks, some were classified into more than one category. For example, a drafting task may involve legal research and would be classified as both a drafting task and a legal research task. 

The tasks also cover a wide range of legal topics and domains that fall within an in-house lawyer's scope, including commercial transactions, consumer protection and product safety, corporate and securities work, employment and labor law, intellectual property, international and cross-border issues, litigation and dispute resolution, privacy and data protection, regulatory compliance, and sustainability. We recognize that the nature of in-house legal work spans beyond these topics and domains, and we expect to expand the areas our tasks cover in subsequent studies. 

Task Structure

Each task consists of a prompt (what we ask the AI assistant to do) and any applicable source documents. The source documents could be contracts, policies, or other materials that the AI assistant would typically be given to process the request, provided as URLs, PDFs, or Word files.

We drafted the prompts to reflect how in-house lawyers would query AI tools: with concise, natural-language requests. Following this approach, all evaluations in this benchmark consist of a single-turn conversation: a prompt related to the task followed by a response from GC AI (or another AI assistant). We did not simplify the tasks we created to make them easier for a tool to complete in a single turn.

Evaluation Methodology

Scoring Rubric

Each task has a distinct answer key, a structured list of criteria that specifies the elements and characteristics of a high-quality, correct response (e.g., correct facts, accurate legal analysis, appropriate language). The answer keys for all our tasks were created by lawyers to be legally accurate and representative of responses that skilled and qualified in-house lawyers would produce or expect to see. In addition to task-specific criteria, several baseline response-quality criteria, which assess for appropriate depth, professional tone, and actionability, are automatically applied to every task. Combined, the answer keys average 12 criteria per task, totaling over 1,200 criteria across our 100 legal tasks.
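As a rough illustration of the structure described above, an answer key can be thought of as a list of task-specific criteria with a shared set of baseline criteria appended to every task. The sketch below is only that: the class names, field names, and the three baseline entries (paraphrasing depth, tone, and actionability) are hypothetical and are not GC AI's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    description: str        # what the response must include to pass
    baseline: bool = False  # True for shared response-quality criteria


# Hypothetical baseline criteria, paraphrasing the qualities named in the text.
BASELINE_CRITERIA = (
    Criterion("Response has appropriate depth for the request", baseline=True),
    Criterion("Response uses a professional tone", baseline=True),
    Criterion("Response is actionable", baseline=True),
)


@dataclass
class AnswerKey:
    task_id: str
    task_criteria: list[Criterion] = field(default_factory=list)

    def all_criteria(self) -> list[Criterion]:
        # Baseline criteria are added automatically to every task's key.
        return list(self.task_criteria) + list(BASELINE_CRITERIA)


key = AnswerKey(
    task_id="example-task",  # hypothetical identifier
    task_criteria=[Criterion("Correctly identifies the effective date of the law")],
)
criteria = key.all_criteria()
print(len(criteria))  # one task-specific criterion plus three baseline criteria
```

Keeping the baseline criteria in one shared place, rather than copying them into each key, is one way to make sure every task is scored against the same quality bar.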

Each criterion in an answer key describes what the response must include or provide to pass. For example, an answer key may require a response to correctly identify the effective date of Delaware’s new pay transparency law. Stating the wrong effective date, or omitting the date entirely, would be a fail. As another example, an answer key may require the response to include a detailed pros and cons analysis of three litigation response options, with at least two pros and two cons for each, grounded in the specific facts of the complaint. A response that offered only generic strategy advice or a cursory analysis, without case-specific details, would fail.

The answer keys use clear, concrete language to describe what the response must include to pass and provide explicit quantifiers and connectives (e.g., “at least two of”, “and”) when a criterion requires multiple elements for the response to pass.
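To make the role of quantifiers concrete, here is a minimal sketch of how a multi-element criterion could be checked once the elements found in a response are known. The `MultiElementCriterion` class and the placeholder element names are assumptions for illustration. In the benchmark itself, an LLM Judge applies the answer keys, not a rule like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiElementCriterion:
    elements: tuple[str, ...]  # the required elements named in the criterion
    min_required: int          # "and" -> len(elements); "at least two of" -> 2

    def passes(self, found: set[str]) -> bool:
        # Count how many of the required elements appear in the response.
        hits = sum(1 for element in self.elements if element in found)
        return hits >= self.min_required


# "X and Y": every element is required.
both = MultiElementCriterion(elements=("X", "Y"), min_required=2)

# "at least two of X, Y, Z".
two_of_three = MultiElementCriterion(elements=("X", "Y", "Z"), min_required=2)

print(both.passes({"X"}))               # False: Y is missing
print(two_of_three.passes({"X", "Z"}))  # True: two of the three are present
```

Writing the threshold into the criterion removes ambiguity about whether a partially complete response should pass.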

Since responses also need to be concise and practical, not just accurate, the baseline response-quality criteria, scored on every task, evaluate a response’s substance, tone, and presentation. Does the response reflect what an in-house lawyer would consider to be practical, skillful, and polished work product? An in-house lawyer who simply asks for an initial draft policy to review is unlikely to want a detailed supplemental memo explaining the policy’s components.

Scoring Responses

We used an LLM-as-judge (“LLM Judge”) to score the responses generated by GC AI and the AI assistants for each of the prompts. The LLM Judge was given instructions on how to grade the responses, including instructions on how to interpret and apply the answer keys from an in-house perspective. In addition, for criteria containing multiple elements (e.g., X and Y), the judge was instructed to check that the response contained all required elements to issue a pass. The LLM Judge evaluated the user-facing text response and the artifacts (e.g., a drafted email or generated document) that GC AI or the AI assistants generated for the prompt.

The LLM Judge graded each response against the answer key and produced a structured output with criteria-level binary pass/fail verdicts. These were supported by collected evidence and reasoning so our attorneys could review the LLM Judge’s performance.
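A structured verdict of this kind could be represented and validated roughly as follows. This is a sketch under stated assumptions: the JSON layout, the field names (`criterion_id`, `passed`, `evidence`, `reasoning`), and the sample records are hypothetical, since the benchmark's actual output format is not published here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    criterion_id: str
    passed: bool    # binary pass/fail
    evidence: str   # text the judge collected from the response
    reasoning: str  # why the judge reached the verdict


def parse_verdicts(raw: str) -> list[Verdict]:
    """Parse judge output and reject anything that is not a strict pass/fail."""
    verdicts = []
    for record in json.loads(raw):
        if not isinstance(record["passed"], bool):
            raise ValueError(f"non-binary verdict for {record['criterion_id']}")
        verdicts.append(
            Verdict(
                criterion_id=record["criterion_id"],
                passed=record["passed"],
                evidence=record["evidence"],
                reasoning=record["reasoning"],
            )
        )
    return verdicts


# Hypothetical judge output for one response.
raw_output = json.dumps([
    {"criterion_id": "c1", "passed": True,
     "evidence": "States the effective date.", "reasoning": "Date is present and correct."},
    {"criterion_id": "c2", "passed": False,
     "evidence": "", "reasoning": "No case-specific analysis found."},
])

verdicts = parse_verdicts(raw_output)
print([v.passed for v in verdicts])  # [True, False]
```

Storing evidence and reasoning next to each verdict is what lets a human reviewer audit individual decisions instead of only the aggregate score.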

Using the pass/fail verdicts, we calculated criteria pass rates by legal task category (e.g., contract analysis, drafting) as well as an overall pass rate across all tasks.
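The aggregation described above might look like the following sketch. It assumes that rates are pooled over criteria (criteria passed divided by criteria scored) and that a task assigned to several categories contributes its verdicts to each of them. Both are assumptions, and the sample tasks and verdicts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical scored tasks: (categories, criterion-level pass/fail verdicts).
scored_tasks = [
    (["Drafting", "Legal Research"], [True, True, False, True]),
    (["Contract Analysis"], [True, False]),
    (["Legal Research"], [True, True, True, False]),
]


def pass_rate(verdicts):
    """Percentage of criteria passed."""
    return 100.0 * sum(verdicts) / len(verdicts)


def overall_rate(tasks):
    # Pool every criterion from every task exactly once.
    pooled = [v for _, verdicts in tasks for v in verdicts]
    return pass_rate(pooled)


def rates_by_category(tasks):
    # A multi-category task counts toward each of its categories.
    pooled = defaultdict(list)
    for categories, verdicts in tasks:
        for category in categories:
            pooled[category].extend(verdicts)
    return {category: pass_rate(v) for category, v in pooled.items()}


overall = overall_rate(scored_tasks)
by_category = rates_by_category(scored_tasks)
print(overall)      # 70.0
print(by_category)  # Drafting 75.0, Legal Research 75.0, Contract Analysis 50.0
```

Because multi-category tasks are counted more than once in the per-category view, the category rates are not expected to average out to the overall rate.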

To understand how the LLM Judge’s performance compared to human expert review and address potential concerns about LLM-as-judge reliability, we also manually scored a set of responses that had been previously scored by the LLM Judge. Using this data, we determined there was substantial alignment between LLM Judge scoring and human expert scoring, giving us confidence in our scoring process.
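The text does not state which alignment measure was used. Two common choices for comparing binary verdicts are raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The sketch below computes both on invented verdicts. It is an illustration of the general technique, not a reconstruction of the benchmark's validation.

```python
def raw_agreement(a, b):
    """Fraction of items on which the two raters gave the same verdict."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters issuing binary verdicts."""
    n = len(a)
    observed = raw_agreement(a, b)
    pass_a = sum(a) / n
    pass_b = sum(b) / n
    # Agreement expected if both raters passed/failed items independently.
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        return 1.0  # both raters gave one identical constant verdict
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same ten criteria.
judge = [True, True, False, True, False, True, True, False, True, True]
human = [True, True, False, True, True, True, True, False, False, True]

agreement = raw_agreement(judge, human)
kappa = cohens_kappa(judge, human)
print(agreement)        # 0.8
print(round(kappa, 3))  # 0.524
```

Kappa is lower than raw agreement here because both raters pass most criteria, so a fair amount of agreement would occur by chance alone.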

Results: GC AI vs ChatGPT, Claude, and Gemini Across 10 Legal Task Categories

GC AI achieved an overall pass rate of 86.8% across all criteria scored over the 100 tasks in our benchmark, compared to 79.8% for ChatGPT, 68.4% for Claude, and 57.5% for Gemini. GC AI also outperformed all three AI assistants in each of the 10 categories of legal tasks that we measured, with the largest advantages appearing in research-intensive tasks.

Overall Pass Rates

Percentage of answer key criteria passed across 100 legal tasks.

| AI Tool | Pass Rate |
| --- | --- |
| GC AI | 86.8% |
| ChatGPT | 79.8% |
| Claude | 68.4% |
| Gemini | 57.5% |

Pass Rates by Legal Task Category

Criteria pass rates grouped by legal task category.

| Legal Task Category | Tasks | GC AI | ChatGPT | Claude | Gemini |
| --- | --- | --- | --- | --- | --- |
| Drafting | 19 | 87.6% | 83.4% | 74.9% | 66.4% |
| Summarizing Documents | 12 | 81.6% | 77.5% | 63.7% | 57.5% |
| Contract Analysis | 13 | 82.7% | 72.8% | 66.3% | 42.9% |
| Legal Research | 23 | 88.3% | 75.6% | 66.2% | 61.7% |
| Legal Strategy | 16 | 86.3% | 84.5% | 63.0% | 58.0% |
| Risk Assessment | 26 | 89.0% | 84.2% | 71.1% | 59.2% |
| Comparison/Benchmarking | 9 | 91.4% | 84.7% | 81.4% | 72.9% |
| Extracting Information/Data | 24 | 82.0% | 76.9% | 57.0% | 56.3% |
| Regulatory Tracking | 11 | 88.6% | 73.5% | 68.2% | 45.0% |
| Checklists | 13 | 89.9% | 81.9% | 73.4% | 59.3% |

Note: Some legal tasks are associated with more than one category.
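For readers who want to work with the figures directly, the short sketch below recomputes GC AI's margin over ChatGPT in each category from the table above, along with the sum of the per-category task counts. The numbers are copied from the table. The variable names are illustrative only.

```python
# (tasks, GC AI pass rate, ChatGPT pass rate), copied from the table above.
rows = {
    "Drafting": (19, 87.6, 83.4),
    "Summarizing Documents": (12, 81.6, 77.5),
    "Contract Analysis": (13, 82.7, 72.8),
    "Legal Research": (23, 88.3, 75.6),
    "Legal Strategy": (16, 86.3, 84.5),
    "Risk Assessment": (26, 89.0, 84.2),
    "Comparison/Benchmarking": (9, 91.4, 84.7),
    "Extracting Information/Data": (24, 82.0, 76.9),
    "Regulatory Tracking": (11, 88.6, 73.5),
    "Checklists": (13, 89.9, 81.9),
}

# Margin in percentage points, rounded to avoid floating-point noise.
margins = {name: round(gc - gpt, 1) for name, (_, gc, gpt) in rows.items()}

widest = max(margins, key=margins.get)
narrowest = min(margins, key=margins.get)
category_task_total = sum(tasks for tasks, _, _ in rows.values())
overall_margin = round(86.8 - 79.8, 1)

print(widest, margins[widest])        # Regulatory Tracking 15.1
print(narrowest, margins[narrowest])  # Legal Strategy 1.8
print(category_task_total)            # 166
print(overall_margin)                 # 7.0
```

The per-category task counts sum to 166, not 100, which is consistent with the note that some tasks belong to more than one category.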

GC AI showed its largest advantage over the AI assistants on tasks that require specialized research abilities (regulatory tracking, legal research, and checklists), where the tool needs to locate, synthesize, and organize current regulatory requirements across multiple jurisdictions into clear, actionable outputs for lawyers. GC AI’s approach led to more accurate and better-sourced regulatory analysis, and its stronger performance came from presenting findings more effectively and grounding them in authoritative sources (e.g., government, court, and regulatory authorities) that lawyers trust and rely on.

In drafting and contract analysis, GC AI outperformed all three AI assistants by meaningful margins. While ChatGPT showed comparable performance on issue-spotting, GC AI demonstrated clear advantages in legal accuracy, extracting quoted text from documents (using Exact Quote), and drafting higher-quality responses and language. This was driven in part by GC AI’s approach to handling documents, where stronger document matching and analysis enabled it to produce output that was more precise, accurate, and useful.

The narrowest gap appeared in legal strategy and risk assessment, where ChatGPT came closest to GC AI's performance. Claude and Gemini trailed both GC AI and ChatGPT by substantially wider margins in these categories. Despite the closer overall scores, GC AI still showed stronger performance than the other assistants, ChatGPT included, in the quality and conciseness of its responses.

Across the full benchmark, GC AI consistently produced responses that were useful, professional, and actionable. It provided material, relevant information without unnecessary detail or verbosity, demonstrating a clear pattern that held even in areas where other AI assistants matched or slightly exceeded its analytical scores.  

What These Results Mean for In-House Counsel    

These results demonstrate that a purpose-built legal AI platform like GC AI delivers measurable advantages over general-purpose AI assistants. While those AI assistants may demonstrate strong general reasoning capabilities, the work of in-house lawyers requires AI tools to deliver more than just reasoning. They must find and synthesize the right authoritative sources, produce responses and outputs that demonstrate an understanding of legal documents, and analyze and interpret legal issues in a manner that reflects precise legal thinking. And, importantly, they must be able to present their responses and deliverables in a way that resonates with and is trusted by lawyers. 

As AI assistants and foundation models continue to improve, the baseline for legal reasoning will rise across all platforms. This benchmark points to areas where GC AI can, and will, make deeper investment, such as reasoning support, analytical frameworks, and continued expansion of its agentic capabilities, to deliver a more powerful platform for its users. GC AI is built for in-house legal teams, and these results provide both validation of that approach and a roadmap for where to invest next.

Notes

Examples of our in-house legal tasks and answer keys are available upon request. Please reach out to us here. We look forward to working with partners in future benchmarking efforts.

Frequently Asked Questions

Which AI assistant scored highest on the In-House Legal Bench?

GC AI scored 86.8% across 100 in-house legal tasks, ahead of ChatGPT (GPT-5.5) at 79.8%, Claude (Opus 4.7) at 68.4%, and Gemini (3.1 Pro) at 57.5%. GC AI led in every one of the 10 task categories, with the largest margins in research-intensive tasks like regulatory tracking and legal research.

How well did ChatGPT perform for in-house legal work?

ChatGPT (GPT-5.5) scored 79.8% on the In-House Legal Bench, the highest of the three general-purpose AI assistants tested. ChatGPT came closest to GC AI in legal strategy and risk assessment, and showed comparable performance on issue-spotting. GC AI outperformed ChatGPT in legal accuracy, extracting quoted text from documents, and drafting quality.

Can general-purpose AI like ChatGPT, Claude, or Gemini handle in-house legal work?

It depends on the task. On the In-House Legal Bench, ChatGPT scored 79.8%, Claude 68.4%, and Gemini 57.5%, compared with GC AI at 86.8%. The largest gaps appeared in research-intensive tasks like regulatory tracking, legal research, and checklists, where general-purpose models trailed in grounding answers in authoritative sources.

Which AI is best for legal research?

GC AI scored highest in research-related tasks (regulatory tracking, legal research, and checklists) where its margin over ChatGPT, Claude, and Gemini was largest. The advantage came from grounding answers in authoritative sources.

Which AI is best for contract analysis?

GC AI outperformed ChatGPT, Claude, and Gemini in contract analysis and drafting tasks. ChatGPT showed comparable performance on issue-spotting, while GC AI demonstrated clear advantages in legal accuracy, extracting quoted text from documents using Exact Quote, and drafting higher-quality language.

What categories of legal tasks did the benchmark test?

Ten categories: drafting, summarizing documents, contract analysis, legal research, legal strategy, risk assessment, comparison and benchmarking, extracting information and data, regulatory tracking, and checklists. The categories cover in-house counsel work across a range of legal topics, including commercial, employment, IP, privacy, securities, and regulatory law.

How were the responses graded?

Each task had an attorney-built answer key averaging 12 pass/fail criteria to assess the accuracy and quality of responses. An LLM-as-judge scored every response against the answer key, and the GC AI R&D team manually validated a sample to confirm the LLM Judge's alignment with human expert review.
