
Benchmarking Engineer (part-time)

Remote, USA Part-time Posted 2025-05-22



About the Role

We are looking for a part-time engineer to maintain and improve our public LLM benchmarks at https://www.vals.ai. We expect you to have strong technical abilities, as well as a curiosity and interest in large language models...

Key responsibilities include:
• Creating new, private datasets in conjunction with our data annotators and our partner groups. You will have significant resources at your disposal to create these datasets.
• Running our existing benchmarks against new models as they come out, and compiling the results.
• Writing the free-text analyses of the raw quantitative results. Key questions that we currently seek to answer include: Which models make the most sense at a given price point? How has performance changed between different model generations? What kinds of errors are the models liable to make?
• Creating Twitter and LinkedIn posts that describe our findings.
• Writing and maintaining the scripts that we use to run benchmarks against our datasets. This work is critical to producing new results efficiently as models are released.

You will have significant ownership of our benchmarking site, which was featured in Bloomberg. You will also have the agency to propose new benchmarks, based on your own ideas and hypotheses.

Requirements
• Deep experience with Python, which is the primary language you will work in.
• Strong communication and writing skills. It's essential that we distill our technical findings into easily consumable reports for non-technical audiences.
• Experience working in teams. This includes working in development sprints, knowledge of best practices in working with Git, and reviewing pull requests. You can provide input to others and equally receive/integrate feedback.
• ~20 hours a week of availability. We expect work to be a bit spiky -- there will be more work to be done when new models come out.

Nice to haves
• Familiarity with LLM methods and developments. Innate interest in the space will make it easier to build a valuable product.
• Experience in an ML research setting (or experience with data science). We maintain scientific rigor in our benchmarks to ensure our evaluations are fair and unbiased.

About Us

Measuring model ability is the most challenging part of creating applications that are capable of automating any given part of the economy. There are no good techniques or benchmarks for evaluating LLM performance on business-relevant tasks, so adoption for enterprise production settings has been limited (see Wittgenstein’s ruler).

This problem materializes in each place where LLMs have potential: for AI tool companies trying to understand whether the product they are building will satisfy customer demand, for enterprises determining how feasible models and vendors are when making purchasing decisions, and for researchers who need a north star toward which to expand model ability.

Today, answering these questions amounts to hiring a human review team to manually evaluate model outputs. This is prohibitively expensive and slow.

Vals AI is building the enterprise benchmark of LLMs and LLM apps on real-world business tasks. In doing so, we are creating the infrastructure and certification to automatically audit LLM applications, verifying they are ready for consumption.

See our benchmarks and launch announcement in Bloomberg. We aim to build the barometer for whether AI is useful, and in doing so, accelerate the automation of all knowledge work.

What we are building:

Our core technology enables us to review + automatically audit LLM applications in high value industries (legal, insurance, finance, healthcare). With this and our own data, we maintain a public benchmark of the major LLMs on enterprise tasks. Our success will be based on three components:
• Our evaluation performs at human-level accuracy on the relevant axes for each industry/application.
• Our platform has an intuitive interface that acts as a shared platform between human reviewers and engineers.
• We become the industry-standard benchmark, maintaining a loss-leading effort by publishing free reports and collaborating with credible data partners.

To achieve each of these, we are looking for machine learning engineers (Head of AI, and Members of the Technical Staff) to develop novel evaluation techniques, strong designers and front-end engineers (Founding Product Engineer) to contribute to the platform, and a tenacious operator to write reports and maintain our social media (this role).

What we offer:
• Highly competitive salary. Excellence is well rewarded.
• Optional ability to work in our SF office.
• In the office: lunch and dinner provided, plus free snacks, coffee, and drinks.
• Opportunity to grow into a full-time role.

About us:

Founding team: The core methodology behind this platform comes from NLP evaluation research we did at Stanford. We raised a 5M seed from some of the top institutional and angel investors in the valley. Our team has prior work experience at NVIDIA, Meta, Microsoft, Palantir and HRT. Collectively, we have over 300 citations in our published work.

Tech stack: Our frontend is built in React with TSX. We use Django as our back-end framework. All of the infra is on AWS.

What we’re looking for:
• Intelligence is more important than a good-looking resume. Industry experience and pedigree are valuable only insofar as they are a proxy for talent itself.
• Ownership to create products. We don’t have the scale or time to actively “manage” every project or task. Working in a small, talent-dense team, we expect everyone to show initiative to build where it’s needed, not where it’s asked. We strive for autonomy over consensus.
• Intensity. The LLM landscape is constantly changing. Foundation model labs are continuously pushing the frontier, enterprises are seeing massive pressure to adopt technology, and startups are hungry to chase the white space. The unicorn companies that will emerge from this technology shift are being built now. Those that win will have an incredibly high speed of execution.
• See solutions, not problems. We’re not looking for people who pass hard problems on to others or admit defeat, but for people who see the opportunity to craft solutions at each juncture.

Further Reading:
• Hugging Face blog on evaluation
• Anthropic’s blog on challenges in evaluation
• New York Times article on issues in benchmarking
• Stanford HAI report showing hallucinations in legal tech tools

Referral Bonus

The referral bonus does not apply to this role, as it is part-time.


