In this demo, you’ll use DeepEval, a popular open-source LLM evaluation framework. It has a simple and intuitive set of APIs you’ll soon use to assess SportsBuddy. Open your Jupyter Lab instance with the following command:
jupyter lab
Install DeepEval with:
pip install -U deepeval
You'll first test the retrieval component. Create a new Python file called deepeval-sportsbuddy-test.py. Import DeepEval classes for contextual precision, recall, and relevancy:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)
contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
Next is to create a test case. A DeepEval test case is as simple as creating an instance of LLMTestCase and running your desired metrics on it. Because you'll be evaluating SportsBuddy, open this lesson's starter project in Jupyter Lab. Here, you'll see the question and response. Visit the 2024 Summer Olympics Wikipedia page to get the retrieved context relevant to the question. Back in your Python file, create the test case:
test_case = LLMTestCase(
    input="Which programmes were dropped from the 2024 Olympics?",
    actual_output=(
        "Four events were dropped from weightlifting for the 2024 Olympics. "
        "Additionally, in canoeing, two sprint events were replaced with two "
        "slalom events. The overall event total for canoeing remained at 16."
    ),
    expected_output="Four events were dropped from weightlifting.",
    retrieval_context=[
        """Four events were dropped from weightlifting."""
    ]
)
An LLMTestCase requires your query, the RAG's output, your expected output to give DeepEval a good reference point, and a retrieval context to give DeepEval a good idea of the kind of content your RAG used to produce its answer. Pretty straightforward. Finish the test file by running the three metrics for evaluation:
evaluate(
    test_cases=[test_case],
    metrics=[contextual_precision, contextual_recall, contextual_relevancy]
)
Return to your terminal and run the file with the following command:
python deepeval-sportsbuddy-test.py
Here's what the test returns:
======================================================================
Metrics Summary
- ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
context directly answers the question by stating 'Four events
were dropped from weightlifting.' Great job!, error: None)
- ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
expected output perfectly matches the content in the first node
of the retrieval context. Great job!, error: None)
- ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 0.00 because the
context only mentions 'Four events were dropped from weightlifting'
without specifying which programmes or providing a comprehensive
list of dropped programmes from the 2024 Olympics., error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Contextual Precision: 100.00% pass rate
Contextual Recall: 100.00% pass rate
Contextual Relevancy: 0.00% pass rate
======================================================================
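The Overall Metric Pass Rates section aggregates across every test case you evaluated: for each metric, it's the percentage of test cases whose score met that metric's threshold. Here's a rough plain-Python illustration of that arithmetic (a hypothetical helper, not DeepEval's actual implementation):

```python
# Illustrative only: how a per-metric pass rate could be derived.
# Each entry pairs a metric's score with its threshold, one per test case.
results = {
    "Contextual Precision": [(1.0, 0.5)],
    "Contextual Recall": [(1.0, 0.5)],
    "Contextual Relevancy": [(0.0, 0.5)],
}

def pass_rate(score_threshold_pairs):
    # A test case passes a metric when its score meets the threshold.
    passed = sum(1 for score, threshold in score_threshold_pairs
                 if score >= threshold)
    return 100.0 * passed / len(score_threshold_pairs)

for metric, pairs in results.items():
    print(f"{metric}: {pass_rate(pairs):.2f}% pass rate")
```

With a single test case, each rate is all-or-nothing, which is why contextual relevancy shows 0.00% here; across many test cases the rates become more informative.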
It might seem like a lot, but it's simple. The Metrics Summary section shows the type of metric, the parameters you used, the score, and the reason for the score. Here's what each item means:
score: The metric's score. It ranges from 0 to 1 and is affected by the threshold and strict parameters.
threshold: A float value that defaults to 0.5. Any score below it is a fail, and any score above it is a pass.
strict: A Boolean value that forces a binary score. That's a 1 for pass or a 0 for fail. When set to False, the score can range between 0 and 1. It's False by default. When True, it overrides the threshold, setting it to 1.
evaluation model: Defaults to gpt-4o. This refers to the LLM DeepEval uses to evaluate the metric. You can provide your custom LLM if you wish.
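To make the interplay between score, threshold, and strict concrete, here's a tiny sketch of the pass/fail decision. It's plain Python mirroring the behavior described above, not DeepEval's internals:

```python
def is_successful(score, threshold=0.5, strict=False):
    # Strict mode overrides the threshold to 1, so only a perfect
    # score passes; otherwise any score >= threshold passes.
    effective_threshold = 1.0 if strict else threshold
    return score >= effective_threshold

print(is_successful(0.67))                # True: 0.67 >= 0.5
print(is_successful(0.67, strict=True))   # False: strict demands 1.0
print(is_successful(0.4, threshold=0.3))  # True: custom threshold
```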
Now, on to some generation metrics. For the generation component, you'll measure the answer relevancy and faithfulness metrics. Return to your Python file and add the following:
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
evaluate(
    test_cases=[test_case],
    metrics=[answer_relevancy, faithfulness]
)
Rerun the script and check the result:
=====================================================================
Metrics Summary
- ✅ Answer Relevancy (score: 0.6666666666666666, threshold: 0.5,
strict: False, evaluation model: gpt-4o, reason: The score is 0.67
because while the response contains relevant information, it veers
off-topic by discussing the overall event total for canoeing,
which does not directly answer the specific question about which
programmes were dropped from the 2024 Olympics., error: None)
- ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation
model: gpt-4o, reason: The score is 1.00 because there are no
contradictions, indicating a perfect alignment between the actual
output and the retrieval context. Great job maintaining accuracy!,
error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Answer Relevancy: 100.00% pass rate
Faithfulness: 100.00% pass rate
======================================================================
For answer relevancy, the answer earned about two-thirds of the full score. DeepEval does a great job by telling you the reason for that score. It says that although the answer is okay, it introduced extra information that slightly deviated from the question. It would have scored better if it had stayed on the topic at hand or, better still, left out the extra information.
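The 0.67 is easier to interpret once you know how answer relevancy is typically computed: the evaluation model splits the answer into statements, judges each one relevant or irrelevant to the input, and scores the relevant fraction. Here's a hand-labelled sketch of that arithmetic; the statement labels are assumed for illustration, whereas DeepEval uses the evaluation LLM for both the splitting and the judging:

```python
# Hand-labelled illustration of a statement-level relevancy score.
# The question asked which programmes were dropped; the canoeing
# total is assumed off-topic here.
statements = [
    ("Four events were dropped from weightlifting for the 2024 Olympics.", True),
    ("In canoeing, two sprint events were replaced with two slalom events.", True),
    ("The overall event total for canoeing remained at 16.", False),
]

relevant = sum(1 for _, is_relevant in statements if is_relevant)
score = relevant / len(statements)
print(f"Answer relevancy: {score:.2f}")  # 2 of 3 statements -> 0.67
```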
SportsBuddy, however, appears to be faithful, at least for this test. And those judgments are fair. You'll have to run a good number of tests to get a good overview of the state of your RAG. These tests look intuitive enough and simple.
Up next, you'll learn about query analysis.
This content was released on Nov 12 2024. The official support period is six months from this date.
Demonstrate how to evaluate a RAG app.