In this demo, you'll use DeepEval, a popular open-source LLM evaluation framework. It has a simple, intuitive set of APIs you'll soon use to assess SportsBuddy. Open your JupyterLab instance and start with the following imports:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)
contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()
Next, you'll create a test case. A DeepEval test case is as simple as creating an instance of LLMTestCase and passing your relevant content to it. Because you'll be evaluating SportsBuddy, open that chapter's starter project in Jupyter Lab. There, you'll see the question and response. Consult the 2024 Summer Olympics Wikipedia page to find the retrieval context relevant to the question. Back in your Python file, create the test case:
test_case = LLMTestCase(
    input="Which programmes were dropped from the 2024 Olympics?",
    actual_output=(
        "Four events were dropped from weightlifting for the 2024 Olympics. "
        "Additionally, in canoeing, two sprint events were replaced with two "
        "slalom events. The overall event total for canoeing remained at 16."
    ),
    expected_output="Four events were dropped from weightlifting.",
    retrieval_context=[
        "Four events were dropped from weightlifting."
    ]
)
An LLMTestCase requires four parts: your input, the RAG's output, your expected output (which gives DeepEval a good reference point), and a retrieval context (which gives DeepEval a good idea of the text or content your RAG used to produce its answer). Pretty straightforward. Run the next cell to put the three metrics to work for evaluation:
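The output below comes from running DeepEval's evaluate function over the test case. Assuming the metric instances created earlier (and an OpenAI API key in your environment, since the default evaluation model is gpt-4o), the cell would look something like this:

```python
from deepeval import evaluate

# Run the three contextual metrics against the single test case.
# Requires an OpenAI API key, since the default evaluation model is gpt-4o.
evaluate(
    test_cases=[test_case],
    metrics=[contextual_precision, contextual_recall, contextual_relevancy]
)
```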
======================================================================
Metrics Summary
- ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
context directly answers the question by stating 'Four events
were dropped from weightlifting.' Great job!, error: None)
- ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 1.00 because the
expected output perfectly matches the content in the first node
of the retrieval context. Great job!, error: None)
- ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False,
evaluation model: gpt-4o, reason: The score is 0.00 because the
context only mentions 'Four events were dropped from weightlifting'
without specifying which programmes or providing a comprehensive
list of dropped programmes from the 2024 Olympics., error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Contextual Precision: 100.00% pass rate
Contextual Recall: 100.00% pass rate
Contextual Relevancy: 0.00% pass rate
======================================================================
It looks like a lot, but it's simple. The Metrics Summary section shows the type of metric, the parameters you used, the score, and the reason for the score. Here's what each item means:
threshold: A float value that defaults to 0.5. Any score below it is a fail, and any score at or above it is a pass.
strict: A Boolean value that forces a binary score: a 1 for pass or a 0 for fail. When set to False, the score can range between 0 and 1. It's False by default. When True, it overrides the threshold, setting it to 1.
evaluation model: Defaults to gpt-4o. This refers to the LLM DeepEval uses to evaluate the metric. You can specify your own custom LLM if you wish.
reason: An explanation for the given score.
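To make the pass/fail logic concrete, here's a small plain-Python sketch; it's an illustration of how threshold and strict interact, not DeepEval's actual implementation:

```python
def passes(score: float, threshold: float = 0.5, strict: bool = False) -> bool:
    """Illustrative pass/fail check mirroring the fields described above."""
    if strict:
        # Strict mode overrides the threshold, setting it to 1:
        # only a perfect score passes.
        threshold = 1.0
    return score >= threshold

print(passes(1.0))               # True: a perfect score always passes
print(passes(0.4))               # False: below the default 0.5 threshold
print(passes(0.9, strict=True))  # False: strict mode demands a 1.0
```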
From the results above, precision and recall look great. But contextual relevancy doesn't. That means your retrieved context didn't have enough facts for your RAG to give you a detailed response when your question needed more detail. In this case, it might be fine: the final context indeed has very little information about the question. And the question contains "programmes" when the right terminology should be "events." That immediately gives a clue as to which part of your RAG might need more attention.
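Conceptually, contextual relevancy rewards retrieval context whose statements actually bear on the input. A rough plain-Python illustration, with hand-labelled verdicts standing in for the evaluation LLM's judgments:

```python
def contextual_relevancy_score(verdicts):
    """Fraction of retrieval-context statements judged relevant to the input.

    In DeepEval, an evaluation LLM produces these relevance verdicts;
    here they're hand-labelled for illustration.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# The lone retrieved statement counts events but never names the dropped
# programmes, so the judge marked it irrelevant to the question:
print(contextual_relevancy_score([False]))  # 0.0
```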
Now, on to some generation metrics. For the generation component, you'll measure the answer relevancy and faithfulness metrics. Return to your Python file and add the following:
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase
from deepeval import evaluate
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
evaluate(
    test_cases=[test_case],
    metrics=[answer_relevancy, faithfulness]
)
Run the script and check the results:
=====================================================================
Metrics Summary
- ✅ Answer Relevancy (score: 0.6666666666666666, threshold: 0.5,
strict: False, evaluation model: gpt-4o, reason: The score is 0.67
because while the response contains relevant information, it veers
off-topic by discussing the overall event total for canoeing,
which does not directly answer the specific question about which
programmes were dropped from the 2024 Olympics., error: None)
- ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation
model: gpt-4o, reason: The score is 1.00 because there are no
contradictions, indicating a perfect alignment between the actual
output and the retrieval context. Great job maintaining accuracy!,
error: None)
For test case:
- input: Which programmes were dropped from the 2024 Olympics?
- actual output: Four events were dropped from weightlifting for
the 2024 Olympics. Additionally, in canoeing, two sprint events
were replaced with two slalom events. The overall event total
for canoeing remained at 16.
- expected output: Four events were dropped from weightlifting.
- context: None
- retrieval context: ['Four events were dropped from weightlifting.']
======================================================================
Overall Metric Pass Rates
Answer Relevancy: 100.00% pass rate
Faithfulness: 100.00% pass rate
======================================================================
For answer relevancy, you may wonder about the two-thirds score. DeepEval does a great job by giving you the reason for that score. It notes that although the answer is okay, it introduced extra information that slightly deviated from the question. It could have scored better if it stayed on the topic or, better still, left out the extra information.
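That 0.67 lines up with a statement-level view of the answer. As a rough plain-Python illustration, again with hand-labelled verdicts in place of the evaluation LLM:

```python
def answer_relevancy_score(verdicts):
    """Fraction of statements in the answer judged relevant to the question."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# The answer splits into three statements; the canoeing event total is
# judged off-topic for "which programmes were dropped":
verdicts = [True, True, False]
print(round(answer_relevancy_score(verdicts), 2))  # 0.67
```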
SportsBuddy, however, appears to be faithful, at least for this test. And these techniques are easy: the metrics are intuitive enough and simple. Still, you'll have to run a good number of tests to get a good overview of the state of your RAG.

Up next, you'll learn about query analysis.
This content was released on Nov 12, 2024. The official support period is 6 months from this date.
Demonstrate how to evaluate a RAG app.