Semantic LLM caching¶
NOTE: this uses Cassandra's experimental "Vector Similarity Search" capability. At the moment, this is obtained by building and running an early alpha from a specific branch of the codebase.
The Cassandra-backed "semantic cache" for prompt responses is imported like this:
from langchain.cache import CassandraSemanticCache
As usual, a database connection is needed to access Cassandra. The following assumes that a vector-search-capable Cassandra cluster is running locally. Adjust as needed.
from cqlsession import getLocalSession, getLocalKeyspace
localSession = getLocalSession()
localKeyspace = getLocalKeyspace()
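In case you don't have such a helper module around, here is a minimal sketch of what it could look like (the contact point and keyspace name are placeholders to adapt to your setup):
from cassandra.cluster import Cluster

def getLocalSession():
    # Connect to a Cassandra node reachable on localhost (placeholder contact point)
    cluster = Cluster(['127.0.0.1'])
    return cluster.connect()

def getLocalKeyspace():
    # Placeholder keyspace name; it must already exist on the cluster
    return 'demo_keyspace'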
An embedding function and an LLM are needed:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
myEmbedding = OpenAIEmbeddings()
llm = OpenAI(model_name="text-davinci-002", n=2, best_of=2)
Note: for the time being you have to explicitly turn on this experimental flag on the cassio side:
import cassio
cassio.globals.enableExperimentalVectorSearch()
Create the cache¶
At this point you can instantiate the semantic cache:
cassSemanticCache = CassandraSemanticCache(
    session=localSession,
    keyspace=localKeyspace,
    embedding=myEmbedding,
)
Make sure the cache starts empty with:
cassSemanticCache.clear_through_llm(llm=llm)
Configure the cache at a LangChain global level:
import langchain
langchain.llm_cache = cassSemanticCache
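By the way, once the global cache is in place, an individual LLM instance can still opt out of it through the cache constructor flag of LangChain LLMs. A quick sketch:
# An LLM created with cache=False bypasses langchain.llm_cache entirely
uncachedLlm = OpenAI(model_name="text-davinci-002", cache=False)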
Use the cache¶
Now try submitting a few prompts to the LLM and pay attention to the response times.
If the LLM is actually run, they should be on the order of a few seconds; but in case of a cache hit, it will be way less than a second.
Notice that you get a cache hit even after rephrasing the question.
%%time
SPIDER_QUESTION_FORM_1 = "How many eyes do spiders have?"
# A new question should take long
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 24.6 ms, sys: 0 ns, total: 24.6 ms Wall time: 1.27 s
'\n\nSpiders have eight eyes.'
%%time
# Second time, very same question, this should be quick
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 15.7 ms, sys: 0 ns, total: 15.7 ms Wall time: 114 ms
'\n\nSpiders have eight eyes.'
%%time
SPIDER_QUESTION_FORM_2 = "How many eyes does a spider generally have?"
# Just a rephrasing: but it's the same question, so ...
llm(SPIDER_QUESTION_FORM_2)
CPU times: user 18.2 ms, sys: 0 ns, total: 18.2 ms Wall time: 571 ms
'\n\nSpiders have eight eyes.'
Time for a really new question:
%%time
LOGIC_QUESTION_FORM_1 = "Is absence of proof the same as proof of absence?"
# A totally new question
llm(LOGIC_QUESTION_FORM_1)
CPU times: user 31.6 ms, sys: 0 ns, total: 31.6 ms Wall time: 1.26 s
'\n\nNo, absence of proof is not the same as proof of absence.'
%%time
SPIDER_QUESTION_FORM_3 = "How many eyes are on the head of a typical spider?"
# Trying to catch the cache off-guard :)
llm(SPIDER_QUESTION_FORM_3)
CPU times: user 30.3 ms, sys: 327 µs, total: 30.7 ms Wall time: 573 ms
'\n\nSpiders have eight eyes.'
%%time
LOGIC_QUESTION_FORM_2 = "Is it true that the absence of a proof equates the proof of an absence?"
# Switching to the other question again
llm(LOGIC_QUESTION_FORM_2)
CPU times: user 20.8 ms, sys: 0 ns, total: 20.8 ms Wall time: 629 ms
'\n\nNo, absence of proof is not the same as proof of absence.'
Additional options¶
When creating the semantic cache, you can specify a few other options, such as the metric used to calculate the similarity and the number of entries to retrieve in the ANN step (i.e. those on which the exact requested metric is then computed for the final filtering). Here is an example that uses the L2 metric:
anotherCassSemanticCache = CassandraSemanticCache(
    session=localSession,
    keyspace=localKeyspace,
    embedding=myEmbedding,
    distance_metric='l2',
    score_threshold=0.35,
    num_rows_to_fetch=12,
)
This cache builds on the same database table as the previous one, as can be seen e.g. with:
lookup = anotherCassSemanticCache.lookup_with_id_through_llm(
    LOGIC_QUESTION_FORM_2,
    llm,
)
if lookup:
    docId, response = lookup
    print(docId)
    print(response)
else:
    print('No match.')
77add13036bcaa23c74ebf2ab2c56441
[Generation(text='\n\nNo, absence of proof is not the same as proof of absence.', generation_info=None), Generation(text='\n\nNo, absence of proof is not the same as proof of absence.', generation_info=None)]
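Conversely, as a quick consistency check (a sketch reusing the objects above), a lookup through the first cache object should surface the very same stored document:
lookupBack = cassSemanticCache.lookup_with_id_through_llm(
    LOGIC_QUESTION_FORM_2,
    llm,
)
if lookupBack:
    # Both caches write to the same table, hence the same document id
    docIdBack, _ = lookupBack
    print(docIdBack)
else:
    print('No match.')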
Stale entry control¶
Time-To-Live (TTL)¶
You can configure a time-to-live (TTL) property on the cache, so that cached entries are automatically evicted after a certain time.
Setting langchain.llm_cache to the following will have the effect that entries vanish fifteen seconds after insertion:
cacheWithTTL = CassandraSemanticCache(
    session=localSession,
    keyspace=localKeyspace,
    embedding=myEmbedding,
    ttl_seconds=15,
)
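To see the TTL in action, here is a sketch (reusing the objects defined above) that caches an answer and asks again after the expiry window:
import time

langchain.llm_cache = cacheWithTTL
llm(SPIDER_QUESTION_FORM_1)    # this run populates the TTL'd cache
time.sleep(16)                 # wait for the 15-second TTL to elapse
llm(SPIDER_QUESTION_FORM_1)    # the entry has expired, so the LLM runs again

# Restore the cache used in the rest of this walkthrough
langchain.llm_cache = cassSemanticCache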
Manual cache eviction¶
Alternatively, you can invalidate individual entries one at a time, just like you saw for the exact-match CassandraCache.
But this is an index based on sentence similarity, so this time the procedure has two steps: first, a lookup to find the id of the matching document:
lookup = cassSemanticCache.lookup_with_id_through_llm(SPIDER_QUESTION_FORM_1, llm)
if lookup:
    docId, response = lookup
    print(docId)
else:
    print('No match.')
0a1339bc659790da078a4352c05bf422
You can see that querying with another form of the "same" question results in the same id:
lookup2 = cassSemanticCache.lookup_with_id_through_llm(SPIDER_QUESTION_FORM_2, llm)
if lookup2:
    docId2, response2 = lookup2
    print(docId2)
else:
    print('No match.')
0a1339bc659790da078a4352c05bf422
and second, the document id is used in the actual cache eviction (again, you have to additionally provide the LLM):
cassSemanticCache.delete_by_document_id_through_llm(docId, llm)
As a check, try asking that question again:
%%time
# The entry was just evicted, so this should take long again
llm(SPIDER_QUESTION_FORM_1)
CPU times: user 21.8 ms, sys: 630 µs, total: 22.4 ms Wall time: 774 ms
'\n\nSpiders have eight eyes.'
Whole-cache deletion¶
Lastly, as you have seen earlier, you can empty the cache entirely, for a given LLM, with:
cassSemanticCache.clear_through_llm(llm=llm)