Caching LLM responses

This notebook demonstrates how to use Cassandra for a basic prompt/response cache.

Such a cache ensures that the LLM is invoked at most once for any given prompt, which saves both latency and token usage. Retrieval is based on an exact match of the prompt string, as the cells below show.
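To make the exact-match idea concrete before touching Cassandra, here is a minimal in-memory sketch. It is not how CassandraCache is implemented: `ExactMatchCache`, `cached_call`, `fake_llm` and the `"fake-model"` label are hypothetical names introduced only for illustration, and the sketch assumes entries are keyed on the prompt text together with a string identifying the model.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ExactMatchCache:
    """Toy in-memory stand-in for a prompt/response cache (hypothetical, not CassandraCache)."""

    def __init__(self) -> None:
        # Key: (prompt text, model identifier); value: the stored response.
        self._store: Dict[Tuple[str, str], str] = {}

    def lookup(self, prompt: str, llm_string: str) -> Optional[str]:
        # Exact match only: any difference in the prompt string is a miss.
        return self._store.get((prompt, llm_string))

    def update(self, prompt: str, llm_string: str, response: str) -> None:
        self._store[(prompt, llm_string)] = response


def cached_call(
    cache: ExactMatchCache,
    llm_string: str,
    llm_fn: Callable[[str], str],
    prompt: str,
) -> str:
    """Return the cached response if present, otherwise call the model and store the result."""
    hit = cache.lookup(prompt, llm_string)
    if hit is not None:
        return hit
    response = llm_fn(prompt)
    cache.update(prompt, llm_string, response)
    return response


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM; it records every invocation.
    calls.append(prompt)
    return f"answer #{len(calls)}"


cache = ExactMatchCache()
first = cached_call(cache, "fake-model", fake_llm, "How many eyes do spiders have?")
second = cached_call(cache, "fake-model", fake_llm, "How many eyes do spiders have?")
third = cached_call(cache, "fake-model", fake_llm, "How many eyes do spiders generally have?")
print(first, second, third, len(calls))
```

The second call is served from the dictionary, while the rephrased third prompt misses and reaches the stand-in model again.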

In [1]:

                
from langchain.cache import CassandraCache

In [2]:

                
from cqlsession import getCQLSession, getCQLKeyspace
astraSession = getCQLSession()
astraKeyspace = getCQLKeyspace()

Create a CassandraCache and configure it globally for LangChain:

In [3]:

                
import langchain
langchain.llm_cache = CassandraCache(
    session=astraSession,
    keyspace=astraKeyspace,
)

In [4]:

                
langchain.llm_cache.clear()

In [5]:

                
from langchain.llms import OpenAI
llm = OpenAI()

In [6]:

                
%%time
SPIDER_QUESTION_FORM_1 = "How many eyes do spiders have?"
# The first time, it is not yet in cache, so it should take longer
llm(SPIDER_QUESTION_FORM_1)

CPU times: user 33.8 ms, sys: 11.1 ms, total: 45 ms
Wall time: 2.97 s

Out[6]:

'\n\nMost spiders have eight eyes, although some have fewer or more.'

In [7]:

                
%%time
# This time we expect a much shorter answer time
llm(SPIDER_QUESTION_FORM_1)

CPU times: user 2.1 ms, sys: 3.24 ms, total: 5.35 ms
Wall time: 119 ms

Out[7]:

'\n\nMost spiders have eight eyes, although some have fewer or more.'

In [8]:

                
%%time
SPIDER_QUESTION_FORM_2 = "How many eyes do spiders generally have?"
# This will again take 1-2 seconds, being a different string
llm(SPIDER_QUESTION_FORM_2)

CPU times: user 13.8 ms, sys: 2.07 ms, total: 15.9 ms
Wall time: 1.93 s

Out[8]:

'\n\nSpiders typically have eight eyes, although some species may have fewer or more.'
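The timings above rely on the `%%time` cell magic, which is only available in IPython and Jupyter. In a plain script, a comparable hit-versus-miss measurement could be sketched with `time.perf_counter`. Everything here is illustrative: `slow_fake_llm` is a hypothetical stand-in that sleeps instead of calling a model, and a plain dictionary plays the role of the cache.

```python
import time
from typing import Callable, Dict, Tuple


def timed(fn: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Call fn(prompt) and return its result with the elapsed wall time in seconds."""
    start = time.perf_counter()
    result = fn(prompt)
    return result, time.perf_counter() - start


def slow_fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call: sleeps to mimic network latency.
    time.sleep(0.05)
    return "stub answer to: " + prompt


_responses: Dict[str, str] = {}


def cached_llm(prompt: str) -> str:
    # Exact-match lookup on the prompt string; only a miss reaches the slow call.
    if prompt not in _responses:
        _responses[prompt] = slow_fake_llm(prompt)
    return _responses[prompt]


_, miss_seconds = timed(cached_llm, "How many eyes do spiders have?")
_, hit_seconds = timed(cached_llm, "How many eyes do spiders have?")
_, rephrased_seconds = timed(cached_llm, "How many eyes do spiders generally have?")

print(f"miss: {miss_seconds:.4f} s, hit: {hit_seconds:.4f} s, rephrased: {rephrased_seconds:.4f} s")
```

As in the notebook cells, the repeated prompt returns quickly, while the rephrased prompt pays the full cost again because it is a different string.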

Stale entry control

Time-To-Live (TTL)

You can set a time-to-live (TTL) on the cache, so that each cached entry is evicted automatically once that time has elapsed.

Assigning the following cache to langchain.llm_cache makes entries expire one hour (ttl_seconds=3600) after they are stored:

In [9]:

                
cacheWithTTL = CassandraCache(
    session=astraSession,
    keyspace=astraKeyspace,
    ttl_seconds=3600,
)
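The observable effect of a time-to-live can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour, under the assumption that an entry becomes invisible once its TTL has elapsed; with Cassandra, expiry is handled by the database itself rather than by client code like this. `TTLCache` and the injectable `clock` argument are hypothetical names used so the example can advance time without actually waiting an hour.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """Toy in-memory cache whose entries expire after ttl_seconds (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # Key: prompt; value: (expiry timestamp, response).
        self._store: Dict[str, Tuple[float, str]] = {}

    def update(self, prompt: str, response: str) -> None:
        self._store[prompt] = (self._clock() + self._ttl, response)

    def lookup(self, prompt: str) -> Optional[str]:
        entry = self._store.get(prompt)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            # The entry has outlived its TTL: drop it and report a miss.
            del self._store[prompt]
            return None
        return response


# A fake clock (a one-element list holding the current time) makes the hour pass instantly.
now: List[float] = [0.0]
ttl_cache = TTLCache(ttl_seconds=3600, clock=lambda: now[0])
ttl_cache.update("How many eyes do spiders have?", "Most spiders have eight eyes.")

now[0] = 3599.0
before_expiry = ttl_cache.lookup("How many eyes do spiders have?")

now[0] = 3600.0
after_expiry = ttl_cache.lookup("How many eyes do spiders have?")

print(before_expiry, after_expiry)
```

One second before the hour is up the entry is still returned; at the one-hour mark the lookup reports a miss, which in the real setup would trigger a fresh LLM call.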

Manual cache eviction

Alternatively, you can invalidate cached entries one at a time. To do so, you must pass the same LLM that the entry is associated with:

In [10]:

                
%%time
llm(SPIDER_QUESTION_FORM_2)

CPU times: user 4.27 ms, sys: 1.65 ms, total: 5.92 ms
Wall time: 119 ms

Out[10]:

'\n\nSpiders typically have eight eyes, although some species may have fewer or more.'

In [11]:

                
langchain.llm_cache.delete_through_llm(SPIDER_QUESTION_FORM_2, llm)

In [12]:

                
%%time
llm(SPIDER_QUESTION_FORM_2)

CPU times: user 11.6 ms, sys: 7.81 ms, total: 19.4 ms
Wall time: 3.04 s

Out[12]:

'\n\nMost spiders have eight eyes, although there are some species that have fewer or more.'
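One way to see why the LLM must be supplied for a single-entry deletion: if entries are keyed on both the prompt and an identifier of the model, the prompt alone does not pin down which entry to remove. The sketch below assumes such a composite key; `delete_entry`, `"model-a"` and `"model-b"` are hypothetical, and the dictionary merely stands in for the Cassandra table.

```python
from typing import Dict, Tuple

PROMPT = "How many eyes do spiders generally have?"

# Toy store keyed on (prompt, model identifier); the same prompt is cached for two models.
store: Dict[Tuple[str, str], str] = {
    (PROMPT, "model-a"): "answer from model A",
    (PROMPT, "model-b"): "answer from model B",
}


def delete_entry(entries: Dict[Tuple[str, str], str], prompt: str, llm_string: str) -> bool:
    """Remove one entry; return True if it existed, False otherwise."""
    return entries.pop((prompt, llm_string), None) is not None


removed = delete_entry(store, PROMPT, "model-a")
removed_again = delete_entry(store, PROMPT, "model-a")

print(removed, removed_again, sorted(store))
```

Only the entry for the named model disappears. The answer cached for the other model is untouched, and deleting an entry that is already gone is a harmless no-op.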

Whole-cache deletion

As seen at the beginning of this notebook, you can also clear the cache entirely, evicting all stored entries for all models at once:

In [13]:

                
langchain.llm_cache.clear()