Let’s see how to implement a custom file parser to extend the set of file types that can be uploaded to the Cat’s memory.
By default, the RabbitHole, the component that manages file uploads to the Cat’s memory, only supports the .txt, .md and .pdf file formats. However, we may be interested in uploading other files. The Cat parses files based on their MIME type. This means that when we upload a new file to the memory, the RabbitHole checks the file extension and selects the proper parser to process it. Hence, let’s see how to implement a custom parser and how to register it with the proper hook.
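To make the link between file extensions and MIME types concrete, here is a minimal sketch using Python’s standard mimetypes module. It only illustrates the mapping itself: how the RabbitHole resolves the type internally is not shown here, and both the helper guess_mime_type and the file names are made up for the example.

```python
import mimetypes
from typing import Optional


def guess_mime_type(file_name: str) -> Optional[str]:
    """Return the MIME type guessed from the file extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type


# Hypothetical file names, used only to illustrate the extension -> MIME type mapping
for file_name in ["notes.txt", "paper.pdf", "data.json"]:
    print(f"{file_name} -> {guess_mime_type(file_name)}")
```

The string returned for data.json is the same "application/json" key used later when registering the parser.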
The Custom Parser
A parser is an abstraction that converts raw data into pieces of text and metadata. Let’s say we want to parse JSON files. Here is what a basic parser would look like:
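import json
from typing import Iterator

# Import paths may differ across LangChain versions
from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders.schema import Blob


class JSONParser(BaseBlobParser):
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # The Blob can be treated as a bytes stream
        with blob.as_bytes_io() as file:
            json_to_dict = json.load(file)
        # Serialize the dictionary back to a string
        dict_to_text = json.dumps(json_to_dict)
        yield Document(page_content=dict_to_text, metadata={})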
What have we done? We defined a Python class that inherits from the LangChain abstract class BaseBlobParser. More specifically, we have overridden the lazy_parse method, which is called when parsing a file. This method always receives a Blob; you can find more information about blobs here. In this case, we treat the blob as if we were opening a JSON file and load it as a Python dictionary. Finally, we yield a single LangChain Document, so the returned iterator contains only one item. The Document is a datatype that stores the text in the page_content attribute and a dictionary of metadata in the metadata attribute.
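To see what ends up in page_content, the load-and-serialize step can be tried on its own, without LangChain. The following is a minimal sketch that mirrors the body of lazy_parse on a plain bytes stream; the function name json_bytes_to_text and the sample payload are assumptions made for illustration.

```python
import io
import json


def json_bytes_to_text(raw: bytes) -> str:
    """Load JSON from a bytes stream and serialize it back to a string."""
    # io.BytesIO stands in for the stream returned by blob.as_bytes_io()
    with io.BytesIO(raw) as file:
        json_to_dict = json.load(file)
    return json.dumps(json_to_dict)


# Hypothetical payload with irregular whitespace
raw = b'{"name": "Cheshire",   "tags": ["cat", "ai"]}'
print(json_bytes_to_text(raw))
```

The round trip normalizes whitespace, so the stored text does not depend on how the original file was formatted. Note that json.dumps escapes non-ASCII characters by default; passing ensure_ascii=False keeps them readable.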
The Hook
At this point, all that is left is to register the new parser in the RabbitHole. For this purpose, we need the rabbithole_instantiates_parsers hook (more info in this table). The hook takes a dictionary as input, where the keys are MIME types and the values are the associated parsers. Here is the code:
from cat.mad_hatter.decorators import hook


@hook
def rabbithole_instantiates_parsers(file_handlers: dict, cat) -> dict:
    # Map the JSON MIME type to the custom parser defined above
    file_handlers["application/json"] = JSONParser()
    return file_handlers
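The idea behind such a dictionary of handlers can be sketched in plain Python. This is not the RabbitHole’s actual code: the parsers are stand-in functions rather than LangChain parser objects, and the names add_json_handler and parse are hypothetical.

```python
import json
from typing import Callable, Dict


def parse_text(raw: bytes) -> str:
    """Stand-in parser for plain text."""
    return raw.decode("utf-8")


def parse_json(raw: bytes) -> str:
    """Stand-in parser for JSON: load, then serialize back to a string."""
    return json.dumps(json.loads(raw))


def add_json_handler(file_handlers: Dict[str, Callable[[bytes], str]]) -> Dict[str, Callable[[bytes], str]]:
    """Plays the role of the hook: receive the dictionary, extend it, return it."""
    file_handlers["application/json"] = parse_json
    return file_handlers


# Start from an assumed default registry and let the "hook" extend it
file_handlers = add_json_handler({"text/plain": parse_text})


def parse(raw: bytes, mime_type: str) -> str:
    """Pick the parser registered for the MIME type and run it."""
    if mime_type not in file_handlers:
        raise ValueError(f"Unsupported MIME type: {mime_type}")
    return file_handlers[mime_type](raw)


print(parse(b'{"a":1}', "application/json"))
```

Because the hook returns the same dictionary it receives, the default handlers stay in place and only the new key is added.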
Conclusions
We did it, as simple as that. In summary, we implemented a custom parser that reads a stream of bytes from a blob (as we would from a file) and loads the JSON content into a Python dictionary. Then, we serialized the dictionary and stored it in the page_content attribute of the Document datatype. Finally, with the proper hook, we registered the custom parser under the "application/json" key, which is the MIME type for JSON files.
If you are interested in tutorials on other hooks, take a look at how to change the Cat’s prompt.
Nicola Corbellini is a PostDoc Researcher at the InfoMus Lab, Casa Paganini, within the DIBRIS department of the University of Genoa. His research focuses on Social Signal Processing, with interests in Hybrid Intelligence and Multimodal Human-Computer Interaction.
Beyond academia, Nicola works as a software developer and has a keen interest in interactive generative visual and new media art.