How to Implement a Custom File Parser

Let’s see how to implement a custom file parser to extend the set of file types that can be uploaded to the Cat’s memory.

By default, the RabbitHole, the component that manages file uploads to the Cat’s memory, only supports the .txt, .md and .pdf file formats. However, we may want to upload other kinds of files. The Cat parses files based on their MIME type: when we upload a new file to the memory, the RabbitHole checks the file’s type and selects the matching parser to process it. So let’s see how to implement a custom parser and how to register it with the proper hook.
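To make the relationship between file extensions and MIME types concrete, here is a minimal sketch using Python’s standard mimetypes module. It only illustrates the mapping: that the RabbitHole detects types this way is an assumption, and guess_mime and the file names are hypothetical.

```python
import mimetypes
from typing import Optional


def guess_mime(file_name: str) -> Optional[str]:
    """Return the MIME type guessed from the file extension, or None if unknown."""
    # guess_type returns a (type, encoding) tuple; only the type is needed here.
    mime_type, _ = mimetypes.guess_type(file_name)
    return mime_type


# Hypothetical file names, used only for illustration.
for name in ("notes.txt", "paper.pdf", "data.json"):
    print(f"{name} -> {guess_mime(name)}")
```

The string returned for a .json file, "application/json", is the same kind of key the parsers are registered under later in this tutorial.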

The Custom Parser

A parser is an abstraction that converts raw data into pieces of text and metadata. Let’s say we want to parse JSON files. Here is what a basic parser would look like:

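import json
from typing import Iterator

# Import paths may vary with the installed LangChain version
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document


class JSONParser(BaseBlobParser):

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # The Blob can be treated as a bytes stream
        with blob.as_bytes_io() as file:
            json_to_dict = json.load(file)

        # Serialize the parsed content back to a JSON string
        dict_to_text = json.dumps(json_to_dict)
        yield Document(page_content=dict_to_text, metadata={})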

What have we done? We defined a Python class that inherits from the LangChain abstract class BaseBlobParser and overrides its lazy_parse method, which is called when a file is parsed. This method always receives a Blob; more information about blobs is available in the LangChain documentation. Here we treat the blob like an open JSON file and load its content into a Python dictionary. Finally, the method yields a single LangChain Document, a datatype that stores the text in the page_content attribute and a dictionary of metadata in the metadata attribute.
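The parser above stores the whole file in a single Document. As one possible variation, the sketch below splits a JSON object into one piece of text per top-level key, each with a small metadata dictionary. It uses only the standard library; iter_json_chunks and the "key" metadata field are hypothetical names, not part of the Cat or LangChain.

```python
import io
import json
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_json_chunks(file: BinaryIO) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (text, metadata) pairs, one per top-level key of a JSON object."""
    data = json.load(file)
    if not isinstance(data, dict):
        # Arrays and scalars are kept as a single chunk with empty metadata.
        yield json.dumps(data), {}
        return
    for key, value in data.items():
        # Each value is serialized on its own and tagged with its key.
        yield json.dumps(value), {"key": key}


# Usage with an in-memory byte stream, similar to the one opened in the parser.
sample = io.BytesIO(b'{"name": "Cheshire", "tags": ["cat", "ai"]}')
for text, metadata in iter_json_chunks(sample):
    print(metadata, text)
```

Inside lazy_parse, each pair could be wrapped as Document(page_content=text, metadata=metadata) and yielded, so that every top-level key ends up as its own document.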

The Hook

At this point, all that is left is to register the new parser in the RabbitHole. For this purpose, we need the rabbithole_instantiates_parsers hook (more info in the hooks table). The hook receives a dictionary whose keys are MIME types and whose values are the associated parsers. Here is the code:

from cat.mad_hatter.decorators import hook

@hook
def rabbithole_instantiates_parsers(file_handlers: dict, cat) -> dict:
    # Map the JSON MIME type to the custom parser defined above
    file_handlers["application/json"] = JSONParser()
    return file_handlers
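To see why adding a key to the dictionary is enough, here is a simplified, standard-library-only simulation of a lookup by MIME type. It is not the Cat’s actual code: register_json and pick_parser are hypothetical helpers, and the strings are placeholders standing in for real parser instances.

```python
from typing import Dict, Optional

# Placeholder strings stand in for real parser instances (names are illustrative).
file_handlers: Dict[str, str] = {
    "text/plain": "TextParser",
    "text/markdown": "TextParser",
    "application/pdf": "PDFParser",
}


def register_json(handlers: Dict[str, str]) -> Dict[str, str]:
    """Mimic the hook body: add an entry and return the same dictionary."""
    handlers["application/json"] = "JSONParser"
    return handlers


def pick_parser(handlers: Dict[str, str], mime_type: str) -> Optional[str]:
    """Return the handler registered for a MIME type, or None if unsupported."""
    return handlers.get(mime_type)


print(pick_parser(file_handlers, "application/json"))  # None before registration
file_handlers = register_json(file_handlers)
print(pick_parser(file_handlers, "application/json"))  # JSONParser after registration
```

The point of the sketch is that the hook does not need to replace anything: it receives the existing mapping, adds one entry, and returns it.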

Conclusions

We did it, as simple as that. In summary, we have implemented a custom parser that reads a stream of bytes from a blob (like we would do from a file) and loads the JSON content into a Python dictionary. Then, we serialized the dictionary and stored it in the page_content attribute of the Document datatype. Finally, with the proper hook, we added the custom parser under the "application/json" key, which is the MIME type for JSON files.

If you are interested in tutorials on other hooks, take a look at how to change the Cat’s prompt.

