DatasetsPorter exports a list of Chunk objects into a Hugging Face Dataset object. This is particularly useful for saving your processed chunks in a standardized format for training models, sharing, or archiving.
Installation
The DatasetsPorter requires the datasets library, which is available from PyPI (for example, via pip install datasets).
For general installation instructions, see the Installation Guide.
Initialization
To get started, import and initialize the porter.
Parameters
The list of Chunk objects to be exported.
save_to_disk: If True, the dataset will be saved to the location specified in the path parameter.
path: The local directory path where the dataset should be saved. This is only used if save_to_disk is True.
Additional keyword arguments to be passed directly to the datasets.Dataset.save_to_disk method. This allows you to control aspects like the number of shards or processes.
Usage
The DatasetsPorter can either return a Dataset object directly for in-memory use or save it to disk.
Return a Dataset Object
By default, the porter returns a Dataset object without writing any files.
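As a minimal sketch of the in-memory path, the helper below wraps the export call. The helper name export_in_memory is hypothetical, and passing the chunk list as the first positional argument is an assumption. save_to_disk is set explicitly so the sketch does not depend on the default.

```python
from typing import Any, Sequence


def export_in_memory(porter: Any, chunks: Sequence[Any]) -> Any:
    """Export chunks to a Dataset without writing any files.

    `porter` is expected to be a DatasetsPorter instance; `chunks` is the
    list of Chunk objects produced by a chunker.
    """
    # Pass save_to_disk explicitly rather than relying on the default.
    return porter.export(chunks, save_to_disk=False)


# Usage, assuming the datasets library is installed and that DatasetsPorter
# is importable from the chonkie package (an assumption, not confirmed here):
#
#   from chonkie import DatasetsPorter
#   dataset = export_in_memory(DatasetsPorter(), chunks)
#   print(dataset)
```

Taking the porter as an argument keeps the helper easy to exercise with a stand-in object when the optional dependency is not installed.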
Save a Dataset to Disk
To save the dataset, set save_to_disk=True and provide a path. The method will still return the Dataset object.
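The save-to-disk path might look like the sketch below. The helper name export_to_disk and the directory name in the usage comment are hypothetical. The extra keyword arguments are forwarded unchanged, on the understanding that the porter hands them to datasets.Dataset.save_to_disk, which accepts options such as num_shards and num_proc.

```python
from typing import Any, Sequence


def export_to_disk(
    porter: Any, chunks: Sequence[Any], path: str, **save_kwargs: Any
) -> Any:
    """Export chunks, write the dataset under `path`, and return the Dataset.

    Any extra keyword arguments (for example num_shards or num_proc) are
    forwarded to the porter untouched.
    """
    return porter.export(chunks, save_to_disk=True, path=path, **save_kwargs)


# Usage, assuming DatasetsPorter is importable from the chonkie package
# (an assumption) and "my_exported_dataset" is a writable directory name:
#
#   from chonkie import DatasetsPorter
#   from datasets import load_from_disk
#
#   dataset = export_to_disk(DatasetsPorter(), chunks, "my_exported_dataset")
#   reloaded = load_from_disk("my_exported_dataset")  # read it back later
```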
Using as a Callable
The porter can also be used as a callable, which is an alias for the export method.
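Since calling the porter is described as an alias for export, the two forms should be interchangeable. The sketch below uses the callable form; the helper name export_via_call is hypothetical, and it assumes the call accepts the same arguments as export.

```python
from typing import Any, Sequence


def export_via_call(porter: Any, chunks: Sequence[Any], **options: Any) -> Any:
    """Export chunks by calling the porter directly.

    Intended to be equivalent to porter.export(chunks, **options).
    """
    return porter(chunks, **options)


# Usage, assuming DatasetsPorter is importable from the chonkie package
# (an assumption):
#
#   from chonkie import DatasetsPorter
#   dataset = export_via_call(DatasetsPorter(), chunks, save_to_disk=False)
```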
Return Type
The export method (and the __call__ method) always returns a datasets.Dataset object, regardless of whether the dataset is saved to disk. This allows you to work with the dataset immediately after exporting.
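Because the return value is a regular datasets.Dataset, it can be inspected right away. The sketch below reads two standard Dataset attributes, num_rows and column_names; the helper name describe_dataset is hypothetical, and the actual column names depend on the porter, so none are assumed here.

```python
from typing import Any, Dict


def describe_dataset(dataset: Any) -> Dict[str, Any]:
    """Summarize an exported Dataset using its row count and column names."""
    return {
        "num_rows": dataset.num_rows,            # number of exported chunks
        "columns": list(dataset.column_names),   # fields stored per chunk
    }


# Usage with the value returned by the porter:
#
#   dataset = porter.export(chunks, save_to_disk=False)
#   print(describe_dataset(dataset))
```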