DatasetsPorter exports a list of Chunk objects into a Hugging Face Dataset object. This is particularly useful for saving your processed chunks in a standardized format for training models, sharing, or archiving.
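To make the conversion concrete, here is a minimal sketch of how a list of chunk objects can be flattened into the column-oriented mapping that a Dataset is commonly built from. SimpleChunk and chunks_to_columns are hypothetical stand-ins for illustration; the real Chunk class and the porter's internals may differ.

```python
from dataclasses import dataclass, asdict


@dataclass
class SimpleChunk:
    """Hypothetical stand-in for a chunk; the real Chunk class may carry other fields."""
    text: str
    start_index: int
    end_index: int
    token_count: int


def chunks_to_columns(chunks):
    """Turn a list of row-like chunks into a dict of equal-length columns."""
    rows = [asdict(chunk) for chunk in chunks]
    if not rows:
        return {}
    # One list per field, in the same order as the input chunks.
    return {key: [row[key] for row in rows] for key in rows[0]}


chunks = [
    SimpleChunk(text="Hello world.", start_index=0, end_index=12, token_count=3),
    SimpleChunk(text=" Second chunk.", start_index=12, end_index=26, token_count=3),
]
columns = chunks_to_columns(chunks)
```

A mapping shaped like columns is the kind of input that datasets.Dataset.from_dict accepts, which is one way such data could become a Dataset.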
Installation
The DatasetsPorter requires the datasets library. You can install it with pip, for example with pip install datasets.
For general installation instructions, see the Installation Guide.
Initialization
To get started, simply import and initialize the porter.
Parameters
The list of Chunk objects to be exported.
save_to_disk: If True, the dataset will be saved to the location specified in the path parameter.
path: The local directory path where the dataset should be saved. This is only used if save_to_disk is True.
Additional keyword arguments are passed directly to the datasets.Dataset.save_to_disk method. This allows you to control aspects like the number of shards or processes.
Usage
The DatasetsPorter can either return a Dataset object directly for in-memory use or save it to disk.
Return a Dataset Object
By default, the porter returns a Dataset object without writing any files.
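A minimal sketch of the in-memory case follows. It assumes porter is an already initialized DatasetsPorter and chunks is a list of Chunk objects produced earlier in your pipeline; export_in_memory is a hypothetical helper name used only for illustration.

```python
def export_in_memory(porter, chunks):
    """Export chunks to a Dataset without writing anything to disk.

    `porter` is assumed to be an initialized DatasetsPorter and `chunks`
    a list of Chunk objects produced earlier in your pipeline.
    """
    # No save_to_disk argument is passed, so the documented default applies
    # and the dataset is only returned, not saved.
    dataset = porter.export(chunks)
    return dataset
```

The returned object can then be inspected like any other Dataset, for example with len(dataset) or dataset[0].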
Save a Dataset to Disk
To save the dataset, set save_to_disk=True and provide a path. The method will still return the Dataset object.
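The save-to-disk case can be sketched in the same way. Again, porter is assumed to be an initialized DatasetsPorter, and export_to_disk is a hypothetical helper. The extra keyword arguments illustrate the pass-through to datasets.Dataset.save_to_disk described in the parameter list; num_shards is one option that method accepts.

```python
def export_to_disk(porter, chunks, path, **save_kwargs):
    """Export chunks, save the dataset under `path`, and return it.

    `porter` is assumed to be an initialized DatasetsPorter. Extra keyword
    arguments are forwarded by the porter to datasets.Dataset.save_to_disk.
    """
    dataset = porter.export(
        chunks,
        save_to_disk=True,  # write the dataset to disk
        path=path,          # target directory for the saved dataset
        **save_kwargs,      # e.g. num_shards=2
    )
    # The Dataset is returned even though it was also written to disk.
    return dataset
```

A call such as export_to_disk(porter, chunks, "my_chunks", num_shards=2) would save and return the dataset in one step. A dataset saved this way can usually be read back later with datasets.load_from_disk.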
Using as a Callable
The porter can also be used as a callable, which is an alias for the export method.
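The alias relationship can be illustrated with a small stand-in class. SketchPorter is hypothetical and is not the real DatasetsPorter; it only shows the general pattern of __call__ delegating to export.

```python
class SketchPorter:
    """Hypothetical porter used only to illustrate the alias pattern."""

    def export(self, chunks, **kwargs):
        # Stand-in for the real export logic: just report what was received.
        return {"num_chunks": len(chunks), "options": kwargs}

    def __call__(self, chunks, **kwargs):
        # Calling the instance simply delegates to export().
        return self.export(chunks, **kwargs)


porter = SketchPorter()
same_result = porter(["a", "b"]) == porter.export(["a", "b"])
```

With the real porter, this means porter(chunks) and porter.export(chunks) should be interchangeable, including any keyword arguments.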
Return Type
The export method (and the __call__ method) always returns a datasets.Dataset object, regardless of whether it is saved to disk. This allows you to work with the dataset immediately after exporting.