POST /v1/chunk/sdpm
curl --request POST \
  --url https://api.chonkie.ai/v1/chunk/sdpm \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "text": "<string>",
  "embedding_model": "minishlab/potion-base-8M",
  "threshold": "<string>",
  "mode": "window",
  "chunk_size": 512,
  "similarity_window": 1,
  "min_sentences": 1,
  "min_characters_per_sentence": 1,
  "threshold_step": 0.01,
  "delim": "<string>",
  "skip_window": 1,
  "return_type": "chunks"
}'
[
  {
    "text": "<string>",
    "start_index": 123,
    "end_index": 123,
    "token_count": 123,
    "sentences": [
      {
        "text": "<string>",
        "start_index": 123,
        "end_index": 123,
        "token_count": 123,
        "embedding": [
          123
        ]
      }
    ]
  }
]

Authorizations

Authorization
string
header
required

Your API key from the Chonkie Cloud dashboard, sent in the Authorization header as Bearer <token>.
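
For readers calling the endpoint from Python rather than curl, the following is a minimal sketch of the same request using only the standard library. The helper name build_sdpm_request and the placeholder key YOUR_API_KEY are illustrative assumptions, not part of the Chonkie API; the URL, headers, and body fields are taken from the curl example above. The request is only built here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.chonkie.ai/v1/chunk/sdpm"


def build_sdpm_request(text, api_key, **options):
    """Build (but do not send) a POST request for the SDPM endpoint.

    build_sdpm_request is a hypothetical helper; `options` carries any
    optional body fields such as chunk_size or threshold.
    """
    payload = {"text": text, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# YOUR_API_KEY is a placeholder; substitute the key from your dashboard.
req = build_sdpm_request(
    "First sentence. Second sentence.",
    "YOUR_API_KEY",
    chunk_size=512,
    threshold="auto",
)
```

To send it, pass req to urllib.request.urlopen and decode the JSON body of the response.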

Body

application/json

Data to pass to the SDPM Chunker.

text
required

The input text or list of texts to be chunked.

embedding_model
string
default:minishlab/potion-base-8M

Identifier of the embedding model to use.

threshold
default:auto

A value in the range [0, 1] is used as the similarity threshold for treating sentences as similar. A value in the range (1, 100] is interpreted as a percentile threshold. When set to 'auto', the threshold is calculated automatically.

Allowed value: "auto"
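
The three threshold cases described above can be sketched as a small client-side check. This is only an illustration of the documented ranges, useful for validating input before sending a request; the function name interpret_threshold and its return labels are assumptions, and the server's actual handling may differ.

```python
def interpret_threshold(threshold):
    """Classify a threshold value according to the documented ranges.

    Hypothetical helper: returns a (kind, value) pair, where kind is
    'auto', 'similarity', or 'percentile'.
    """
    if threshold == "auto":
        return ("auto", None)
    value = float(threshold)
    if 0 <= value <= 1:
        # [0, 1]: direct similarity threshold
        return ("similarity", value)
    if 1 < value <= 100:
        # (1, 100]: percentile threshold
        return ("percentile", value)
    raise ValueError(f"threshold out of range: {threshold!r}")


kind, value = interpret_threshold(95)
```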
mode
string
default:window

Mode for grouping sentences, either 'cumulative' or 'window'

chunk_size
integer
default:512

Maximum tokens per chunk

similarity_window
integer
default:1

Number of sentences to consider for similarity threshold calculation

min_sentences
integer
default:1

Minimum number of sentences per chunk

min_characters_per_sentence
integer
default:1

Minimum number of characters per sentence

threshold_step
number
default:0.01

Step size for threshold calculation

delim
default:["\n",".","!","?"]

Delimiters to split sentences on

skip_window
integer
default:1

Number of chunks to skip when looking for similarities

return_type
enum<string>
default:chunks

Return type for the chunking process. If 'chunks', returns a list of SemanticChunk objects. If 'texts', returns a list of strings.

Available options: texts, chunks
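
Because return_type changes the shape of the response items (plain strings for 'texts', objects for 'chunks'), client code may want to accept both. A minimal sketch, assuming the response body has already been decoded from JSON into a Python list; chunk_texts is a hypothetical helper name.

```python
def chunk_texts(response_items):
    """Return the text of every item, whichever return_type was used.

    Hypothetical helper: items are either strings (return_type='texts')
    or dicts with a 'text' field (return_type='chunks').
    """
    texts = []
    for item in response_items:
        if isinstance(item, str):
            texts.append(item)
        else:
            texts.append(item["text"])
    return texts


# Hand-written sample data for illustration, not real API output.
as_texts = chunk_texts(["Cats purr. ", "Dogs bark."])
as_chunks = chunk_texts([{"text": "Cats purr. ", "start_index": 0, "end_index": 11}])
```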

Response

200 - application/json
Successful Response: A list of semantic chunk objects.
text
string

The actual text content of the chunk.

start_index
integer

The starting character index of the chunk within the original input text.

end_index
integer

The ending character index (exclusive) of the chunk within the original input text.
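
Since end_index is exclusive, start_index and end_index map directly onto a Python slice of the original input. The sketch below checks that property on hand-written sample data (not real API output); chunk_matches_source is a hypothetical helper name.

```python
def chunk_matches_source(source_text, chunk):
    """Check that slicing the source by the chunk's indices yields its text."""
    start, end = chunk["start_index"], chunk["end_index"]
    return source_text[start:end] == chunk["text"]


source = "Cats purr. Dogs bark."
# Illustrative chunks written by hand to match the documented fields.
sample_chunks = [
    {"text": "Cats purr. ", "start_index": 0, "end_index": 11},
    {"text": "Dogs bark.", "start_index": 11, "end_index": 21},
]
all_match = all(chunk_matches_source(source, c) for c in sample_chunks)
```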

token_count
integer

The number of tokens in this specific chunk, according to the tokenizer used.

sentences
object[]

List of semantic sentences contained within this chunk. Each item represents a single sentence within the chunk and may include an embedding.
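
The response shape shown in the example at the top of this page can be mirrored with a pair of dataclasses. This is a sketch under assumptions: the class name SemanticSentence and the helper parse_chunk are illustrative, the field names are copied from the example response, and embedding is treated as optional because the description says it is only potentially included.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SemanticSentence:
    text: str
    start_index: int
    end_index: int
    token_count: int
    embedding: Optional[List[float]] = None  # may be absent


@dataclass
class SemanticChunk:
    text: str
    start_index: int
    end_index: int
    token_count: int
    sentences: List[SemanticSentence] = field(default_factory=list)


def parse_chunk(raw):
    """Convert one decoded JSON chunk object into a SemanticChunk."""
    sentences = [
        SemanticSentence(
            text=s["text"],
            start_index=s["start_index"],
            end_index=s["end_index"],
            token_count=s["token_count"],
            embedding=s.get("embedding"),
        )
        for s in raw.get("sentences", [])
    ]
    return SemanticChunk(
        text=raw["text"],
        start_index=raw["start_index"],
        end_index=raw["end_index"],
        token_count=raw["token_count"],
        sentences=sentences,
    )


# Hand-written sample for illustration, not real API output.
chunk = parse_chunk({
    "text": "Cats purr. ",
    "start_index": 0,
    "end_index": 11,
    "token_count": 3,
    "sentences": [
        {"text": "Cats purr. ", "start_index": 0, "end_index": 11, "token_count": 3}
    ],
})
```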