Chunkr supports structured extraction: it pulls specific data fields out of a document according to a JSON schema you define, turning unstructured document content into structured data.
JSON Schema
When creating a task, you can provide a JSON schema that defines the structure of data you want to extract. Here is an example of how to set up a structured extraction task:
curl -v -X POST "https://legacy-api.chunkr.ai/api/v1/task" \
  -H "Authorization: <your_api_key>" \
  -F "file=@/Users/pyscripts/input/test.pdf" \
  -F "model=Fast" \
  -F "target_chunk_length=512" \
  -F "ocr_strategy=Auto" \
  -F 'json_schema={
    "title": "Basket",
    "type": "object",
    "properties": [
      {
        "name": "patient cohort information",
        "title": "Patient Cohort Information",
        "type": "string",
        "description": "A summary of patient cohort information",
        "default": null
      },
      {
        "name": "implications of data",
        "title": "Implications of Data",
        "type": "string",
        "description": "The implications of the observed data",
        "default": null
      }
    ]
  };type=application/json'
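The same request can be assembled from TypeScript. The following is a minimal sketch, assuming a runtime with global fetch, FormData, and Blob (Node 18+ or a browser). The helper names buildTaskForm and createTask are hypothetical and not part of any Chunkr SDK. Appending a Blob gives the json_schema part a filename, which differs slightly from the curl form, so whether the server accepts it unchanged is an assumption that has not been verified here.

```typescript
// Shapes mirror the schema sent in the curl example.
interface Property {
  name: string;
  title?: string;
  type: string;
  description?: string;
  default?: string | null;
}

interface JsonSchema {
  title: string;
  type: string;
  properties: Property[];
}

// Hypothetical helper: builds a multipart body with the same fields as the curl call.
function buildTaskForm(file: Blob, fileName: string, schema: JsonSchema): FormData {
  const form = new FormData();
  form.append("file", file, fileName);
  form.append("model", "Fast");
  form.append("target_chunk_length", "512");
  form.append("ocr_strategy", "Auto");
  // Sent as an application/json part, similar to `;type=application/json` in curl.
  form.append(
    "json_schema",
    new Blob([JSON.stringify(schema)], { type: "application/json" })
  );
  return form;
}

// Hypothetical helper: posts the form to the endpoint used in the curl example.
async function createTask(form: FormData, apiKey: string): Promise<unknown> {
  const response = await fetch("https://legacy-api.chunkr.ai/api/v1/task", {
    method: "POST",
    headers: { Authorization: apiKey },
    body: form,
  });
  if (!response.ok) {
    throw new Error(`Task creation failed with HTTP ${response.status}`);
  }
  return response.json();
}

const exampleSchema: JsonSchema = {
  title: "Basket",
  type: "object",
  properties: [
    {
      name: "patient cohort information",
      title: "Patient Cohort Information",
      type: "string",
      description: "A summary of patient cohort information",
      default: null,
    },
    {
      name: "implications of data",
      title: "Implications of Data",
      type: "string",
      description: "The implications of the observed data",
      default: null,
    },
  ],
};

// Usage (performs a network call, so it is left commented out):
// const form = buildTaskForm(pdfBlob, "test.pdf", exampleSchema);
// const task = await createTask(form, "<your_api_key>");
```

Keeping form construction separate from the network call makes the request body easy to inspect before anything is sent.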
JSON Schema Structure
The json_schema defines the structure of the data to be extracted. It consists of a title, a type, and a list of properties. Each property represents a specific field to extract from the document.
TypeScript Interface Representation
Below are the TypeScript interfaces that model the JSON schema:
export interface JsonSchema {
  title: string;
  type: string;
  properties: Property[];
}

export interface Property {
  name: string;
  title?: string;
  type: string;
  description?: string;
  default?: string | null; // the request example sends null
}
Property Fields Explanation
- name: The identifier for the field in the extracted data.
- title: A human-readable title for the field.
- type: The data type of the field (e.g., string, list).
- description: A description of what the field represents.
- default: The default value for the field if no data is extracted.
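Before submitting a schema, it can be sanity-checked against the field descriptions above. The following is a minimal sketch; validateSchema is a hypothetical helper, and the rules it enforces (a non-empty title, non-empty name and type, unique names) are assumptions about what makes a schema well-formed, not documented server-side validation.

```typescript
interface Property {
  name: string;
  title?: string;
  type: string;
  description?: string;
  default?: string | null;
}

interface JsonSchema {
  title: string;
  type: string;
  properties: Property[];
}

// Returns a list of human-readable problems; an empty list means none were found.
function validateSchema(schema: JsonSchema): string[] {
  const problems: string[] = [];
  const seenNames = new Set<string>();

  if (schema.title.trim() === "") {
    problems.push("schema title is empty");
  }

  schema.properties.forEach((property, index) => {
    if (property.name.trim() === "") {
      problems.push(`property ${index}: name is empty`);
    } else if (seenNames.has(property.name)) {
      problems.push(`property ${index}: duplicate name "${property.name}"`);
    }
    seenNames.add(property.name);

    if (property.type.trim() === "") {
      problems.push(`property ${index}: type is empty`);
    }
  });

  return problems;
}

const basket: JsonSchema = {
  title: "Basket",
  type: "object",
  properties: [
    { name: "patient cohort information", type: "string", default: null },
    { name: "implications of data", type: "string", default: null },
  ],
};

console.log(validateSchema(basket));
```

Returning a list of problems rather than throwing on the first one lets a caller report every issue in a single pass.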
Interpreting the Response
Once the task completes, the response includes the extracted data under extracted_json, with one entry in extracted_fields for each property in the schema. Here is an example of the output:
{
  "configuration": {
    "model": "Fast",
    "ocr_strategy": "Auto",
    "target_chunk_length": 512
  },
  "created_at": "2023-04-15T10:30:00Z",
  "expires_at": "2023-04-22T10:30:00Z",
  "file_name": "document.pdf",
  "finished_at": "2023-04-15T10:31:30Z",
  "input_file_url": "https://storage.chunkr.ai/input/document.pdf",
  "message": "Task completed successfully",
  "output": {
    "chunk_length": 1,
    "segments": [
      {
        "bbox": {},
        "content": "",
        "html": "",
        "image": "",
        "markdown": "",
        "ocr": [],
        "page_height": 842.0,
        "page_number": 1,
        "page_width": 595.0,
        "segment_id": "<segment_id>",
        "segment_type": "Text"
      }
    ]
  },
  "page_count": 5,
  "status": "Succeeded",
  "task_id": "<task_id>",
  "task_url": "https://legacy-api.chunkr.ai/api/v1/task/<task_id>",
  "extracted_json": {
    "title": "Clinical Trial Results",
    "schema_type": "object",
    "extracted_fields": [
      {
        "name": "patient cohort information",
        "field_type": "string",
        "value": "The patient cohort consisted of 30-50 year old males with a history of hypertension and diabetes."
      },
      {
        "name": "implications of data",
        "field_type": "string",
        "value": "The effect of citrus bergamot on heart rate was observed to be statistically significant in the clinical trial performed."
      }
    ]
  }
}
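To use the extracted values in application code, it can help to flatten extracted_fields into a lookup keyed by field name. The following is a minimal sketch; the ExtractedField and ExtractedJson interfaces are inferred from the example response above rather than taken from a published type definition, and fieldsByName is a hypothetical helper. The value is typed as unknown because only string values appear in the example.

```typescript
// Inferred from the example response; not an official type definition.
interface ExtractedField {
  name: string;
  field_type: string;
  value: unknown;
}

interface ExtractedJson {
  title: string;
  schema_type: string;
  extracted_fields: ExtractedField[];
}

// Hypothetical helper: maps each field name to its extracted value.
// If two fields share a name, the later one wins.
function fieldsByName(extracted: ExtractedJson): Map<string, unknown> {
  const byName = new Map<string, unknown>();
  for (const field of extracted.extracted_fields) {
    byName.set(field.name, field.value);
  }
  return byName;
}

const extracted: ExtractedJson = {
  title: "Clinical Trial Results",
  schema_type: "object",
  extracted_fields: [
    {
      name: "patient cohort information",
      field_type: "string",
      value:
        "The patient cohort consisted of 30-50 year old males with a history of hypertension and diabetes.",
    },
    {
      name: "implications of data",
      field_type: "string",
      value:
        "The effect of citrus bergamot on heart rate was observed to be statistically significant in the clinical trial performed.",
    },
  ],
};

const values = fieldsByName(extracted);
console.log(values.get("patient cohort information"));
```

Looking fields up by the same name used in the request schema keeps the request and response sides of the code aligned.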