bbox which is the bounding box of the segment and a segment_type which is the type of the segment.
Segment Model
The Segment model represents individual elements extracted from the document. It includes the following properties:
Bounding box
The BoundingBox model represents the bounding box of a segment. It can be used for highlights and annotations, as well as to snip images for vision language and embedding models.
Click here to learn more about the bounding box model.
Content
The Content model represents the content of a segment. We use the following rules to extract the content:
- If ocr exists and its average confidence is greater than the threshold, use the text from the ocr
- If the average confidence from ocr is less than the threshold, use the text layer of the document
- If there is no ocr, use the text layer of the document
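The rules above can be sketched roughly as follows. This is an illustration only, not the actual implementation: the function name `select_content`, the `(text, confidence)` pair shape, and the default threshold of 0.85 are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def select_content(
    ocr_words: Optional[Sequence[Tuple[str, float]]],
    text_layer: str,
    threshold: float = 0.85,  # hypothetical default, not taken from the docs
) -> str:
    """Pick the content of a segment from OCR output or the text layer."""
    # No OCR result: use the text layer of the document.
    # (Treating an empty OCR result as "no ocr" is an assumption.)
    if not ocr_words:
        return text_layer

    # Average confidence across all OCR words in the segment.
    average = sum(conf for _, conf in ocr_words) / len(ocr_words)

    # Confident OCR: use the OCR text.
    if average > threshold:
        return " ".join(text for text, _ in ocr_words)

    # Low-confidence OCR: use the text layer. The rules do not say what
    # happens when the average equals the threshold; this sketch falls
    # back to the text layer in that case.
    return text_layer
```

For example, `select_content([("Hello", 0.9), ("world", 0.95)], "layer")` returns the OCR text, while a segment whose words average below the threshold returns the text layer.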
OCR
The ocr field is the OCR result of a segment. Click here to learn more about the OCR model.
Image
The image is a presigned URL to the image of the segment. Images are only generated for segments that are processed by ocr.
To learn more about how to configure ocr, please refer to the ocr section.
HTML
The html field is the HTML representation of a segment. It contains the content of the segment with the appropriate HTML tags applied for the segment_type.
For tables, the html field contains the table in HTML format. It differs from the raw content field but contains the same text.
Markdown
The markdown field is the Markdown representation of a segment. It contains the content of the segment with the appropriate Markdown syntax applied for the segment_type.
For tables, the markdown field contains the table in Markdown format. It differs from the raw content field but contains the same text.
Page height
The page_height is the height of the page in pixels.
Page width
The page_width is the width of the page in pixels.
Page number
The page_number is the number of the page in the document where the segment is located.
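Because the page height and width are given in pixels, they can be used to express a segment's bounding box as fractions of the page, which is convenient when drawing highlights on a page rendered at a different size. The sketch below assumes a bounding box with `left`, `top`, `width`, and `height` fields in pixels. Those field names and the `to_relative` helper are hypothetical; refer to the bounding box model for the actual shape.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    # Hypothetical field names, all in pixels; the real model may differ.
    left: float
    top: float
    width: float
    height: float


def to_relative(bbox: BBox, page_width: float, page_height: float) -> BBox:
    """Scale a pixel bounding box to fractions of the page size (0 to 1)."""
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page_width and page_height must be positive")
    return BBox(
        left=bbox.left / page_width,
        top=bbox.top / page_height,
        width=bbox.width / page_width,
        height=bbox.height / page_height,
    )
```

Multiplying the relative values by the rendered page size gives the position of the highlight at any zoom level.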
Segment Types
The segment_type is assigned by the model used during the segmentation process. To learn more about how to configure the model, please refer to the model section.
This is the list of all the segment types in the order of the hierarchy: