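Version
v2.0
Technical integration support

Intelligent Document Extraction

Description

General information extraction

Intelligent Document Recognition (Extraction): API Documentation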

Request URL

https://api.textin.com/ai/service/v2/entity_extraction

HTTP Request Method

HTTP POST

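Request Headers

Add the following custom headers to the HTTP request.

Header name
x-ti-app-id Log in and go to "Workbench - Account Settings - Developer Information" to view your x-ti-app-id
x-ti-secret-code Log in and go to "Workbench - Account Settings - Developer Information" to view your x-ti-secret-code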

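URL Parameters

URL parameters are key-value pairs appended to the URL in the form {name}={value}. The query string starts with ? and parameters are joined with &, for example ?p1=v1&p2=v2
Parameter Type Required Allowed values Description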
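page_start integer see description

When a PDF is uploaded, page_start is the page from which extraction starts.

page_count integer see description

When a PDF is uploaded, page_count is the number of PDF pages to extract.

  • Prompt mode: at most 20 pages in total; the default is 20
  • Custom key mode: at most 100 pages in total; the default is 100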
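parse_mode string see description

PDF parsing mode. The default is scan, which processes the file by text recognition (OCR) only. Images do not need this parameter; they are always processed by text recognition.

  • auto combines text recognition and PDF parsing
  • scan text recognition only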
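get_image string none, page, objects, both

Controls which images are returned. Takes effect in Prompt mode only. The default is objects, which returns the image objects within each page.

  • none returns no images
  • page returns the full-page image of every page
  • objects returns the image objects within the page
  • both returns full-page images and image objects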
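crop_image integer see description

Whether to crop and deskew the document. The default is 0 (no cropping or deskewing).

  • 0 no cropping or deskewing
  • 1 crop and deskew
remove_watermark integer see description

Whether to remove watermarks. The default is 0 (watermarks are kept).

  • 0 do not remove watermarks
  • 1 remove watermarks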
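formula_level integer see description

Formula recognition level. The default is 0 (recognize all formulas).

  • 0 recognize all formulas
  • 1 recognize display (block) formulas only; inline formulas are not recognized
  • 2 recognize no formulas
file_name string see description

File name of the sample to extract, including the extension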
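The query-string rule described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper name build_url and the parameter values are assumptions chosen for the example, and the source does not state whether page_start is 0-based or 1-based.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.textin.com/ai/service/v2/entity_extraction"


def build_url(**params):
    """Append the given URL parameters as ?p1=v1&p2=v2 (hypothetical helper)."""
    # Parameters left as None are omitted so the service applies its defaults.
    query = {name: value for name, value in params.items() if value is not None}
    return f"{BASE_URL}?{urlencode(query)}" if query else BASE_URL


# Illustrative values only.
url = build_url(page_start=1, page_count=5, parse_mode="auto", crop_image=0)
print(url)
```

urlencode also percent-encodes values such as a file_name containing spaces or non-ASCII characters, which is why the string is not assembled by hand.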

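Request Body

Content-Type: application/json

Supported file formats: png, jpg, jpeg, pdf, bmp, tiff, webp, doc, docx, html, mhtml, xls, xlsx, csv, ppt, pptx, txt, ofd.

  • Two modes are supported:
    • Prompt mode:
      • At most 20 pages of a document are processed; content beyond that is ignored.
      • You provide a prompt and the system extracts according to it.
      • If both a prompt and keys are supplied, the call is handled in Prompt mode.
    • Custom key mode:
      • At most 100 pages of a document are processed; content beyond that is ignored.
      • You provide a fields list and a table_fields list and the system extracts according to them.
      • The number of extraction fields is the number of elements in the fields array plus the number of elements in the fields sub-array of every object in table_fields. The total must not exceed 100.
      • If the total exceeds the limit, the elements of the fields array are extracted first and the fields beyond the limit are ignored.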
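The field-count rule for custom key mode can be checked on the client before sending a request. The sketch below only counts fields as described above; the helper names and the sample field names are hypothetical, and it does not try to reproduce how the service decides which fields to drop.

```python
MAX_FIELDS = 100  # limit stated above for custom key mode


def count_fields(fields, table_fields):
    """Elements of fields plus elements of every table's fields sub-array."""
    top_level = len(fields or [])
    in_tables = sum(len(table.get("fields") or []) for table in (table_fields or []))
    return top_level + in_tables


def exceeds_limit(fields, table_fields):
    """True when the request would go over the documented 100-field limit."""
    return count_fields(fields, table_fields) > MAX_FIELDS


# Hypothetical field definitions, for illustration only.
fields = [{"name": "invoice_number"}, {"name": "invoice_date"}]
table_fields = [
    {"title": "line items", "fields": [{"name": "item"}, {"name": "amount"}]},
    {"title": "notes"},  # the fields sub-array is optional
]
total = count_fields(fields, table_fields)
print(total, exceeds_limit(fields, table_fields))
```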

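The request JSON has the following structure:

Field Type Description
file string

Base64 string of the document to process

Example: /9j/4AAQSk...

prompt string

Extraction prompt. When this field is supplied, the following fields are ignored:

  • fields
  • table_fields
fields array

Text fields to extract

+ name string

Field name

+ description string

Prompt used when extracting this field; optional

table_fields array

Tables to extract in table extraction

+ title string

Table title

Example: 学生成绩表 (student grade table)

+ description string

Prompt for the table title

+ fields array

Header fields of this table; optional

   ++ name string

Field name

   ++ description string

Prompt used when extracting this field; optional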
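Putting the headers and the body structure together, a request could be assembled roughly as below. This is a sketch using only the Python standard library; build_request is a hypothetical helper, the credentials are placeholders, and the request is built but not sent.

```python
import base64
import json
import urllib.request

URL = "https://api.textin.com/ai/service/v2/entity_extraction"


def build_request(file_bytes, app_id, secret_code,
                  prompt=None, fields=None, table_fields=None):
    """Build (but do not send) a POST request with a JSON body."""
    body = {"file": base64.b64encode(file_bytes).decode("ascii")}
    if prompt is not None:
        # Prompt mode: fields and table_fields would be ignored, so omit them.
        body["prompt"] = prompt
    else:
        if fields:
            body["fields"] = fields
        if table_fields:
            body["table_fields"] = table_fields
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-ti-app-id": app_id,
            "x-ti-secret-code": secret_code,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder credentials and a dummy payload, for illustration only.
req = build_request(b"hello", "YOUR_APP_ID", "YOUR_SECRET_CODE",
                    fields=[{"name": "invoice_number"}])
sent_body = json.loads(req.data)
print(req.get_method(), sorted(sent_body))
```

To actually call the service, the request would be passed to urllib.request.urlopen(req) and the response body decoded with json.loads.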

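Response

Content-Type: application/json

The JSON structure is described below.

Note: every API response contains the field x_request_id (string), which uniquely identifies the request.

Field Type Description
version string

Version number

code integer Error code; see "Error Codes"
message string

Error message

duration integer

Inference time (ms)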

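result object
+ llm_json

Raw extraction result produced by the large language model (LLM). Returned only when the prompt parameter is supplied. It is a simplified key-value structure intended for direct use.

Because an LLM extracts the information according to the user's input, the field names and data types are determined by the user's prompt and cannot be known in advance.

+ raw_json object

LLM extraction result with coordinate information. Returned only when the prompt parameter is supplied. It contains detailed position and bounding-box data for advanced processing.

Because an LLM extracts the information according to the user's input, the field names cannot be known in advance, but every field value follows the same structure: the extracted value, page information, and detailed coordinates.

  • Note: the object type given here does not mean that this field itself is an object; it means that each field value in llm_json is turned from a string into the object structure below
  • For details, see "Example 1: Prompt mode, object result" and "Example 2: Prompt mode, array result" in the JSON examples below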
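   ++ value string

Extracted value of the field; the same as the corresponding field value in llm_json

Example: "011892"

   ++ pages array

List of page numbers on which the field appears

Example: [1]

   ++ bounding_regions array

Bounding-box information for the field, including its position and per-character coordinates

    +++ position array

Coordinates of the field in the document

Example: [201,199,308,199,308,230,201,230]

    +++ char_pos array

Coordinates of each character

Example: [[202,202,218,201,218,230,201,229],[220,202,235,202,236,228,220,229]]

    +++ page_id integer

ID of the page on which the region is located

Example: 1

    +++ value string

Text content inside this bounding box

Example: "011892"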

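+ pages array

Per-page information for the LLM extraction result with coordinates. Returned only when the prompt parameter is supplied. For multi-page documents, details of every page are returned so that coordinates can be drawn back onto the page.

   ++ status string

Engine output status for the current page, or an error message

   ++ page_id number

Current page number (set to 0 for flow-layout files)

   ++ durations number

Total processing time for the current page

   ++ image_id string

Image ID of the current page. Download it from https://api.textin.com/ocr_image/download?image_id=xxx with the x-ti-app-id and x-ti-secret-code headers. Returned when the input parameter image_output_type=default and get_image=page/both. For example, with curl:
curl 'https://api.textin.com/ocr_image/download?image_id=xxx' \
  --header 'x-ti-app-id: c81f*************************e9ff' \
  --header 'x-ti-secret-code: 5508********************1c17'

   ++ origin_image_id string

Original page image before cropping or watermark removal. Returned only when cropping or watermark removal is enabled, image_output_type=default, and get_image=page/both. Download it the same way as image_id.

   ++ width integer

Page width

   ++ height integer

Page height

   ++ angle integer

Angle of the image (of the Chinese text). When the input is an image, the default is 0; possible values are 0, 90, 180, 270.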

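+ usage object

Token usage of the LLM extraction. Returned only when the prompt parameter is supplied.

   ++ prompt_tokens integer

Number of input tokens consumed by the LLM extraction

   ++ completion_tokens integer

Number of output tokens consumed by the LLM extraction

   ++ total_tokens integer

Total number of tokens consumed by the LLM extraction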

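+ category object

Data type of each entry in the details field

  • one_to_one: a key extracted as a single value
  • item_list: a table extraction
   ++ additionalProp1 string
  • one_to_one
   ++ additionalProp2 string
  • one_to_one
   ++ additionalProp3 string
  • one_to_one
   ++ row string

Table type

+ rotated_image_width integer

Width of the document in its upright orientation; valid only when the document is an image

+ rotated_image_height integer

Height of the document in its upright orientation; valid only when the document is an image

+ page_count integer

Number of pages processed by the extraction. If the document exceeds the maximum page limit (100 pages), the maximum is returned.

+ image_angle integer

Document angle: the number of degrees the original document must be rotated counterclockwise to become upright; valid only when the document is an image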

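+ details object

Document extraction result

   ++ additionalProp1 object
    +++ value string

Recognized value of the field

    +++ position array

Coordinates of value in the document after the document has been rotated upright: an array of length 8, indexed [0,1,2,3,4,5,6,7]

  • (0, 1) top-left corner
  • (2, 3) top-right corner
  • (4, 5) bottom-right corner
  • (6, 7) bottom-left corner
    +++ description string

Chinese description of the field

    +++ lines
   ++ additionalProp2 object
    +++ value string

Recognized value of the field

    +++ position array

Coordinates of value in the document after the document has been rotated upright: an array of length 8, indexed [0,1,2,3,4,5,6,7]

  • (0, 1) top-left corner
  • (2, 3) top-right corner
  • (4, 5) bottom-right corner
  • (6, 7) bottom-left corner
    +++ description string

Chinese description of the field

   ++ additionalProp3 object
    +++ value string

Recognized value of the field

    +++ position array

Coordinates of value in the document after the document has been rotated upright: an array of length 8, indexed [0,1,2,3,4,5,6,7]

  • (0, 1) top-left corner
  • (2, 3) top-right corner
  • (4, 5) bottom-right corner
  • (6, 7) bottom-left corner
    +++ description string

Chinese description of the field

   ++ row array

Extraction result for table_header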

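+ detail_structure array

Recognition information for the fields

   ++ doc_type string

Document type

   ++ page_range array

Range of pages on which the extracted information is located

   ++ tables array

Table information

    +++ position array

Coordinates

    +++ page_number number

Page on which the table is located

    +++ text string

Table in HTML form

   ++ tables_relationship array

Structured information for the tables

    +++ row_count number

Number of rows

    +++ column_count number

Number of columns

    +++ cells array

Cell information

    +++ title string

Title

   ++ category array

All fields obtained by structured extraction

   ++ fields object

Structured result of the extracted fields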

    +++ additionalProp1 array
     ++++ value string

Recognized value of the field

     ++++ bounding_regions array

Bounding-box information

      +++++ page_number integer

Page number

      +++++ value string

Text content

      +++++ position array

Coordinates of the text

      +++++ char_pos array

Coordinates of each character

    +++ additionalProp2 array
     ++++ value string

Recognized value of the field

     ++++ bounding_regions array

Bounding-box information

      +++++ page_number integer

Page number

      +++++ value string

Text content

      +++++ position array

Coordinates of the text

      +++++ char_pos array

Coordinates of each character

    +++ additionalProp3 array
     ++++ value string

Recognized value of the field

     ++++ bounding_regions array

Bounding-box information

      +++++ page_number integer

Page number

      +++++ value string

Text content

      +++++ position array

Coordinates of the text

      +++++ char_pos array

Coordinates of each character

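   ++ stamps array

Stamp (seal) recognition results

    +++ color string

Color of the stamp

  • 红色 (red)
  • 蓝色 (blue)
  • 黑色 (black)
  • 其他 (other)
    +++ position array

Coordinates of the stamp

    +++ stamp_shape string

Shape of the stamp

  • 圆章 (round)
  • 椭圆章 (oval)
  • 方章 (square)
  • 三角章 (triangular)
  • 菱形章 (diamond)
  • 其他 (other)
    +++ type string

Type of the stamp

  • 公章 (official seal)
  • 个人章 (personal seal)
  • 专用章 (special-purpose seal)
  • 其他 (other)
  • 合同专用章 (contract seal)
  • 财务专用章 (finance seal)
  • 发票专用章 (invoice seal)
  • 业务专用章 (business seal)
    +++ value string

Text content of the stamp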

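+ finish_reason string

Reason why inference ended

  • stop: inference finished normally
  • length: inference ended because the token limit was exceeded
+ documents array

Document elements on every page of the document

   ++ page_id integer

Page on which the element is located

   ++ position array

Coordinates

   ++ paragraph_id integer

ID of the current paragraph

   ++ type string

Type of the current element

  • paragraph: textual content, including body text, headings, and formulas
  • image: an image
  • table: a table; for tables, text holds the table's HTML
   ++ text string

Text content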
JSON Examples

Example 1: Prompt mode, object result

{"result":{"pages":[{"angle":0,"durations":770.3035888671875,"height":1024,"image_id":"53a462433a4f77b5.jpg","width":1192,"page_id":1,"status":"Success"}],"success_count":1,"version":"v1.1.3","llm_json":{"基金代码":"011892","持仓金额":"74178.80","确认日期":"2024/4/3"},"raw_json":{"确认日期":{"pages":[1],"value":"2024/4/3","bounding_regions":[{"value":"2024/4/3","position":[854,180,996,180,996,213,854,213],"char_pos":[[854,186,871,186,872,212,854,213],[872,185,890,184,890,213,872,212],[891,184,907,185,907,212,891,213],[908,184,927,183,927,213,908,212],[928,183,942,184,942,212,929,213],[944,182,963,181,963,212,943,211],[964,180,980,180,980,211,964,211],[981,183,996,183,996,209,981,210]],"page_id":1}]},"基金代码":{"pages":[1],"value":"011892","bounding_regions":[{"page_id":1,"value":"011892","position":[201,199,308,199,308,230,201,230],"char_pos":[[202,202,218,201,218,230,201,229],[220,202,235,202,236,228,220,229],[238,201,253,200,252,229,239,228],[255,200,271,201,272,227,254,228],[272,201,289,200,290,228,272,227],[289,199,308,200,306,227,290,226]]}]},"持仓金额":{"value":"74178.80","bounding_regions":[{"value":"74178.80","position":[505,191,645,191,645,223,505,223],"char_pos":[[505,193,522,194,522,222,506,223],[524,194,542,193,542,222,524,221],[540,193,555,193,554,222,542,222],[559,193,573,193,574,220,558,221],[574,193,592,192,592,221,576,220],[596,213,601,214,601,220,595,220],[611,192,627,191,628,220,611,219],[628,191,645,192,645,219,629,219]],"page_id":1}],"pages":[1]}}},"result_count":1,"msg":"success","code":200,"message":"success","x_request_id":"3047304efb0ba055dde4809c8496847c"}
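As an illustration of how the image_id values in result.pages might be used, the sketch below builds one download request per page image, following the download URL and headers given in the image_id description. The helper name and credentials are placeholders, and the requests are constructed but not sent.

```python
import urllib.request
from urllib.parse import urlencode

DOWNLOAD_URL = "https://api.textin.com/ocr_image/download"


def image_requests(response, app_id, secret_code):
    """Build one GET request per page that carries an image_id."""
    headers = {"x-ti-app-id": app_id, "x-ti-secret-code": secret_code}
    requests = []
    for page in response.get("result", {}).get("pages", []):
        image_id = page.get("image_id")
        if image_id:  # absent unless page images were requested
            url = f"{DOWNLOAD_URL}?{urlencode({'image_id': image_id})}"
            requests.append(urllib.request.Request(url, headers=headers))
    return requests


# Trimmed copy of the pages entry from the example response.
response = {"result": {"pages": [
    {"page_id": 1, "status": "Success", "image_id": "53a462433a4f77b5.jpg"},
    {"page_id": 2, "status": "Success"},
]}}
reqs = image_requests(response, "YOUR_APP_ID", "YOUR_SECRET_CODE")
print([r.full_url for r in reqs])
```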

Example 2: Prompt mode, array result

{"x_request_id":"f6cd2d7e8cdd21a717a89b13e74bb6f9","result":{"success_count":1,"version":"v1.1.3","llm_json":[{"基金代码":"011892","持仓金额":"74178.80","确认日期":"2024/4/3"},{"基金代码":"000188","持仓金额":"501034.18","确认日期":"2024/4/3"}],"raw_json":[{"基金代码":{"bounding_regions":[{"position":[201,199,308,199,308,230,201,230],"char_pos":[[202,202,218,201,218,230,201,229],[220,202,235,202,236,228,220,229],[238,201,253,200,252,229,239,228],[255,200,271,201,272,227,254,228],[272,201,289,200,290,228,272,227],[289,199,308,200,306,227,290,226]],"page_id":1,"value":"011892"}],"pages":[1],"value":"011892"},"持仓金额":{"pages":[1],"value":"74178.80","bounding_regions":[{"position":[505,191,645,191,645,223,505,223],"char_pos":[[505,193,522,194,522,222,506,223],[524,194,542,193,542,222,524,221],[540,193,555,193,554,222,542,222],[559,193,573,193,574,220,558,221],[574,193,592,192,592,221,576,220],[596,213,601,214,601,220,595,220],[611,192,627,191,628,220,611,219],[628,191,645,192,645,219,629,219]],"page_id":1,"value":"74178.80"}]},"确认日期":{"pages":[1],"value":"2024/4/3","bounding_regions":[{"page_id":1,"value":"2024/4/3","position":[854,180,996,180,996,213,854,213],"char_pos":[[854,186,871,186,872,212,854,213],[872,185,890,184,890,213,872,212],[891,184,907,185,907,212,891,213],[908,184,927,183,927,213,908,212],[928,183,942,184,942,212,929,213],[944,182,963,181,963,212,943,211],[964,180,980,180,980,211,964,211],[981,183,996,183,996,209,981,210]]}]}},{"持仓金额":{"pages":[1],"value":"501034.18","bounding_regions":[{"page_id":1,"value":"501034.18","position":[498,241,656,241,656,274,498,274],"char_pos":[[498,245,514,245,514,273,498,274],[514,246,532,245,532,273,514,272],[534,245,547,246,547,272,533,273],[550,244,567,243,566,271,551,271],[568,243,584,244,585,272,567,272],[585,244,603,243,603,271,586,272],[606,264,612,264,612,272,605,271],[622,244,636,244,636,270,622,271],[638,242,656,241,656,271,638,270]]}]},"确认日期":{"value":"2024/4/3","bounding_regions":[{"char_pos":[[854,186,871,186,872,212,854,213],[872,185,89
0,184,890,213,872,212],[891,184,907,185,907,212,891,213],[908,184,927,183,927,213,908,212],[928,183,942,184,942,212,929,213],[944,182,963,181,963,212,943,211],[964,180,980,180,980,211,964,211],[981,183,996,183,996,209,981,210]],"page_id":1,"value":"2024/4/3","position":[854,180,996,180,996,213,854,213]}],"pages":[1]},"基金代码":{"bounding_regions":[{"page_id":1,"value":"000188","position":[202,250,309,250,309,281,202,281],"char_pos":[[202,253,220,253,220,280,203,281],[220,253,237,252,236,281,220,280],[239,251,254,252,255,279,238,280],[257,253,272,252,271,279,258,278],[274,250,290,252,291,279,273,279],[291,251,309,250,309,278,292,279]]}],"pages":[1],"value":"000188"}}],"pages":[{"durations":772.7508544921875,"page_id":1,"status":"Success","width":1192,"image_id":"53a462433a4f77b5.jpg","height":1024,"angle":0}]},"result_count":1,"msg":"success","code":200,"message":"success"}
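Since raw_json can be an object or an array of objects depending on the prompt, client code may want one routine that handles both shapes. The following sketch reduces raw_json to plain values and pulls out the first bounding region of a field; the helper names are hypothetical and the sample is a trimmed copy of one field from the examples.

```python
def simplify_raw_json(raw_json):
    """Reduce raw_json to plain values; accepts the object or the array form."""
    if isinstance(raw_json, list):
        return [simplify_raw_json(item) for item in raw_json]
    return {name: field.get("value") for name, field in raw_json.items()}


def first_region(field):
    """Return (page_id, position) of the first bounding region, or None."""
    regions = field.get("bounding_regions") or []
    if not regions:
        return None
    return regions[0].get("page_id"), regions[0].get("position")


sample = {
    "基金代码": {
        "pages": [1],
        "value": "011892",
        "bounding_regions": [
            {"page_id": 1, "value": "011892",
             "position": [201, 199, 308, 199, 308, 230, 201, 230]}
        ],
    }
}
print(simplify_raw_json(sample))
print(simplify_raw_json([sample, sample]))
print(first_region(sample["基金代码"]))
```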

Example 3: Custom key mode

{"version":"v1.6.5","code":200,"message":"success","duration":2825,"result":{"category":{"row":"item_list","additionalProp1":"one_to_one","additionalProp2":"one_to_one","additionalProp3":"one_to_one"},"rotated_image_width":1000,"rotated_image_height":2000,"page_count":10,"image_angle":90,"details":{"row":[{"additionalProp1":{"value":"字段识别结果","position":[100,200,200,200,300,200,100,300],"description":"字段中文描述","lines":[{"page":0,"text":"example","pos":[100,200,200,200,300,200,100,300],"angle":90,"char_pos":[[100,200,200,200,300,200,100,300]]}]},"additionalProp2":{"value":"字段识别结果","position":[100,200,200,200,300,200,100,300],"description":"字段中文描述","lines":[{"page":0,"text":"example","pos":[100,200,200,200,300,200,100,300],"angle":90,"char_pos":[[100,200,200,200,300,200,100,300]]}]},"additionalProp3":{"value":"字段识别结果","position":[100,200,200,200,300,200,100,300],"description":"字段中文描述","lines":[{"page":0,"text":"example","pos":[100,200,200,200,300,200,100,300],"angle":90,"char_pos":[[100,200,200,200,300,200,100,300]]}]}}],"additionalProp1":{"value":"字段识别结果","position":[100,200,200,200,300,200,100,300],"description":"字段中文描述","lines":[{"page":0,"text":"example","pos":[100,200,200,200,300,200,100,300],"angle":90,"char_pos":[[100,200,200,200,300,200,100,300]]}]},"additionalProp2":{"value":"字段识别结果","position":[100,200,200,200,300,200,100,300],"description":"字段中文描述","lines":[{"page":0,"text":"example","pos":[100,200,200,200,300,200,100,300],"angle":90,"char_pos":[[100,200,200,200,300,200,100,300]]}]},"additionalProp3":{"value":"字段识别结果","position":[100,200,200,200,300,200,100,300],"description":"字段中文描述","lines":[{"page":0,"text":"example","pos":[100,200,200,200,300,200,100,300],"angle":90,"char_pos":[[100,200,200,200,300,200,100,300]]}]}},"detail_structure":[{"doc_type":"string","page_range":[0],"tables":[{"position":[343,56,459,56,459,90,343,90],"page_number":0,"text":"string"}],"tables_relationship":[{"row_count":2,"column_count":2,"cells":[{"additionalProp1":[{"value":"strin
g","bounding_regions":[{"page_number":0,"value":"string","position":[343,56,459,56,459,90,343,90],"char_pos":[[343,56,459,56,459,90,343,90]]}]}],"additionalProp2":[{"value":"string","bounding_regions":[{"page_number":0,"value":"string","position":[343,56,459,56,459,90,343,90],"char_pos":[[343,56,459,56,459,90,343,90]]}]}],"additionalProp3":[{"value":"string","bounding_regions":[{"page_number":0,"value":"string","position":[343,56,459,56,459,90,343,90],"char_pos":[[343,56,459,56,459,90,343,90]]}]}]}],"title":"row"}],"category":["标题","性别"],"fields":{"additionalProp1":[{"value":"string","bounding_regions":[{"page_number":0,"value":"string","position":[343,56,459,56,459,90,343,90],"char_pos":[[343,56,459,56,459,90,343,90]]}]}],"additionalProp2":[{"value":"string","bounding_regions":[{"page_number":0,"value":"string","position":[343,56,459,56,459,90,343,90],"char_pos":[[343,56,459,56,459,90,343,90]]}]}],"additionalProp3":[{"value":"string","bounding_regions":[{"page_number":0,"value":"string","position":[343,56,459,56,459,90,343,90],"char_pos":[[343,56,459,56,459,90,343,90]]}]}]},"stamps":[{"color":"红色","position":[956,583,1362,590,1355,990,950,983],"stamp_shape":"圆章","type":"公章","value":"string"}]}],"finish_reason":"stop","documents":[[{"page_id":0,"position":[956,583,1362,590,1355,990,950,983],"paragraph_id":0,"type":"paragraph","text":"string"}]]}}
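In custom key mode, category says how each entry of details should be read: one_to_one entries are single objects and item_list entries are lists of rows. A minimal sketch of separating the two, assuming the shapes shown in the example above, could look like this; split_details is a hypothetical helper and the sample is a trimmed copy of the example.

```python
def split_details(result):
    """Separate single-value fields from table rows using result.category."""
    category = result.get("category", {})
    details = result.get("details", {})
    single_values, tables = {}, {}
    for key, kind in category.items():
        if key not in details:
            continue
        if kind == "one_to_one":
            single_values[key] = details[key].get("value")
        elif kind == "item_list":
            # Each row maps a column name to an object carrying a value.
            tables[key] = [
                {column: cell.get("value") for column, cell in row.items()}
                for row in details[key]
            ]
    return single_values, tables


result = {
    "category": {"row": "item_list", "additionalProp1": "one_to_one"},
    "details": {
        "row": [{"additionalProp1": {"value": "a"}, "additionalProp2": {"value": "b"}}],
        "additionalProp1": {"value": "c"},
    },
}
single_values, tables = split_details(result)
print(single_values)
print(tables)
```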

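Error Codes

Error code Description
40101 x-ti-app-id or x-ti-secret-code is empty
40102 x-ti-app-id or x-ti-secret-code is invalid; authentication failed
40103 Client IP is not in the whitelist
40003 Insufficient balance; please top up before using the service
40004 Parameter error; check the parameters against the technical documentation
40007 Robot does not exist or has not been published
40008 Robot is not activated; activate it in the marketplace and try again
40301 Unsupported image type
40302 Uploaded file size is invalid; the file must not exceed 50 MB
40303 Unsupported file type; the API returns the file type it actually detected, for example "当前文件类型为.gif" (the current file type is .gif)
40304 Image dimensions are invalid; width and height must be between 20 and 10000 pixels
40305 No file was uploaded for recognition
40306 QPS limit exceeded
40400 Invalid request URL; check that the URL is correct
30203 Underlying service failure; please try again later
500 Internal server error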
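A client can turn the code field into an exception with a readable description. The sketch below assumes that code 200 means success, which is inferred from the JSON examples rather than stated in the error table; the exception class, helper name, and the subset of descriptions are illustrative.

```python
# A few of the documented codes; extend with the remaining table entries as needed.
ERROR_DESCRIPTIONS = {
    40101: "x-ti-app-id or x-ti-secret-code is empty",
    40102: "x-ti-app-id or x-ti-secret-code is invalid",
    40003: "insufficient balance",
    40004: "parameter error",
    40302: "uploaded file size is invalid (limit 50 MB)",
    40306: "QPS limit exceeded",
    30203: "underlying service failure, retry later",
    500: "internal server error",
}


class TextInError(Exception):
    """Hypothetical exception carrying the service's error code."""

    def __init__(self, code, description):
        super().__init__(f"{code}: {description}")
        self.code = code


def check_response(response):
    """Return response['result'] on success, otherwise raise TextInError."""
    code = response.get("code")
    if code != 200:  # assumption: 200 is the success code, as in the examples
        fallback = response.get("message", "unknown error")
        raise TextInError(code, ERROR_DESCRIPTIONS.get(code, fallback))
    return response.get("result")


ok = check_response({"code": 200, "message": "success", "result": {"page_count": 1}})
print(ok)
```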