Document Processing

This module converts domain literature in various formats into data structures that models can process.

File Types

The platform currently supports literature in four formats: Markdown, PDF, DOCX, and TXT.
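
As a rough illustration of the format list above, the following sketch checks whether a file would be accepted based on its extension. The extension set and the function name `is_supported` are assumptions made for this example; the platform's actual validation logic is not described here.

```python
from pathlib import Path

# Assumed mapping of the four documented formats to file extensions.
# The platform may accept additional extensions (for example ".markdown").
SUPPORTED_EXTENSIONS = {".md", ".pdf", ".docx", ".txt"}


def is_supported(filename: str) -> bool:
    """Return True if the file extension matches one of the four formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(filenames: list[str]) -> tuple[list[str], list[str]]:
    """Separate a batch of filenames into accepted and rejected files."""
    accepted = [name for name in filenames if is_supported(name)]
    rejected = [name for name in filenames if not is_supported(name)]
    return accepted, rejected
```

A check like this could run before upload, so that unsupported files are reported to the user instead of failing later in processing.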

PDF Processing

Because PDF is harder to parse than the other formats, the platform offers four PDF processing methods for different scenarios. When you upload literature that includes PDF files, a dialog box prompts you to choose one of them.

Basic Parsing

This method quickly identifies the key outline of simple PDF files. It is efficient for well-structured plain-text reports and simple documentation, but it cannot accurately parse files with complex content such as large numbers of formulas and charts.

MinerU API Parsing

To use this method, configure a MinerU API Key under "Settings - Task Settings"; the platform then calls the MinerU API to parse the file. This method can deeply parse complex PDF files containing formulas and charts, which makes it suitable for academic papers, technical reports, and similar documents. The more complex the file, the longer parsing takes. You can apply for a MinerU API Key at https://mineru.net/apiManage/token. Note that the key is valid for 14 days, after which you need to reconfigure it.
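
Because the key is only valid for 14 days, it can help to track when it was issued. The sketch below assumes the validity period is counted in whole days from the issue date; the helper names `key_expiry` and `days_remaining` are hypothetical and not part of the platform or of MinerU.

```python
from datetime import date, timedelta

# Validity period stated in the documentation above.
KEY_VALIDITY = timedelta(days=14)


def key_expiry(issued_on: date) -> date:
    """Return the assumed expiry date: issue date plus 14 days."""
    return issued_on + KEY_VALIDITY


def days_remaining(issued_on: date, today: date) -> int:
    """Return the number of days left before expiry, never below zero."""
    return max((key_expiry(issued_on) - today).days, 0)
```

If `days_remaining` returns 0, the key should be treated as expired and replaced in "Settings - Task Settings".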

MinerU Online Platform Parsing

This option redirects to the MinerU online platform at https://mineru.net/OpenSourceTools/Extractor. There you can parse the PDF, download the resulting Markdown file, and then return to this platform to upload it.

Custom Vision Model Parsing

This method can recognize complex PDF files, including formulas and charts. It requires a vision model to be added in the model configuration; the PDF is then parsed by that custom vision model. Parsing rules and model parameters can be customized to suit different types of complex PDF files.

With MinerU API parsing or custom vision model parsing, PDF processing may take longer, so please wait for it to finish.

Under "Settings - Task Settings", you can configure the maximum concurrency for custom vision models and the maximum number of pages processed simultaneously. Higher concurrency speeds up processing, but keep it within the concurrency limit of your model provider.
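
To illustrate why a concurrency limit matters, here is a minimal sketch of bounded parallel page parsing using `asyncio.Semaphore`. It is not the platform's implementation: `parse_page` is a hypothetical stand-in for a single vision model request, and `max_concurrency` plays the role of the setting described above.

```python
import asyncio


async def parse_page(page_number: int) -> str:
    """Hypothetical stand-in for one vision model request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"markdown for page {page_number}"


async def parse_pdf(page_count: int, max_concurrency: int) -> list[str]:
    """Parse all pages while keeping at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_number: int) -> str:
        # Each task waits here until a slot is free.
        async with semaphore:
            return await parse_page(page_number)

    # gather returns results in the same order as the input pages.
    return await asyncio.gather(
        *(bounded(n) for n in range(1, page_count + 1))
    )


pages = asyncio.run(parse_pdf(page_count=8, max_concurrency=3))
```

Raising `max_concurrency` shortens total time only until the provider's own limit is reached; beyond that, requests are typically rejected or throttled.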

Text Segmentation

Before uploading, select a model in the top right corner; otherwise, processing will fail.

After uploading, the platform intelligently segments the text into blocks. The resulting list shows each text block together with its character count.
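
The platform's actual segmentation rules are covered in the "Custom Segmentation" chapter and are not reproduced here. As a simplified stand-in, the sketch below greedily packs paragraphs into blocks up to a character limit and reports the character count of each block. The function name `segment_text` and the `max_chars` parameter are assumptions for illustration only.

```python
def segment_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack paragraphs into blocks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own block.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    blocks: list[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the block.
            blocks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        blocks.append(current)
    return blocks


def block_sizes(blocks: list[str]) -> list[int]:
    """Return the character count of each block, as shown in the block list."""
    return [len(block) for block in blocks]
```

Splitting on paragraph boundaries keeps each block readable on its own, at the cost of uneven block sizes.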

We can view the details of each text block and edit its content.

For more information on the principles of text segmentation and how to customize segmentation rules to adapt to different literature structures, please refer to the "Custom Segmentation" chapter.

Literature Management

We can filter text blocks by the literature they were generated from.

For each piece of literature, we can preview its details (converted to Markdown), download it as Markdown, or delete it.
