Easy Dataset

Copyright © 2025 Easy Dataset

Document Processing

This module converts domain literature in various formats into data structures that models can understand.

File Types

Currently, the platform supports processing literature in four formats: Markdown, PDF, DOCX, and TXT.

Models understand well-structured Markdown best, so it is recommended to upload Markdown files whenever possible.

PDF Processing

Because PDF is harder to parse than the other formats, the platform supports four PDF processing methods for different scenarios. When you upload literature that includes PDF files, a dialog box appears with the following options:

Basic Parsing

Focuses on quickly identifying key outlines of simple PDF files. It is efficient for processing well-structured plain text reports and simple documentation, but cannot accurately parse files containing complex content such as large numbers of formulas and charts.

MinerU API Parsing

MinerU Online Platform Parsing

Custom Vision Model Parsing

Recognizes complex PDF files, including formulas and charts. This method requires adding a vision model under Model Configuration; the PDF is then parsed by that custom vision model. Parsing rules and model parameters can be customized to suit different types of complex PDF files.

MinerU API parsing and custom vision model parsing can take considerably longer to process a PDF, so please wait for processing to finish.

You can configure the maximum number of concurrent custom vision model calls and the maximum number of pages processed simultaneously through "Settings - Task Settings". Higher concurrency speeds up processing, but keep it within the concurrency limit of your model provider.
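The effect of the concurrency setting can be illustrated with a minimal Python sketch, assuming each PDF page is sent to the vision model as one request. This is not the platform's implementation: `parse_page` is a hypothetical stand-in for a model call, and `max_concurrency` plays the role of the value configured in Task Settings.

```python
import asyncio

# Tracks how many simulated requests run at once (for illustration only).
stats = {"in_flight": 0, "peak": 0}


async def parse_page(page_number: int) -> str:
    """Hypothetical stand-in for one vision-model request on one PDF page."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulated model latency
    stats["in_flight"] -= 1
    return f"markdown for page {page_number}"


async def parse_pdf(page_count: int, max_concurrency: int) -> list[str]:
    """Parse all pages while never exceeding max_concurrency requests."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(page_number: int) -> str:
        async with semaphore:  # waits here once the limit is reached
            return await parse_page(page_number)

    # gather() returns results in page order, regardless of completion order.
    return await asyncio.gather(*(limited(n) for n in range(1, page_count + 1)))


if __name__ == "__main__":
    pages = asyncio.run(parse_pdf(page_count=8, max_concurrency=3))
    print(len(pages), "pages parsed; peak concurrent requests:", stats["peak"])
```

Raising `max_concurrency` shortens the total wait only until the provider starts rejecting or throttling requests, which is why the setting should stay at or below the provider's documented limit.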

Text Segmentation

Before uploading, select a model in the top right corner; otherwise, processing will fail.

Note that there is no need to select a reasoning model (such as DeepSeek-R1) in this step. A normal question-answering model, such as Doubao or Qianwen, is sufficient. Reasoning models provide no advantage in this step and slow down processing.

After uploading, the platform automatically segments the text into blocks and displays each segmented text block together with its character count.
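To make the idea of text blocks and per-block character counts concrete, here is a minimal Python sketch. It is not the platform's actual segmentation algorithm: it assumes Markdown input, splits at heading lines, and caps each block at a hypothetical `max_chars` limit.

```python
import re


def split_markdown(text: str, max_chars: int = 1500) -> list[str]:
    """Split Markdown at headings, then cap each block at max_chars."""
    # Zero-width split before every line starting with 1-6 '#' and a space.
    sections = re.split(r"^(?=#{1,6} )", text, flags=re.MULTILINE)
    blocks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            blocks.append(section[start:start + max_chars])
    return blocks


if __name__ == "__main__":
    sample = "# Intro\nEasy Dataset overview.\n\n## Setup\nInstall and run.\n"
    for index, block in enumerate(split_markdown(sample), start=1):
        print(f"block {index}: {len(block)} characters")
```

A heading-based split like this also treats `# ` lines inside code blocks as headings, which is one reason a real splitter needs more careful rules.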

You can view the details of each text block and edit its content.

Literature Management

You can filter the text blocks generated from a specific piece of literature.

You can also preview the literature details (converted to Markdown), download the literature as Markdown, and delete the literature.

MinerU API Parsing: Configure the MinerU API Key through "Settings - Task Settings" to call the MinerU API for parsing. This method can deeply parse complex PDF files containing formulas and charts, which makes it suitable for academic papers, technical reports, and similar documents. The more complex the file, the slower the processing. You can apply for a MinerU API Key at https://mineru.net/apiManage/token (note that the key is valid for 14 days, after which you need to reconfigure it).

MinerU Online Platform Parsing: Redirects to the MinerU platform at https://mineru.net/OpenSourceTools/Extractor, where you can parse PDFs and download the resulting Markdown files, then return to this platform and upload them.

For more information on the principles of text segmentation and how to customize segmentation rules for different literature structures, refer to the "Custom Segmentation" chapter.