github/baiduwenkudownloader

Jay 626a8feb20 update repository files

2025-07-25 02:37:19 +00:00

Baidu Wenku Downloader

A self-developed tool for downloading documents from Baidu Wenku (百度文库), based on reverse engineering Baidu Wenku's Canvas rendering mechanism.

Chinese documentation

Official Website

Official website: Gitea Repository (https://lostjay.xyz/gitea/github/baiduwenkudownloader)

GitHub mirror: GitHub Repository

For complete source code access, please contact the author at: lostjaychi@gmail.com

Features

Download PDF documents from Baidu Wenku
Support for various document formats
Asynchronous processing for better performance

Demo Video

Watch the demo video on Bilibili: Baidu Wenku Downloader Demo

How to Find Document ID

The document ID is a unique identifier in the Baidu Wenku URL. For example, in the URL:

https://wenku.baidu.com/view/1898f455874769eae009581b6bd97f192279bff4.html

The document ID is: 1898f455874769eae009581b6bd97f192279bff4

You can find this ID in the URL of any Baidu Wenku document page. It's the string of characters between /view/ and .html in the URL.
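The rule above (take whatever sits between /view/ and .html) can be sketched in a few lines of Python. The helper name `extract_doc_id` is hypothetical and is not part of this project's published code; the pattern deliberately does not assume a fixed ID length.

```python
import re

# Hypothetical helper for illustration; not part of the project's API.
# Captures the text between "/view/" and ".html", stopping at any
# path, query, or fragment delimiter.
_DOC_ID_RE = re.compile(r"/view/([^/?#.]+)\.html")


def extract_doc_id(url: str) -> str:
    """Return the document ID embedded in a Baidu Wenku document URL."""
    match = _DOC_ID_RE.search(url)
    if match is None:
        raise ValueError(f"no document ID found in URL: {url!r}")
    return match.group(1)


print(extract_doc_id(
    "https://wenku.baidu.com/view/1898f455874769eae009581b6bd97f192279bff4.html"
))
```

For the example URL this prints 1898f455874769eae009581b6bd97f192279bff4. A URL without a /view/....html segment raises ValueError rather than returning a guess.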

Test API

A test API endpoint is available for testing PDF downloads:

https://lostjay.xyz/api/get_pdf?doc_id=6bb03e2669dc5022aaea00e5

Important Note: Only one PDF can be processed at a time. If the server is busy, you may receive a "Server is busy, please try again later" response.
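Because the server handles one PDF at a time, a client may want to retry when it is busy. The following is a minimal sketch using only the Python standard library, under several assumptions that the README does not confirm: that a successful response body is the raw PDF (which always starts with the bytes `%PDF`), and that the busy message appears as text in the response body. The function names, retry count, and delay are all illustrative.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable

API_URL = "https://lostjay.xyz/api/get_pdf"
# Assumption: the busy notice is returned as text in the response body.
BUSY_MESSAGE = b"Server is busy, please try again later"


def build_pdf_url(doc_id: str) -> str:
    """Build the test API URL for a given document ID."""
    return f"{API_URL}?{urllib.parse.urlencode({'doc_id': doc_id})}"


def http_get(url: str) -> bytes:
    """Fetch a URL and return its body, including the body of HTTP error responses."""
    try:
        with urllib.request.urlopen(url, timeout=120) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        return err.read()


def fetch_pdf(
    doc_id: str,
    fetch: Callable[[str], bytes] = http_get,
    attempts: int = 5,
    delay: float = 10.0,
) -> bytes:
    """Request a PDF, retrying while the server reports that it is busy."""
    url = build_pdf_url(doc_id)
    for attempt in range(attempts):
        body = fetch(url)
        if body.startswith(b"%PDF"):  # every PDF file begins with this marker
            return body
        if BUSY_MESSAGE not in body:
            raise RuntimeError(f"unexpected response: {body[:80]!r}")
        if attempt < attempts - 1:
            time.sleep(delay)  # give the server time to finish its current job
    raise RuntimeError("server stayed busy after all attempts")
```

Calling `fetch_pdf("6bb03e2669dc5022aaea00e5")` and writing the returned bytes to a file opened in binary mode would save the document. The `fetch` parameter exists so the retry logic can be exercised without network access.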

Prerequisites

Python 3.11 or higher
Node.js and npm
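A quick way to confirm these prerequisites before installing is sketched below. The function `missing_prerequisites` is a hypothetical helper, not something shipped with the project, and it assumes the Node.js and npm executables are named `node` and `npm` on the PATH.

```python
import shutil
import sys
from typing import Callable, Optional, Sequence


def missing_prerequisites(
    version: Sequence[int] = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of prerequisites that were not found (empty if all are present)."""
    missing = []
    if tuple(version) < (3, 11):
        missing.append("Python 3.11 or higher")
    # Assumption: the executables are called "node" and "npm".
    for tool in ("node", "npm"):
        if which(tool) is None:
            missing.append(tool)
    return missing
```

Calling `missing_prerequisites()` with no arguments checks the running interpreter and the current PATH; an empty list means everything listed above was found.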

Installation

Install Python dependencies:

pip install .

Install Node.js dependencies:

npm install

Start the Node.js service:

npm start

Project Structure

baiduwenkudownloader/: Main Python package
CrawlerUtils/: Utility functions for web crawling
test/: Test files
- test_downloader.py: Contains main test functions

Testing

The main test function is located in test/test_downloader.py. To run the PDF download test:

cd test
python -m unittest test_downloader -k test_get_pdf

Dependencies

Python Dependencies

bs4
lxml
curl-cffi
tenacity

Node.js Dependencies

canvas
express
jspdf

Disclaimer

This project is for educational and research purposes only. Users are responsible for complying with all applicable laws and regulations. The authors do not endorse or encourage any unauthorized use of this software. Please respect intellectual property rights and use this tool responsibly.

License

ISC License

Description

A self-developed tool for downloading documents from Baidu Wenku (百度文库), based on reverse engineering Baidu Wenku's Canvas rendering mechanism. This repository publishes only part of the code; for complete source code access, please contact the author at: lostjaychi@gmail.com

https://lostjay.xyz/gitea/github/baiduwenkudownloader
