From a310d237b4158c213023243118d7ba12d2a549d2 Mon Sep 17 00:00:00 2001
From: jay
Date: Wed, 11 Jun 2025 13:07:49 +0000
Subject: [PATCH] update README files

---
 README.md    | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 README_zh.md | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 182 insertions(+)
 create mode 100644 README.md
 create mode 100644 README_zh.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..f484e0b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,92 @@
+# Baidu Wenku Downloader
+
+A self-developed tool for downloading documents from Baidu Wenku (百度文库), based on reverse engineering Baidu Wenku's Canvas rendering mechanism.
+
+[中文文档](README_zh.md)
+
+## Official Website
+
+Official website: https://lostjay.xyz/gitea/github/baiduwenkudownloader
+
+For source code access, please contact the author at: lostjaychi@gmail.com
+
+## Features
+
+- Download PDF documents from Baidu Wenku
+- Support for various document formats
+- Asynchronous processing for better performance
+
+## How to Find Document ID
+
+The document ID is a unique identifier in the Baidu Wenku URL. For example, in the URL:
+```
+https://wenku.baidu.com/view/1898f455874769eae009581b6bd97f192279bff4.html
+```
+The document ID is: `1898f455874769eae009581b6bd97f192279bff4`
+
+You can find this ID in the URL of any Baidu Wenku document page. It's the string of characters between `/view/` and `.html` in the URL.
+
+## Prerequisites
+
+- Python 3.11 or higher
+- Node.js and npm
+
+## Installation
+
+1. Install Python dependencies:
+```bash
+pip install .
+```
+
+2. Install Node.js dependencies:
+```bash
+npm install
+```
+
+3. Start the Node.js service:
+```bash
+npm start
+```
+
+## Project Structure
+
+- `baiduwenkudownloader/`: Main Python package
+- `CrawlerUtils/`: Utility functions for web crawling
+- `test/`: Test files
+  - `test_downloader.py`: Contains main test functions
+
+## Testing
+
+The main test function is located in `test/test_downloader.py`. To run the PDF download test:
+
+```bash
+cd test
+python -m unittest test_downloader -k test_get_pdf
+```
+
+## Dependencies
+
+### Python Dependencies
+- bs4
+- lxml
+- curl-cffi
+- tenacity
+
+### Node.js Dependencies
+- canvas
+- express
+- jspdf
+
+## Disclaimer
+
+This project is for educational and research purposes only. Users are responsible for complying with all applicable laws and regulations. The authors do not endorse or encourage any unauthorized use of this software. Please respect intellectual property rights and use this tool responsibly.
+
+## License
+
+ISC License
+
+## Buy Me a Milk Tea
+
+If you find this project helpful, feel free to buy the author a milk tea via WeChat Pay!
+
+![WeChat Pay QR Code](https://lostjay.xyz/wechatpay)
diff --git a/README_zh.md b/README_zh.md
new file mode 100644
index 0000000..cfe49f8
--- /dev/null
+++ b/README_zh.md
@@ -0,0 +1,90 @@
+# 百度文库下载器
+
+一个基于百度文库 Canvas 渲染机制逆向工程的自研文档下载工具。
+
+## 官方网站
+
+官方网站:https://lostjay.xyz/gitea/github/baiduwenkudownloader
+
+如需获取源代码,请联系作者:lostjaychi@gmail.com
+
+## 功能特点
+
+- 支持下载百度文库 PDF 文档
+- 支持多种文档格式
+- 异步处理提升性能
+
+## 如何查找文档 ID
+
+文档 ID 是百度文库 URL 中的唯一标识符。例如,在 URL:
+```
+https://wenku.baidu.com/view/1898f455874769eae009581b6bd97f192279bff4.html
+```
+文档 ID 为:`1898f455874769eae009581b6bd97f192279bff4`
+
+您可以在任何百度文库文档页面的 URL 中找到此 ID。它是 URL 中 `/view/` 和 `.html` 之间的字符串。
+
+## 环境要求
+
+- Python 3.11 或更高版本
+- Node.js 和 npm
+
+## 安装步骤
+
+1. 安装 Python 依赖:
+```bash
+pip install .
+```
+
+2. 安装 Node.js 依赖:
+```bash
+npm install
+```
+
+3. 启动 Node.js 服务:
+```bash
+npm start
+```
+
+## 项目结构
+
+- `baiduwenkudownloader/`: 主 Python 包
+- `CrawlerUtils/`: 网络爬虫工具函数
+- `test/`: 测试文件
+  - `test_downloader.py`: 包含主要测试函数
+
+## 测试
+
+主要测试函数位于 `test/test_downloader.py`。要运行 PDF 下载测试:
+
+```bash
+cd test
+python -m unittest test_downloader -k test_get_pdf
+```
+
+## 依赖项
+
+### Python 依赖
+- bs4
+- lxml
+- curl-cffi
+- tenacity
+
+### Node.js 依赖
+- canvas
+- express
+- jspdf
+
+## 免责声明
+
+本项目仅供教育和研究目的使用。用户需遵守所有适用的法律法规。作者不认可或鼓励任何未经授权使用本软件的行为。请尊重知识产权并负责任地使用本工具。
+
+## 许可证
+
+ISC License
+
+## 请作者喝杯奶茶
+
+如果您觉得这个项目有帮助,欢迎通过微信支付请作者喝杯奶茶!
+
+![微信支付二维码](https://lostjay.xyz/wechatpay)
\ No newline at end of file