tio-boot 整合 Playwright
简介
在使用 Playwright 进行网页数据抓取时,每次启动 Playwright 都会带来较大的性能开销。为了解决这一问题,可以在服务启动时初始化 Playwright,并在服务关闭时正确释放资源,从而显著提升性能。本文将介绍如何使用 TioBoot 整合 Playwright,实现高效的网页数据获取服务,并探讨 BrowserContextPool 的设计理念以及线程安全相关的注意事项。
整合示例
1. 添加依赖
在你的 pom.xml
文件中添加 Playwright 的依赖:
<dependency>
<groupId>com.microsoft.playwright</groupId>
<artifactId>playwright</artifactId>
<version>1.27.0</version> <!-- 请检查最新版本 -->
</dependency>
2. 线程安全的使用
在早期的实现中,为每个请求创建新的 Playwright 实例和浏览器上下文,如下所示:
import com.litongjava.annotation.RequestPath;
import com.litongjava.model.body.RespBodyVo;
import com.microsoft.playwright.Browser;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.BrowserType;
import com.microsoft.playwright.BrowserType.LaunchOptions;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;
import lombok.extern.slf4j.Slf4j;
@RequestPath("/playwrite")
@Slf4j
public class PlaywriteTestController {
LaunchOptions launchOptions = new BrowserType.LaunchOptions().setHeadless(false);
public RespBodyVo newContext() {
String link1 = "https://studentservices.stanford.edu/calendar/academic-dates/stanford-academic-calendar-2024-2025";
//显示网页
for (int i = 0; i < 3; i++) {
try (Playwright playwright = Playwright.create()) {
BrowserType chromium = playwright.chromium();
try (Browser broswer = chromium.launch(launchOptions);) {
try (BrowserContext context = broswer.newContext()) {
try (Page page = context.newPage()) {
//显示窗口
page.navigate(link1);
String bodyText = page.innerText("body");
} catch (Exception e) {
log.error("Error getting content from {}: {}", link1, e.getMessage(), e);
}
}
}
}
}
try {
Thread.sleep(200000);
} catch (InterruptedException e) {
e.printStackTrace();
}
return RespBodyVo.ok();
}
}
为什么 BrowserContextPool
要创建对应数量的 Playwright、Browser 和 BrowserContext?
在 BrowserContextPool
的设计中,我们预先创建了固定数量的 Playwright、Browser 和 BrowserContext 实例,并将它们存入各自的线程安全队列中。这么做有以下几个原因:
减少启动开销:Playwright 和浏览器实例的启动过程消耗较大,通过在服务启动时预先创建这些实例,可以避免在每次请求时重新启动,从而大幅提升响应速度。
资源复用:通过池化管理,可以复用已经创建的实例,减少重复创建和销毁对象的资源消耗,提高资源利用率。
应对高并发请求:在高并发场景下,提前创建多个实例可以同时处理多个请求,避免因为等待资源创建而导致的延迟。
线程安全的管理:使用
BlockingQueue
等线程安全数据结构,可以确保在多线程环境下安全地获取和归还资源。
线程安全性分析
需要注意的是,Playwright 中的 BrowserContext
和 Page
对象都是线程不安全的。这意味着:
- 在同一时间内,不能在多个线程中共享同一个
BrowserContext
或Page
实例,否则可能导致数据竞争、状态混乱等问题。 - 每个线程在使用
BrowserContext
或Page
时,应该确保其独占性,或者从资源池中获取专属的实例进行操作。
在我们的实现中,为每次具体网页操作从池中获取一个 BrowserContext
,创建新的 Page
进行页面操作。操作完成后,关闭 Page
并将 BrowserContext
归还池中。这种设计有效地避免了在多个线程之间共享不安全的对象,确保了线程安全。
public static String getHtml(String url) {
BrowserContext context = null;
Page page = null;
String content = "";
try {
// 从池中获取一个上下文,最多等待5秒
context = contextPool.acquire(5, TimeUnit.SECONDS);
if (context == null) {
throw new RuntimeException("无法获取 BrowserContext");
}
page = context.newPage();
page.navigate(url);
content = page.content();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException("获取 BrowserContext 被中断", e);
} finally {
if (page != null) {
page.close();
}
// 将上下文归还池中
if (context != null) {
contextPool.release(context);
}
}
return content;
}
BrowserContext 与 Page 的线程安全性
线程不安全:由于
BrowserContext
和Page
不是线程安全的,同一实例不能并发使用。如果在不同线程中共享同一个实例,会产生不可预知的行为。因此,每个线程应当获取自己独立的BrowserContext
和Page
实例。活动状态的 BrowserContext:当通过
Browser.newContext()
方法创建一个新的BrowserContext
时,它会智能地处于“活动状态”,这意味着该上下文已经准备好接受页面创建和导航等操作。每个BrowserContext
通常关联一个或多个Page
对象,用于执行具体的网页操作。Browser 同时只能运行一个 BrowserContext 处于活动状态
通过这种方式,系统在每个请求处理期间只使用自己的上下文和页面,避免了线程安全问题。页面操作完成后,通过关闭 Page
并将 BrowserContext
归还到池中,使得资源得以高效复用。
2. Playwright 实例管理
为了解决性能瓶颈,我们设计了一个 BrowserContextPool,在服务启动时初始化一定数量的 Playwright、Browser 和 BrowserContext 实例,并将它们放入线程安全的池中复用。
package com.litongjava.perplexica.instance;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import com.microsoft.playwright.Browser;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.BrowserType;
import com.microsoft.playwright.BrowserType.LaunchOptions;
import com.microsoft.playwright.Playwright;
public class BrowserContextPool {
private final BlockingQueue<Playwright> playwrightPool;
private final BlockingQueue<Browser> brwoserPool;
private final BlockingQueue<BrowserContext> browserContextPool;
public BrowserContextPool(int poolSize) {
this.playwrightPool = new LinkedBlockingQueue<>(poolSize);
this.brwoserPool = new LinkedBlockingQueue<>(poolSize);
this.browserContextPool = new LinkedBlockingQueue<>(poolSize);
LaunchOptions launchOptions = new BrowserType.LaunchOptions().setHeadless(true);
// 预先创建上下文并放入池中
for (int i = 0; i < poolSize; i++) {
Playwright playwright = Playwright.create();
playwrightPool.offer(playwright);
Browser brwoser = playwright.chromium().launch(launchOptions);
brwoserPool.offer(brwoser);
BrowserContext browserContext = brwoser.newContext();
//browserContext.newPage();
browserContextPool.offer(browserContext);
}
}
/**
* 从池中获取一个 BrowserContext。如果池为空,则等待指定时间后返回null。
*/
public BrowserContext acquire(long timeout, TimeUnit unit) throws InterruptedException {
return browserContextPool.poll(timeout, unit);
}
/**
* 将使用完毕的 BrowserContext 归还到池中
*/
public void release(BrowserContext context) {
if (context != null) {
browserContextPool.offer(context);
}
}
/**
* 释放池中所有的 BrowserContext 资源
*/
public void close() {
for (Playwright context : playwrightPool) {
context.close();
}
playwrightPool.clear();
for (Browser context : brwoserPool) {
context.close();
}
brwoserPool.clear();
for (BrowserContext context : browserContextPool) {
context.close();
}
browserContextPool.clear();
}
}
package com.litongjava.perplexica.instance;
import java.util.concurrent.TimeUnit;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.Page;
public enum PlaywrightBrowser {
INSTANCE;
// 定义池化管理器
public static BrowserContextPool contextPool;
static {
// 初始化上下文池,假设池大小为10,可根据需要调整
contextPool = new BrowserContextPool(3);
}
public static void init() {
}
public static void close() {
// 关闭上下文池中的所有上下文
contextPool.close();
}
public static String getHtml(String url) {
BrowserContext context = null;
Page page = null;
String content = "";
try {
// 从池中获取一个上下文,最多等待5秒
context = contextPool.acquire(5, TimeUnit.SECONDS);
if (context == null) {
throw new RuntimeException("无法获取 BrowserContext");
}
page = context.newPage();
page.navigate(url);
content = page.content();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException("获取 BrowserContext 被中断", e);
} finally {
if (page != null) {
page.close();
}
// 将上下文归还池中
if (context != null) {
contextPool.release(context);
}
}
return content;
}
public static String getBodyContent(String url) {
BrowserContext context = null;
Page page = null;
String textContent = "";
try {
context = contextPool.acquire(5, TimeUnit.SECONDS);
if (context == null) {
throw new RuntimeException("无法获取 BrowserContext");
}
page = context.newPage();
page.navigate(url);
textContent = page.innerText("body");
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException("获取 BrowserContext 被中断", e);
} finally {
if (page != null) {
page.close();
}
if (context != null) {
contextPool.release(context);
}
}
return textContent;
}
public static BrowserContext acquire() {
try {
return contextPool.acquire(60, TimeUnit.SECONDS);
} catch (InterruptedException e) {
throw new RuntimeException(e.getMessage(), e);
}
}
public static void release(BrowserContext context) {
contextPool.release(context);
}
}
3. Playwright 配置类
在服务启动时初始化 Playwright,并在服务关闭时自动释放资源,确保高效的资源管理。
package com.litongjava.perplexica.config;
import com.litongjava.annotation.AConfiguration;
import com.litongjava.annotation.Initialization;
import com.litongjava.hook.HookCan;
import com.litongjava.perplexica.instance.PlaywrightBrowser;
@AConfiguration
public class PlaywrightConfig {
@Initialization
public void config() {
// 启动
PlaywrightBrowser.init();
// 服务关闭时,自动关闭浏览器和 Playwright 实例
HookCan.me().addDestroyMethod(() -> {
PlaywrightBrowser.close();
});
}
}
以上配置确保:
- 在服务启动时调用
PlaywrightBrowser.init()
初始化资源池。 - 在服务关闭时,通过注册的销毁方法自动关闭所有 Playwright 和浏览器实例,防止资源泄漏。
4. 测试线程是否安全
public RespBodyVo newContext2() {
String link1 = "https://studentservices.stanford.edu/calendar/academic-dates/stanford-academic-calendar-2024-2025";
//模拟20个用户
for (int i = 1; i < 20; i++) {
TioThreadUtils.submit(() -> {
//显示网页
for (int j = 0; j < 3; j++) {
BrowserContext context = PlaywrightBrowser.acquire();
try (Page page = context.newPage()) {
//显示窗口
page.navigate(link1);
String bodyText = page.innerText("body");
} catch (Exception e) {
log.error("Error getting content from {}: {}", link1, e.getMessage(), e);
}
PlaywrightBrowser.release(context);
}
try {
Thread.sleep(200000);
} catch (InterruptedException e) {
e.printStackTrace();
}
});
}
return RespBodyVo.ok();
}
5.多线程爬取网页
说明:
- 使用
@AConfiguration
和@Initialization
注解,在服务启动时自动创建Browser
实例。 - 将 Playwright 和 Chromium 浏览器的启动设置为无头模式,适合服务器环境。
- 在服务销毁时,利用
TioBootServer
的addDestroyMethod
方法,确保浏览器和 Playwright 实例被正确关闭,避免资源泄漏。
package com.litongjava.perplexica.services;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import com.litongjava.perplexica.instance.PlaywrightBrowser;
import com.litongjava.perplexica.vo.ChatWsRespVo;
import com.litongjava.perplexica.vo.CitationsVo;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.utils.hutool.FilenameUtils;
import com.litongjava.tio.utils.thread.TioThreadUtils;
import com.litongjava.tio.websocket.common.WebSocketResponse;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.Page;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class SpiderService {
public StringBuffer spider(ChannelContext channelContext, long answerMessageId, List<CitationsVo> citationList) {
ChatWsRespVo<String> vo;
WebSocketResponse websocketResponse;
//5.获取内容
StringBuffer pageContents = new StringBuffer();
for (int i = 0; i < citationList.size(); i++) {
String link = citationList.get(i).getLink();
String suffix = FilenameUtils.getSuffix(link);
if ("pdf".equals(suffix)) {
log.info("skip:{}", suffix);
} else {
String bodyText = null;
try {
bodyText = PlaywrightBrowser.getBodyContent(link);
} catch (Exception e) {
log.error(e.getMessage(), e);
vo = ChatWsRespVo.message(answerMessageId + "", "Error Failed to get " + link + " " + e.getMessage());
websocketResponse = WebSocketResponse.fromJson(vo);
if (channelContext != null) {
Tio.bSend(channelContext, websocketResponse);
}
continue;
}
pageContents.append("source " + (i + 1) + " " + bodyText).append("\n\n");
}
}
return pageContents;
}
public StringBuffer spiderAsync(ChannelContext channelContext, long answerMessageId, List<CitationsVo> citationList) {
List<Future<String>> futures = new ArrayList<>();
for (int i = 0; i < citationList.size(); i++) {
final int index = i;
final String link = citationList.get(i).getLink();
Future<String> future = TioThreadUtils.submit(() -> {
String suffix = FilenameUtils.getSuffix(link);
if ("pdf".equalsIgnoreCase(suffix)) {
log.info("skip:{}", suffix);
return "";
} else {
BrowserContext context = PlaywrightBrowser.acquire();
try (Page page = context.newPage()) {
page.navigate(link);
String bodyText = page.innerText("body");
return "source " + (index + 1) + " " + bodyText + "\n\n";
} catch (Exception e) {
log.error("Error getting content from {}: {}", link, e.getMessage(), e);
ChatWsRespVo<String> vo = ChatWsRespVo.message(answerMessageId + "", "Error Failed to get " + link + " " + e.getMessage());
WebSocketResponse websocketResponse = WebSocketResponse.fromJson(vo);
if (channelContext != null) {
Tio.bSend(channelContext, websocketResponse);
}
return "";
} finally {
PlaywrightBrowser.release(context);
}
}
});
futures.add(future);
}
StringBuffer pageContents = new StringBuffer();
for (Future<String> future : futures) {
try {
String result = future.get();
if (result != null) {
pageContents.append(result);
}
} catch (InterruptedException | ExecutionException e) {
log.error("Error retrieving task result: {}", e.getMessage(), e);
}
}
return pageContents;
}
}
转为 markdown
通过控制器实现网页内容获取和 Markdown 转换功能。
网页内容获取控制器
package com.litongjava.playwright.controller;
import com.litongjava.annotation.RequestPath;
import com.litongjava.playwright.instance.PlaywrightBrowser;
import com.litongjava.tio.boot.http.TioRequestContext;
import com.litongjava.tio.http.common.HttpResponse;
import com.litongjava.tio.http.server.util.Resps;
import lombok.extern.slf4j.Slf4j;
@RequestPath("/playwright")
@Slf4j
public class PlaywrightController {
@RequestPath()
public HttpResponse index(String url) {
log.info("访问的 URL: {}", url);
String content = PlaywrightBrowser.getContent(url);
// 返回网页内容
return Resps.html(TioRequestContext.getResponse(), content);
}
}
Markdown 转换控制器
package com.litongjava.playwright.controller;
import com.litongjava.annotation.RequestPath;
import com.litongjava.playwright.instance.PlaywrightBrowser;
import com.litongjava.tio.boot.http.TioRequestContext;
import com.litongjava.tio.http.common.HttpResponse;
import com.litongjava.tio.http.server.util.Resps;
import com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter;
import lombok.extern.slf4j.Slf4j;
@RequestPath("/markdown")
@Slf4j
public class MarkdownController {
@RequestPath()
public HttpResponse markdown(String url) {
log.info("访问的 URL: {}", url);
String html = PlaywrightBrowser.getContent(url);
// 创建转换器实例
FlexmarkHtmlConverter converter = FlexmarkHtmlConverter.builder().build();
// 将 HTML 转换为 Markdown
String markdown = converter.convert(html);
// 返回网页内容
return Resps.html(TioRequestContext.getResponse(), markdown);
}
}
启动类
package com.litongjava.playwright;
import com.litongjava.annotation.AComponentScan;
import com.litongjava.playwright.instance.PlaywrightBrowser;
import com.litongjava.tio.boot.TioApplication;
@AComponentScan
public class PlaywrightApp {
public static void main(String[] args) {
boolean download = false;
for (String string : args) {
if ("--download".equals(string)) {
download = true;
break;
}
}
if (download) {
System.out.println("download start");
PlaywrightBrowser.getContent("https://tio-boot.litongjava.com/");
PlaywrightBrowser.close();
System.out.println("download end");
} else {
long start = System.currentTimeMillis();
TioApplication.run(PlaywrightApp.class, args);
long end = System.currentTimeMillis();
System.out.println((end - start) + "ms");
}
}
}
测试
部署服务后,可以通过以下地址进行测试:
获取网页内容:
http://localhost/playwright?url=https://www.sjsu.edu/registrar/calendar/fall-2024.php
获取 Markdown 格式的网页内容:
http://localhost/markdown?url=https://www.sjsu.edu/registrar/calendar/fall-2024.php
访问上述地址时,服务将返回指定网页的 HTML 内容或转换后的 Markdown 内容。
构建 Docker 镜像
为了简化部署过程,使用 Docker 容器化 Playwright 服务。以下是 Dockerfile 的配置示例:
# 第一阶段:构建
FROM litongjava/maven:3.8.8-jdk8u391 AS builder
WORKDIR /src
COPY pom.xml /src/
COPY src /src/src
RUN mvn package -DskipTests -Pproduction
# 第二阶段:运行
FROM litongjava/jdk:8u391-stable-slim
WORKDIR /app
COPY /src/target/playwright-server-1.0.0.jar /app/
# 安装 Chromium 浏览器
RUN apt update && apt install chromium -y && rm -rf /var/lib/apt/lists/* /var/cache/apt/archives/*
# 下载 Playwright 所需的浏览器依赖
RUN java -jar /app/playwright-server-1.0.0.jar --download
# 启动应用
CMD ["java", "-Xmx900m", "-Xms512m", "-jar", "playwright-server-1.0.0.jar"]
使用以下命令构建 Docker 镜像:
docker build -t litongjava/playwright-server:1.0.0 .
总结
本文介绍了如何将 Playwright 整合到 TioBoot 中,并通过 Docker 实现高效部署。通过在服务启动时加载 Playwright 实例,并在服务关闭时释放资源,显著提升了服务的响应性能。此方法特别适合需要频繁进行网页抓取或自动化测试的场景。
主要优势:
- 性能优化:避免每次请求都启动浏览器,减少启动开销。
- 资源管理:通过统一管理 Playwright 实例,确保资源高效利用和正确释放。
- 易于部署:使用 Docker 容器化,简化部署流程,提升环境一致性。
- 扩展性强:封装为 Web 服务,便于集成到现有系统,并对外提供标准 API 接口。
通过合理利用上述方法,可以在高并发、低延时的应用场景中,实现高效、可靠的网页数据获取服务,提升整体系统性能和用户体验。