Jsoup Example - Crawling Gnuboard

This example uses Jsoup to crawl a Gnuboard post. It downloads the post body, the images embedded in the body, and the attachments, and saves them as files. For basic Jsoup usage, see the post "jsoup : 자바 HTML 파서(Java HTML Parser)".

The test post lives at the URL below. The image in the body was uploaded with SmartEditor.

http://localhost:8080/bbs/board.php?bo_table=free&wr_id=1

1. Logging in

If the post is not public, you may need to log in. If you have an account, you can log in with the code below.

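import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Log in.
Connection.Response res = Jsoup.connect("http://localhost:8080/bbs/login_check.php")
          .data("mb_id", "admin", "mb_password", "1111")
          .method(Method.POST)
          .execute();

// Session ID used to keep the session alive
String sessionId = res.cookie("PHPSESSID");

// Resulting document
Document doc = res.parse();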

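The login handler URL is "http://localhost:8080/bbs/login_check.php". This URL can differ depending on where Gnuboard is installed. The login credentials are passed to the data() method. The form field name and value for the ID are "mb_id / admin", and those for the password are "mb_password / 1111". Change the ID and password to match your own test setup. The session ID is stored so that the session can be maintained; it is sent along with every subsequent page request.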

2. Fetching the post document

2.1 Fetching the document while keeping the session from the login above.

Document doc = Jsoup.connect("http://localhost:8080/bbs/board.php?bo_table=free&wr_id=1")
			.cookie("PHPSESSID", sessionId)
			.get();

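2.2 Fetching a public post without logging in.

Connection.Response res = Jsoup.connect("http://localhost:8080/bbs/board.php?bo_table=free&wr_id=1")
          .method(Method.GET)
          .execute();

// Keep the session so that attachments can be downloaded
String sessionId = res.cookie("PHPSESSID");

// Get the document
Document doc = res.parse();

Gnuboard requires the session to be maintained in order to download attachments. When a post is read, Gnuboard records that fact in the session, and the attachment download page checks that the post was read before it allows the download.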
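The code in this post carries only the PHPSESSID cookie. If a site also relies on other cookies, the whole cookie map can be passed along instead. With Jsoup, res.cookies() returns a Map<String, String>, and Connection.cookies(map) sends it back on the next request. For a plain URLConnection, the map has to be turned into a single "Cookie" header value. The following is a minimal sketch of that step using only the standard library; the class name CookieHeader and the second cookie in the demo are hypothetical and not part of the original example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: not part of Jsoup or Gnuboard.
class CookieHeader {

    // Joins cookie name/value pairs into one "Cookie" request header value,
    // e.g. "PHPSESSID=abc123; other=value".
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("PHPSESSID", "abc123");
        cookies.put("example_cookie", "1"); // made-up cookie for illustration
        // prints: PHPSESSID=abc123; example_cookie=1
        System.out.println(build(cookies));
    }
}
```

The result can be passed to urlConn.setRequestProperty("Cookie", ...) in place of a hand-built "PHPSESSID=" string.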

3. Getting the title

// HTML of the title section
<h2 id="bo_v_title">
    <span class="bo_v_tit">This is the first post.</span>
</h2>

// Code that gets the title
Element titleElement = doc.getElementById("bo_v_title");
if(titleElement != null) {
    System.out.println("Title: " + titleElement.text());
}

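In the default Gnuboard skin, the tag that contains the title has the id "bo_v_title". To get the data of the inner <span> tag, you can use doc.getElementsByClass("bo_v_tit");.

The difference between the two is that a lookup by id returns a single element, which can be null when nothing matches, so a null check is needed. A lookup by class returns an Elements collection (a list), so when no element matches you get an empty collection of size 0 rather than null.

titleElement.text() is used to strip the HTML and get only the text of the title.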

4. Getting the body

<!-- Body content start { -->
<div id="bo_v_con">
<p>This is the first post.</p>
<p><br /></p>
<p><a href="http://localhost:8080/bbs/view_image.php?fn=%2Fdata%2Feditor%2F1809%2F17321337ca0260d26993c534c0989de5_1536242820_0243.jpg" target="_blank" class="view_image">
<img src="http://localhost:8080/data/editor/1809/thumb-17321337ca0260d26993c534c0989de5_1536242820_0243_400x264.jpg" alt="17321337ca0260d26993c534c0989de5_1536242820_0243.jpg"/>
</a>
<br style="clear:both;" /> TEST</p>
</div>
<!-- } Body content end -->


// Code that gets the body
Element contentElement = doc.getElementById("bo_v_con");
if(contentElement != null) {

    // Download the images embedded in the post
    Elements imageElements = contentElement.getElementsByTag("img");
    for(int i = 0; i < imageElements.size(); i++) {
        ...
    }

    // Save the body to a file.
    saveContent(contentElement.html(), "content.html");
}

// Save the body to a file
private void saveContent(String content, String fileName) {
    String savePath = "D:/download/";

    File file = new File(savePath + fileName);
    // try-with-resources closes the writer and the stream automatically
    try (FileOutputStream fos = new FileOutputStream(file);
         PrintWriter pw = new PrintWriter(new OutputStreamWriter(fos, "UTF-8"))) {
        pw.write(content);
        pw.flush();
    } catch(Exception e) {
        e.printStackTrace();
    }
}

The post body is inside the <div> tag whose id is "bo_v_con". To preserve the HTML it contains, the body is retrieved as-is with contentElement.html().

5. Getting the embedded images

// Get the list of img elements.
Elements imageElements = contentElement.getElementsByTag("img");

// Download each image in turn.
for(int i = 0; i < imageElements.size(); i++) {
    Element image = imageElements.get(i);

    // Take the file name from the last segment of the src attribute value.
    String url = image.attr("src");
    String[] splitUrl = url.split("/");
    String fileName = splitUrl[splitUrl.length-1];

    // Download the file.
    this.saveAttach(url, fileName, sessionId);

    // Point the img tag's src attribute at the downloaded file.
    image.attr("src", fileName);
}

// Download a file.
private void saveAttach(String url, String fileName, String sessionId) {
    // Where files are saved
    String savePath = "D:/download/";

    File dir = new File(savePath);
    if(!dir.exists()) {
        dir.mkdirs();
    }

    URL fileUrl = null;
    URLConnection urlConn = null;
    InputStream is = null;
    OutputStream os = null;

    int readBytes = 0;
    byte[] buf = new byte[4096];

    try {
        fileUrl = new URL(url);
        urlConn = fileUrl.openConnection();
        // Send the session ID so that the session is maintained.
        urlConn.setRequestProperty("Cookie", "PHPSESSID="+sessionId);
        urlConn.connect();

        // Read the Content-Disposition header from the response.
        // If needed, the file name can be extracted from this header and used when saving.
        String contentDisposition = urlConn.getHeaderField("Content-Disposition");
        System.out.println("Content-Disposition : " + contentDisposition);

        is = urlConn.getInputStream();

        os = new BufferedOutputStream(new FileOutputStream(savePath + fileName));

        while((readBytes = is.read(buf)) != -1) {
            os.write(buf, 0, readBytes);
        }
    } catch(Exception e) {
        System.out.println("File save error : " + e.getMessage());
    } finally {
        if(is != null) try { is.close(); } catch(Exception ignore) {}
        if(os != null) try { os.close(); } catch(Exception ignore) {}
    }
}

This extracts every <img> element from the body, downloads each image, and changes the "src" attribute of each <img> tag to point at the downloaded file (in the current folder).
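Splitting the "src" value on "/" works for the URL in this test post, but it would keep a query string or percent-escapes in the file name if the URL had them. The following is a minimal sketch of a slightly more defensive variant built on java.net.URI. The class name UrlFileName and the fallback name "download.bin" are assumptions made for illustration, and URI.create() throws IllegalArgumentException for URLs that contain illegal characters such as unescaped spaces.

```java
import java.net.URI;

// Hypothetical helper: derives a local file name from a download URL.
class UrlFileName {

    // Returns the last path segment of the URL, without the query string
    // and with percent-escapes decoded. Falls back to "download.bin"
    // (an arbitrary default) when the path has no usable last segment.
    static String fromUrl(String url) {
        String path = URI.create(url).getPath(); // getPath() is already decoded
        if (path == null) {
            return "download.bin";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? "download.bin" : name;
    }

    public static void main(String[] args) {
        // prints: thumb-0243_400x264.jpg
        System.out.println(fromUrl("http://localhost:8080/data/editor/1809/thumb-0243_400x264.jpg"));
        // prints: a b.png
        System.out.println(fromUrl("http://localhost:8080/data/file/a%20b.png?v=1"));
    }
}
```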

6. Downloading attachments

<!-- Attachments start { -->
<section id="bo_v_file">
    <h2>Attachments</h2>
    <ul>
        <li>
            <i class="fa fa-download" aria-hidden="true"></i>
            <a href="http://localhost:8080/bbs/download.php?bo_table=free&amp;wr_id=1&amp;no=0" class="view_file_download">
                <strong>0043_27.zip</strong>
            </a>
        </li>
    </ul>
</section>
<!-- } Attachments end -->


// Download the attachments
Elements fileElements = doc.getElementsByClass("view_file_download");
for(int i = 0; i < fileElements.size(); i++) {
    Element file = fileElements.get(i);
    this.saveAttach(file.attr("href"), file.text(), sessionId);
}

Attachments are the <a> tags that have the "view_file_download" class. The file URL is read from the "href" attribute of the <a> tag with file.attr("href"). The file name is taken from the text of the <a> tag with the file.text() method.

If needed, the file name can instead be obtained by parsing the "Content-Disposition" header of the download response.
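Parsing that header could look roughly like the sketch below. It assumes the server sends the common form attachment; filename="..." (quoted or unquoted). It does not handle the RFC 5987 filename*= form or character-set issues with non-ASCII names. The class name ContentDispositionParser is hypothetical. Because the header value comes from the server, the sketch also drops any directory part so the name cannot point outside the save folder.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts a file name from a Content-Disposition value.
class ContentDispositionParser {

    // Matches filename="name" (group 1) or filename=name (group 2).
    private static final Pattern FILENAME = Pattern.compile(
            "filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);

    // Returns the file name found in the header, or fallback if there is none.
    static String fileName(String header, String fallback) {
        if (header == null) {
            return fallback;
        }
        Matcher m = FILENAME.matcher(header);
        if (!m.find()) {
            return fallback;
        }
        String name = m.group(1) != null ? m.group(1) : m.group(2);
        // Keep only the last path segment of whatever the server sent.
        int slash = Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\'));
        name = name.substring(slash + 1);
        return name.isEmpty() ? fallback : name;
    }

    public static void main(String[] args) {
        // prints: 0043_27.zip
        System.out.println(fileName("attachment; filename=\"0043_27.zip\"", "unknown"));
        // prints: unknown
        System.out.println(fileName(null, "unknown"));
    }
}
```

In saveAttach(), the fallback argument could be the name taken from the link text, so the header is used only when it is present.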

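7. Download results

In the default Gnuboard skin, the images embedded in the body are thumbnails. To download the original image, take the URL from the <a> tag that wraps the <img> tag and download the original from there.

For attachments, the session has to be maintained, so take care with that part. The downloaded body content is shown next. It would be even better to add the title and links to the attachments. To make things easier to manage, the body could also be stored in a database.

This is how the downloaded body looks in a browser.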
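In the sample HTML from section 4, the wrapping <a> tag points at view_image.php and carries the image path in its "fn" query parameter. That link may return a viewer page rather than the image file itself, so one option is to decode "fn" and resolve it against the link's host. The sketch below does this under the assumption that "fn" always holds the server-relative path of the original image, which the sample suggests but the source does not confirm. The class name OriginalImageUrl is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical helper: turns a view_image.php link into a direct image URL.
class OriginalImageUrl {

    // Returns the absolute URL built from the "fn" query parameter,
    // or null if the link has no such parameter.
    static String fromViewLink(String href) {
        URI link = URI.create(href);
        String query = link.getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("fn")) {
                String path = decode(pair.substring(eq + 1));
                // Resolve the server-relative path against the link's scheme and host.
                return link.resolve(path).toString();
            }
        }
        return null;
    }

    private static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // prints: http://localhost:8080/data/editor/1809/0243.jpg
        System.out.println(fromViewLink(
                "http://localhost:8080/bbs/view_image.php?fn=%2Fdata%2Feditor%2F1809%2F0243.jpg"));
    }
}
```

With Jsoup, the href would come from the parent of each <img> element, for example image.parent().attr("href") when that parent is an <a> tag.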
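Adding the title and attachment links to the saved file, as suggested above, could be done by wrapping the body HTML in a small page before writing it out. The following is only a sketch: the class name SavedPostPage, the page layout, and the map of attachment names to local file names are assumptions, not something the original code produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: wraps the crawled body in a standalone HTML page.
class SavedPostPage {

    // Escapes text so it can be placed safely in HTML content or attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // title: plain text; bodyHtml: HTML kept as-is;
    // attachments: display name -> local file name.
    static String build(String title, String bodyHtml, Map<String, String> attachments) {
        StringBuilder sb = new StringBuilder();
        sb.append("<!DOCTYPE html>\n<html>\n<head>\n<meta charset=\"UTF-8\">\n");
        sb.append("<title>").append(escape(title)).append("</title>\n</head>\n<body>\n");
        sb.append("<h1>").append(escape(title)).append("</h1>\n");
        sb.append(bodyHtml).append("\n");
        if (!attachments.isEmpty()) {
            sb.append("<ul>\n");
            for (Map.Entry<String, String> e : attachments.entrySet()) {
                sb.append("<li><a href=\"").append(escape(e.getValue())).append("\">")
                  .append(escape(e.getKey())).append("</a></li>\n");
            }
            sb.append("</ul>\n");
        }
        sb.append("</body>\n</html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attachments = new LinkedHashMap<>();
        attachments.put("0043_27.zip", "0043_27.zip");
        System.out.println(build("First post", "<p>TEST</p>", attachments));
    }
}
```

The returned string can be passed to saveContent() in place of contentElement.html().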

