Searching the Contents of PPT Files with Python

ioerror

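Situation

I needed to find, among the PowerPoint files uploaded to a server, the ones that contain a specific string.

Since there were well over a few thousand files, I decided to search only the files uploaded within a specified date range.

The plan was to first copy the files created within that period into a separate folder, then download that folder and search it on my PC.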

Copying files created within a given period to a given directory

import os
import shutil
from datetime import datetime

'''
dt = datetime.now()
print(dt.date())
'''
'''
# To take the search period from user input instead:
inputDate = datetime.strptime(str(input('Searching Input Start Date : ')), '%Y%m%d')
min_dt = inputDate.date()
inputDate = datetime.strptime(str(input('Searching Input End Date : ')), '%Y%m%d')
max_dt = inputDate.date()
'''
# No leading zeros here: datetime(2021, 01, 01) is a syntax error in Python 3
min_dt = datetime(2021, 1, 1).date()
max_dt = datetime(2021, 4, 1).date()

# Make sure the destination directory exists
os.makedirs('./findfiles', exist_ok=True)

for (path, dirs, files) in os.walk(r'./upload'):
    for filename in files:
        # os.walk yields bare file names, so join them with the current path
        src = os.path.join(path, filename)
        # Upload (creation) date
        fileCtime = datetime.fromtimestamp(os.path.getctime(src))
        # Last-modified date
        # fileMtime = datetime.fromtimestamp(os.path.getmtime(src))
        filedate = fileCtime.date()
        file_nm_pre = filename[0:6]
        if min_dt <= filedate <= max_dt and file_nm_pre == 'qcdocs':
            # Copy the file
            shutil.copy2(src, os.path.join('./findfiles', filename))
            # Print the name of the copied file
            print('Path : [%s], Filename : [%s], Date : [%s]' % (path, filename, fileCtime))

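In case a similar situation comes up again, I wrote the code so that the search period can be taken as input, but since I already knew the period this time, I hard-coded it.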
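
If the period is taken from input rather than hard-coded, the two dates can be parsed and validated in one place. The following is a minimal sketch: the helper name parse_date_range is hypothetical, and the %Y%m%d format simply mirrors the commented-out input code above.

```python
from datetime import datetime


def parse_date_range(start_text, end_text, fmt="%Y%m%d"):
    """Parse two date strings and return (min_dt, max_dt) as date objects.

    parse_date_range is a hypothetical helper, not part of the original script.
    """
    min_dt = datetime.strptime(start_text.strip(), fmt).date()
    max_dt = datetime.strptime(end_text.strip(), fmt).date()
    if min_dt > max_dt:
        # Catch a swapped range early instead of silently matching nothing
        raise ValueError("start date must not be after end date")
    return min_dt, max_dt


if __name__ == "__main__":
    min_dt, max_dt = parse_date_range("20210101", "20210401")
    print(min_dt, max_dt)
```

Checking the order of the two dates up front avoids a run that quietly copies no files because the range was entered backwards.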

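The creation time of a file is read with os.path.getctime() and the last-modified time with os.path.getmtime(). Note that getctime() returns the creation time only on Windows; on Unix-like systems it returns the time of the last metadata change.

If a file was created within the specified period, shutil.copy2() copies it to the target directory.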
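
The filter-and-copy step can also be written as a reusable function. This is a sketch under assumptions: copy_files_in_range and its parameters are hypothetical names, and it compares the last-modified date (os.path.getmtime) by default because that timestamp means the same thing on every platform. Passing os.path.getctime as get_time follows the script above instead.

```python
import os
import shutil
from datetime import datetime


def copy_files_in_range(src_root, dst_dir, min_dt, max_dt, get_time=os.path.getmtime):
    """Copy files under src_root whose date lies in [min_dt, max_dt] into dst_dir.

    Hypothetical helper. Returns the list of copied source paths.
    """
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for path, _dirs, files in os.walk(src_root):
        for filename in files:
            # os.walk yields bare names, so build the full source path
            src = os.path.join(path, filename)
            file_date = datetime.fromtimestamp(get_time(src)).date()
            if min_dt <= file_date <= max_dt:
                # copy2 also tries to preserve metadata such as the modification time
                shutil.copy2(src, os.path.join(dst_dir, filename))
                copied.append(src)
    return copied
```

The destination should live outside src_root so that copied files are not walked again, and files with the same name in different subfolders will overwrite each other in dst_dir.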

Finding a given string in PowerPoint files

First, install the PowerPoint module.

pip install python-pptx

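One thing to know here: python-pptx does not recognize ".ppt" files.

That is, files created in versions before Office 2007, or otherwise saved as .ppt, cannot be opened. In my tests, .pptx files created and saved in Google Slides were not recognized either. I searched everywhere and tested what I found, but there seems to be no way around it.

If anyone reading this knows a way, please leave a comment.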
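
One thing that can be checked up front is the container format. A .pptx file is a ZIP package, whereas a legacy .ppt file is a binary file, so the standard library can tell the two apart before python-pptx is involved. The sketch below only classifies files; it does not make .ppt files readable, and the function name looks_like_pptx is hypothetical.

```python
import zipfile


def looks_like_pptx(file_path):
    """Return True if file_path is a ZIP package containing ppt/presentation.xml.

    Hypothetical helper: a quick container check, not a full validity test.
    """
    if not zipfile.is_zipfile(file_path):
        # Legacy .ppt files are not ZIP archives, so they end up here
        return False
    with zipfile.ZipFile(file_path) as package:
        return "ppt/presentation.xml" in package.namelist()
```

Running this over the download folder gives a list of files that will need separate handling, regardless of what their extension claims.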

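With the Presentation class of python-pptx, text is extracted shape by shape for every object on a slide. A PPT file can contain many kinds of objects, and once they are grouped together the text becomes harder to find.

Fortunately, the PPT files I was looking for followed a fixed template, so I extracted the strings based on that layout.

For the rest, refer to the code below.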

import os
import shutil
from datetime import datetime
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

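'''
The search keywords can also be taken from input; they are split on commas and stored as a list.
inputKeywords = str(input('Search keywords : '))
keywords = inputKeywords.split(',')
'''
keywords = ['keyword', 'keyword1']

'''
Remove line breaks from a string. If a string in the file contains a search keyword
but has a line break in the middle of it, remove the break so the keyword can be found.
Example: the keyword is "하늘나라" but the string is "하늘\n나라", so the search fails.
This converts it to "하늘나라".
'''
def stripnl(text):
    # The parameter is named text so it does not shadow the built-in str
    text = text.strip("\n")
    lines = text.splitlines()
    return "".join(lines)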

'''
Collect the strings in the given file and return them as a list.
'''
def pptx_collect(pptx_file):
    inner_list = []  # list of extracted strings
    prs = Presentation(pptx_file)
    # Loop over every slide in the presentation.
    for slide in prs.slides:
        # Find the shapes that are groups.
        group_shapes = [
            shp for shp in slide.shapes
            if shp.shape_type == MSO_SHAPE_TYPE.GROUP
        ]
        # Add the text of the shapes inside each group to inner_list.
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    inner_list.append(stripnl(shape.text))
        # Table shapes have no child shapes, so filtering on MSO_SHAPE_TYPE.TABLE and
        # looping over .shapes does not work; cells are read through shape.table below.
        # Go through the top-level shapes.
        for shape in slide.shapes:
            # The shape is a table
            if shape.has_table:
                table = shape.table
                for row in table.rows:
                    for cell in row.cells:
                        txt = cell.text.strip()
                        if txt != '':
                            inner_list.append(stripnl(txt))
            # Not a table: skip the shape if it has no text
            if not shape.has_text_frame:
                continue
            # The shape has text: add its strings to inner_list.
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    txt = run.text.strip()
                    if txt != '':
                        inner_list.append(stripnl(txt))
    # Return the list of strings.
    return inner_list
    
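'''
Check whether the .pptx files in the folder holding the files from the
specified period contain the keywords.
'''
# Number of .pptx files examined (must be initialized before the loop)
cnt_pptx = 0

for (path, dirs, files) in os.walk(r'./findfiles/'):
    for filename in files:
        # Split the file name into name and extension.
        file_nm_r, file_xtd = os.path.splitext(filename)
        # Skip anything that is not .pptx
        if file_xtd not in ['.pptx']:
            continue
        cnt_pptx += 1
        # Extract the list of strings; os.walk yields bare names, so join with path.
        arr_texts = pptx_collect(os.path.join(path, filename))
        # Check whether each keyword appears in the extracted strings.
        for nm_p in keywords:
            # The keyword equals one of the strings: print the file name
            if nm_p in arr_texts:
                print(filename)
            else:
                for texts in arr_texts:
                    # texts = texts.upper()  # for English text, upper-case before comparing
                    # The keyword occurs inside a string: print the file name
                    if texts.find(nm_p) != -1:
                        print(filename)
                        break  # one hit per keyword is enough

print(cnt_pptx)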
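
The pptx_collect function above looks only one level into a group, so text inside nested groups is missed. A recursive walk handles any depth. The following is a sketch, not the method used in this post: iter_leaf_shapes and collect_text are hypothetical helpers, and the sketch assumes that only group shapes expose a shapes attribute, which keeps the example free of third-party imports.

```python
def iter_leaf_shapes(shapes):
    """Yield every non-group shape, descending into groups of any depth."""
    for shape in shapes:
        # Assumption: only group shapes have a .shapes collection
        children = getattr(shape, "shapes", None)
        if children is not None:
            # Group shape: recurse into its members
            yield from iter_leaf_shapes(children)
        else:
            yield shape


def collect_text(shapes):
    """Return the non-empty, stripped text of all leaf shapes with a text frame."""
    texts = []
    for shape in iter_leaf_shapes(shapes):
        if getattr(shape, "has_text_frame", False) and shape.text.strip():
            texts.append(shape.text.strip())
    return texts
```

With python-pptx this would be called as collect_text(slide.shapes) for each slide; tables would still need the separate shape.table handling shown above.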

Final code

Copy the files created within the period, then find the PPT files that contain the search keywords.

import os
import shutil
from datetime import datetime
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

'''
If needed, the search date range can be taken from input as follows.
inputMinDate = datetime.strptime(str(input('Search start date : ')), '%Y%m%d')
min_dt = inputMinDate.date()

inputMaxDate = datetime.strptime(str(input('Search end date : ')), '%Y%m%d')
max_dt = inputMaxDate.date()
'''
min_dt = datetime(2021, 4, 1).date()
max_dt = datetime.now().date()

'''
The search keywords can also be taken from input; they are split on commas and stored as a list.
inputKeywords = str(input('Search keywords : '))
keywords = inputKeywords.split(',')
'''
keywords = ['keyword', 'keyword1']

# Make sure the destination directory exists
os.makedirs('./findfiles', exist_ok=True)

for (path, dirs, files) in os.walk(r'./upload'):
    for filename in files:
        # os.walk yields bare file names, so join them with the current path
        src = os.path.join(path, filename)
        # getctime: creation time (on Unix-like systems, the last metadata change)
        fileCtime = datetime.fromtimestamp(os.path.getctime(src))
        # getmtime: last-modified time
        # fileMtime = datetime.fromtimestamp(os.path.getmtime(src))
        filedate = fileCtime.date()
        if min_dt <= filedate <= max_dt:
            print('Path : [%s], Filename : [%s], Date : [%s]' % (path, filename, fileCtime))
            shutil.copy2(src, os.path.join('./findfiles', filename))

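'''
Number of .pptx files examined
'''
cnt_pptx = 0

'''
Remove line breaks from a string. If a string in the file contains a search keyword
but has a line break in the middle of it, remove the break so the keyword can be found.
Example: the keyword is "하늘나라" but the string is "하늘\n나라", so the search fails.
This converts it to "하늘나라".
'''
def stripnl(text):
    # The parameter is named text so it does not shadow the built-in str
    text = text.strip("\n")
    lines = text.splitlines()
    return "".join(lines)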

'''
Collect the strings in the given file and return them as a list.
'''
def pptx_collect(pptx_file):
    inner_list = []  # list of extracted strings
    prs = Presentation(pptx_file)
    # Loop over every slide in the presentation.
    for slide in prs.slides:
        # Find the shapes that are groups.
        group_shapes = [
            shp for shp in slide.shapes
            if shp.shape_type == MSO_SHAPE_TYPE.GROUP
        ]
        # Add the text of the shapes inside each group to inner_list.
        for group_shape in group_shapes:
            for shape in group_shape.shapes:
                if shape.has_text_frame:
                    inner_list.append(stripnl(shape.text))
        # Table shapes have no child shapes, so filtering on MSO_SHAPE_TYPE.TABLE and
        # looping over .shapes does not work; cells are read through shape.table below.
        # Go through the top-level shapes.
        for shape in slide.shapes:
            # The shape is a table
            if shape.has_table:
                table = shape.table
                for row in table.rows:
                    for cell in row.cells:
                        txt = cell.text.strip()
                        if txt != '':
                            inner_list.append(stripnl(txt))
            # Not a table: skip the shape if it has no text
            if not shape.has_text_frame:
                continue
            # The shape has text: add its strings to inner_list.
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    txt = run.text.strip()
                    if txt != '':
                        inner_list.append(stripnl(txt))
    # Return the list of strings.
    return inner_list


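'''
Check whether the .pptx files in the folder holding the files from the
specified period contain the keywords.
'''
for (path, dirs, files) in os.walk(r'./findfiles/'):
    for filename in files:
        # Split the file name into name and extension.
        file_nm_r, file_xtd = os.path.splitext(filename)
        # Skip anything that is not .pptx
        if file_xtd not in ['.pptx']:
            continue
        cnt_pptx += 1
        # Extract the list of strings; os.walk yields bare names, so join with path.
        arr_texts = pptx_collect(os.path.join(path, filename))
        # Check whether each keyword appears in the extracted strings.
        for nm_p in keywords:
            # The keyword equals one of the strings: print the file name
            if nm_p in arr_texts:
                print(filename)
            else:
                for texts in arr_texts:
                    # texts = texts.upper()  # for English text, upper-case before comparing
                    # The keyword occurs inside a string: print the file name
                    if texts.find(nm_p) != -1:
                        print(filename)
                        break  # one hit per keyword is enough

print(cnt_pptx)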
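
Because the file name is printed inside the keyword loop, a file that matches several keywords is listed several times. One way to report each file once is to compute the list of matching keywords first. This is a sketch with a hypothetical helper name, find_keywords; the optional case folding corresponds to the upper() hint in the comments above.

```python
def find_keywords(texts, keywords, ignore_case=False):
    """Return the keywords that occur as a substring of any extracted string."""
    if ignore_case:
        texts = [t.casefold() for t in texts]
    matched = []
    for keyword in keywords:
        needle = keyword.casefold() if ignore_case else keyword
        # A substring test also covers the exact-match case
        if any(needle in text for text in texts):
            matched.append(keyword)
    return matched
```

In the search loop this could replace the inner keyword loop: call find_keywords(arr_texts, keywords) once per file and print the file name together with the returned list only when it is non-empty.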