Crawling access rules (robots.txt)

Robots exclusion standard

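The robots exclusion standard is a convention for restricting robot (crawler) access to a website.
The restrictions are usually written in a robots.txt file at the root of the site:
(website URL)/robots.txt
User-agent : * (all robots) , User-agent : bingbot (the robot named bingbot)
Disallow : / (disallow every path)
Allow : /$ (allow only the top-level path)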
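
Because the file always sits at the root of the host, its address can be derived from any page URL on the same site. A minimal sketch, assuming Python's standard urllib.parse; the function name robots_url and the example.com address are hypothetical and only used for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder host, not a site discussed in this post.
print(robots_url("https://example.com/owner/profile?id=1"))
# → https://example.com/robots.txt
```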

User-agent : bingbot
Disallow : /

bingbot is not allowed to crawl any path.

User-agent : *
Disallow : /owner

No robot is allowed to crawl paths that begin with /owner.
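
The two rule groups above can be checked programmatically. The following is a rough sketch using Python's standard urllib.robotparser; the crawler name mybot and the example.com URLs are assumptions made up for the example, and the rules are fed in as text instead of being downloaded:

```python
from urllib.robotparser import RobotFileParser

# The two example groups from this post, combined into one file.
RULES = """\
User-agent: bingbot
Disallow: /

User-agent: *
Disallow: /owner
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "mybot" and example.com are hypothetical placeholders.
print(parser.can_fetch("bingbot", "https://example.com/"))      # → False
print(parser.can_fetch("mybot", "https://example.com/owner"))   # → False
print(parser.can_fetch("mybot", "https://example.com/about"))   # → True
```

Disallow values are path prefixes, so in this sketch /owner also blocks a path such as /ownership.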

User-agent : bingbot
Crawl-delay : 10

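bingbot may make one request every 10 seconds, which allows at most 8,640 requests over 24 hours (86,400 seconds / 10). Crawl-delay is a non-standard directive, so not every crawler honors it.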
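
The 8,640 figure can be reproduced from the directive itself. A small sketch, assuming Python 3.6 or later, where urllib.robotparser exposes crawl_delay(); the variable names are illustrative only:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: bingbot", "Crawl-delay: 10"])

delay = parser.crawl_delay("bingbot")   # seconds between requests
seconds_per_day = 24 * 60 * 60          # 86,400
max_requests_per_day = seconds_per_day // delay

print(delay, max_requests_per_day)
# → 10 8640
```

A polite crawler would sleep for at least delay seconds between requests to the same host.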

Crawling rules of major websites

Daum

User-agent: *
Disallow: /

Daum denies every robot access to every page.

NAVER

User-agent: *
Disallow: /
Allow : /$

Naver denies access to every document except the front page.

Gmarket

User-agent: *
Allow: /

Gmarket allows access to every page of its site.

GOOGLE

User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$

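... (omitted)

Google distinguishes allowed pages from disallowed ones path by path, so the same prefix can be blocked in general and opened up for specific sub-paths.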
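
Rules like these mix the * wildcard and the $ end-of-path anchor, and more than one rule can match the same path. The sketch below is one possible way to evaluate them, assuming the "longest matching pattern wins, Allow wins ties" precedence described in RFC 9309. The helper names pattern_to_regex and is_allowed are hypothetical, and only a subset of the rules quoted above is included:

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """rules is a list of (allow, pattern); the longest match decides."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for allow, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)  # pattern length as a specificity measure
            if length > best_len or (length == best_len and allow):
                best_len, best_allow = length, allow
    return best_allow


NAVER_RULES = [(False, "/"), (True, "/$")]
GOOGLE_RULES = [
    (False, "/search"),
    (True, "/search/about"),
    (False, "/?"),
    (True, "/?hl="),
    (False, "/?hl=*&"),
    (True, "/?hl=*&gws_rd=ssl$"),
]

print(is_allowed(NAVER_RULES, "/"))                    # → True
print(is_allowed(NAVER_RULES, "/news"))                # → False
print(is_allowed(GOOGLE_RULES, "/search?q=python"))    # → False
print(is_allowed(GOOGLE_RULES, "/search/about"))       # → True
print(is_allowed(GOOGLE_RULES, "/?hl=ko&q=x"))         # → False
print(is_allowed(GOOGLE_RULES, "/?hl=ko&gws_rd=ssl"))  # → True
```

This is an illustration of the matching idea only; a production crawler should rely on a maintained robots.txt parser.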

Github

# If you would like to crawl GitHub contact us via https://support.github.com/contact/
# We also provide an extensive API: https://developer.github.com/
User-agent: baidu
crawl-delay: 1
User-agent: *

Disallow: /*/pulse
Disallow: /*/tree/
Disallow: /*/wiki*
Disallow: /gist/
...

GitHub asks anyone who wants to crawl the site to contact them, and points to its API as an alternative.
The bot named baidu is limited to one request per second.
