Expedia Hotel Recommendations

Intro

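Hello. I analyzed and visualized the data that Expedia released on Kaggle in 2016. The dataset records information about, and the behavior of, users who used Expedia in 2013-2014, together with the ID of the hotel cluster each user booked. Expedia grouped similar hotels by price range, star rating, location, and so on, and assigned each group an ID; such a group is called a hotel cluster. Expedia was interested in which users book hotels in which hotel cluster.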

Expedia Hotel Recommendations: Which hotel type will an Expedia customer book?
https://www.kaggle.com/competitions/expedia-hotel-recommendations/overview

Column Description (1)

| Column name | Description | Data type |
|---|---|---|
| date_time | Timestamp | string |
| site_name | ID of the Expedia point of sale (e.g. Expedia.com, Expedia.co.uk, Expedia.co.jp, ...) | int |
| posa_continent | ID of the continent associated with site_name | int |
| user_location_country | ID of the country in which the customer is located | int |
| user_location_region | ID of the region in which the customer is located | int |
| user_location_city | ID of the city in which the customer is located | int |
| orig_destination_distance | Physical distance between a hotel and a customer at the time of search. A null means the distance could not be calculated | double |
| user_id | ID of the user | int |
| is_mobile | 1 when a user connected from a mobile device, 0 otherwise | tinyint |
| is_package | 1 if the click/booking was generated as part of a package (i.e. combined with a flight), 0 otherwise | int |
| channel | ID of a marketing channel | int |
| srch_ci | Check-in date | string |
| srch_co | Check-out date | string |
| srch_adults_cnt | The number of adults specified in the hotel room | int |
| srch_children_cnt | The number of (extra occupancy) children specified in the hotel room | int |
| srch_rm_cnt | The number of hotel rooms specified in the search | int |
| srch_destination_id | ID of the destination where the hotel search was performed | int |
| srch_destination_type_id | Type of destination | int |
| hotel_continent | Hotel continent | int |
| hotel_country | Hotel country | int |
| hotel_market | Hotel market | int |
| is_booking | 1 if a booking, 0 if a click | tinyint |
| cnt | Number of similar events in the context of the same user session | bigint |
| hotel_cluster | ID of a hotel cluster | int |
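As a rough illustration of how the column descriptions above could translate into a pandas load step, the sketch below pins explicit dtypes for a handful of columns and parses the timestamp. The column names are the ones used on the Kaggle data page; the `load_expedia` helper, the `DTYPES` mapping, and the two sample rows are made up for this example and are not part of the original analysis.

```python
import io

import pandas as pd

# Explicit dtypes for a subset of columns (names as on the Kaggle data page).
# orig_destination_distance stays float because it can be null.
DTYPES = {
    "orig_destination_distance": "float64",
    "is_mobile": "int8",
    "is_booking": "int8",
    "cnt": "int64",
    "hotel_cluster": "int64",
}


def load_expedia(source):
    """Read Expedia rows from a path or file-like object (hypothetical helper)."""
    return pd.read_csv(source, dtype=DTYPES, parse_dates=["date_time"])


# Two invented rows, only to show the resulting dtypes.
sample = io.StringIO(
    "date_time,orig_destination_distance,is_mobile,is_booking,cnt,hotel_cluster\n"
    "2014-08-11 07:46:59,2234.2641,0,0,3,1\n"
    "2014-08-11 08:22:12,,1,1,1,1\n"
)
df = load_expedia(sample)
print(df.dtypes)
```

The same helper should accept a file path such as `expedia.csv` in place of the in-memory sample.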

Dataset

The original data is too large to analyze comfortably, so I split off and used only 100,000 rows.

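```bash
# Split train.csv into 100,000-line chunks (xaa, xab, ...); only the first chunk contains the header row
split -l 100000 train.csv
```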
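Because `split` cuts the file into many chunks and only the first one keeps the header line, a single subset file can also be produced directly. The sketch below is one possible Python alternative, not the method used here: `take_first_rows` is a hypothetical helper, and it assumes no field contains an embedded newline, so that one line equals one row.

```python
from itertools import islice


def take_first_rows(src_path, dst_path, n_rows=100_000):
    """Copy the header line plus the first n_rows data lines of a CSV file."""
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # n_rows + 1 lines in total: one header line, then n_rows data lines.
        dst.writelines(islice(src, n_rows + 1))
```

Calling `take_first_rows("train.csv", "expedia.csv")` would then write the header and 100,000 data rows without reading the whole file into memory.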

For visualization, I did some light preprocessing in Python and added a value that is not in the original data, hotel_nights, as follows.

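```python
import numpy as np
import pandas as pd

df = pd.read_csv("expedia.csv")  # assumed: the 100,000-row subset file

# Length of stay as a Timedelta (check-out minus check-in)
hotel_nights = pd.to_datetime(df["srch_co"]) - pd.to_datetime(df["srch_ci"])
df["hotel_nights_str"] = hotel_nights

# The same value as a float number of days
df["hotel_nights"] = (hotel_nights / np.timedelta64(1, "D")).astype(float)
```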
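To make the hotel_nights calculation concrete, here is a small self-contained sketch on invented rows. It differs from the code above in two labeled ways: it passes `errors="coerce"` so that malformed dates become missing values instead of raising, and it wraps the logic in a hypothetical `add_hotel_nights` helper.

```python
import pandas as pd


def add_hotel_nights(df):
    """Return a copy of df with a float hotel_nights column (hypothetical helper)."""
    out = df.copy()
    # Unparseable or missing dates become NaT rather than raising an error.
    check_in = pd.to_datetime(out["srch_ci"], errors="coerce")
    check_out = pd.to_datetime(out["srch_co"], errors="coerce")
    # Dividing a Timedelta by one day yields the stay length in days as a float.
    out["hotel_nights"] = (check_out - check_in) / pd.Timedelta(days=1)
    return out


# Invented example rows; the third has no check-in date.
toy = pd.DataFrame({
    "srch_ci": ["2014-08-27", "2014-12-30", None],
    "srch_co": ["2014-08-31", "2015-01-02", "2014-09-01"],
})
result = add_hotel_nights(toy)
print(result["hotel_nights"].tolist())
```

A stay from 2014-08-27 to 2014-08-31 comes out as 4.0 nights, and the row without a check-in date gets a missing value.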

Original data: train.csv (3975044.7KB)

Data used: expedia.csv (12875.7KB)

Analysis in HEARTCOUNT

Preferred continent destinations

Most people who book are from continent 3

Putting the two above together
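Outside HEARTCOUNT, the combined view of origin and destination continents could be approximated with a cross-tabulation. The sketch below assumes that "where people book from" corresponds to `posa_continent` and that the destination corresponds to `hotel_continent`; both mappings, the `origin_destination_table` helper, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def origin_destination_table(df):
    """Count bookings per (origin continent, hotel continent) pair (hypothetical helper)."""
    booked = df[df["is_booking"] == 1]  # keep bookings, drop clicks
    return pd.crosstab(booked["posa_continent"], booked["hotel_continent"])


# Invented rows: four bookings and one click.
toy = pd.DataFrame({
    "posa_continent": [3, 3, 3, 1, 1],
    "hotel_continent": [2, 2, 6, 2, 3],
    "is_booking": [1, 1, 1, 1, 0],
})
table = origin_destination_table(toy)
print(table)
```

Each cell counts the bookings made from one origin continent to one hotel continent, so the largest row and column totals point to the dominant origin and destination.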

Number of people booking from mobile, by continent

Number of booked nights, computed as the difference between check-in and check-out
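A pandas approximation of the mobile-by-continent view might look like the following. Which continent column the chart used is not stated, so `posa_continent` is an assumed default that can be swapped; the helper name and the toy data are likewise hypothetical.

```python
import pandas as pd


def mobile_bookings_by_continent(df, continent_col="posa_continent"):
    """Summarize total and mobile bookings per continent (hypothetical helper)."""
    booked = df[df["is_booking"] == 1]  # bookings only
    summary = booked.groupby(continent_col)["is_mobile"].agg(
        bookings="count",        # all bookings in the continent
        mobile_bookings="sum",   # is_mobile is 0/1, so the sum counts mobile bookings
    )
    summary["mobile_share"] = summary["mobile_bookings"] / summary["bookings"]
    return summary


# Invented rows: four bookings and one click.
toy = pd.DataFrame({
    "posa_continent": [3, 3, 3, 1, 1],
    "is_mobile": [1, 0, 1, 0, 0],
    "is_booking": [1, 1, 0, 1, 1],
})
summary = mobile_bookings_by_continent(toy)
print(summary)
```

Reporting the share alongside the raw count keeps a continent with many bookings from looking mobile-heavy merely because of its volume.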

Difference Analysis: is_booking

Difference Analysis: number of children
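HEARTCOUNT's Difference Analysis is a built-in feature whose internals are not described here, so the sketch below is not a reimplementation of it. It is only a rough pandas analogue under a simple assumption: compare the mean of a few numeric columns between bookings and clicks, and rank the columns by the size of the gap. The helper name, the chosen columns, and the toy rows are hypothetical.

```python
import pandas as pd


def mean_difference_by_booking(df, columns):
    """Mean of each column for bookings minus the mean for clicks (hypothetical helper)."""
    means = df.groupby("is_booking")[columns].mean()
    # Index 1 holds bookings, index 0 holds clicks.
    diff = means.loc[1] - means.loc[0]
    # Largest absolute gap first.
    return diff.sort_values(key=lambda s: s.abs(), ascending=False)


# Invented rows: two bookings and two clicks.
toy = pd.DataFrame({
    "is_booking": [1, 1, 0, 0],
    "srch_children_cnt": [0, 0, 2, 1],
    "srch_adults_cnt": [2, 2, 2, 1],
})
diff = mean_difference_by_booking(toy, ["srch_children_cnt", "srch_adults_cnt"])
print(diff)
```

A negative value means the column's average is lower among bookings than among clicks; on real data the gaps would still need a significance check before drawing conclusions.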