Automatic Blog Posting for Medium using Naver Blog contents — Day 2

김영석
3 min read · Sep 10, 2020

“A motivational idea that makes me move forward at least a little everyday.”

Continuing from the requirement analysis, I am going to talk about the Proof of Concept. I’ve got several items to prove to see if I can actually implement the derived requirements.

These are the items:

  • Is it possible to crawl Naver Blog using Scrapy based on URL patterns in someone’s account? If not, what options are left for accessing specific contents?
  • Is it possible to process the crawled data into something organized, so that I can save it and reuse it to make Medium API calls?

Okay then, let’s handle them one by one.

How to access Naver Blog contents that I want to crawl?

To achieve this objective, I came up with more items to check and tried to answer them as below.

Q1. Can I get a list of blog posts in someone’s account?

→ Yes.

Q2. How can I get a list of blog posts?

→ Option 1) I can get a partial list of posts by loading the HTML page at a URL of this pattern: https://blog.naver.com/PostList.nhn?blogId=elle81054. From there I can scrape the list from the table element with the “blog2_list blog2_categorylist” class. The problem is that I can’t get the full list this way, because the list is paginated by default. The URL doesn’t even change when moving between pages, which leaves me no option other than a different approach.
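To make Option 1 concrete, here is a minimal sketch of scraping links out of that table using only Python's standard library. The sample HTML, the link URLs, and the names `PostListParser` and `extract_post_links` are hypothetical; the only detail taken from the page is the table's class. The real markup inside the table may well differ.

```python
from html.parser import HTMLParser

# Made-up markup for illustration; only the table class comes from the real page.
SAMPLE_HTML = """
<table class="blog2_list blog2_categorylist">
  <tr><td><a href="https://example.com/post/1">First post</a></td></tr>
  <tr><td><a href="https://example.com/post/2">Second post</a></td></tr>
</table>
<a href="https://example.com/elsewhere">Outside the table</a>
"""


class PostListParser(HTMLParser):
    """Collect (title, href) pairs from links inside the post-list table."""

    TABLE_CLASSES = {"blog2_list", "blog2_categorylist"}

    def __init__(self):
        super().__init__()
        self.table_depth = 0      # > 0 while inside the target table
        self.current_href = None  # href of the <a> currently being read
        self.current_text = []
        self.posts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = set((attrs.get("class") or "").split())
            # Enter the target table, or track a table nested inside it.
            if self.table_depth or self.TABLE_CLASSES <= classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and attrs.get("href"):
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.posts.append((title, self.current_href))
            self.current_href = None


def extract_post_links(html):
    parser = PostListParser()
    parser.feed(html)
    return parser.posts


links = extract_post_links(SAMPLE_HTML)
print(links)
```

Even if this works on the first page, it only ever sees the posts rendered in that one page load, which is exactly the pagination problem described above.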

→ Option 2) Using the RSS feed is an alternative to Option 1.

Fortunately, I can get the contents from RSS by accessing “rss.blog.naver.com/${accountId}.xml”.

This gives me the metadata of the contents.

From the RSS feed XML file, I can also get some hints for composing a JSON schema to save the blog contents, as follows:

  • Category
  • Title
  • Blog Link
  • Description
  • Publish Date
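The fields above could be pulled out of the feed roughly like this. This is only a sketch under a few assumptions: that each post is an `<item>` using the standard RSS 2.0 tag names (`category`, `title`, `link`, `description`, `pubDate`), and that the feed is served over `https`. The sample feed, the JSON key names, and the helpers `build_feed_url` and `parse_feed` are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical feed trimmed to one item; a real feed may carry more fields.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample blog</title>
    <item>
      <category>Travel</category>
      <title>Hello</title>
      <link>https://blog.naver.com/sample_id/1</link>
      <description>Short summary</description>
      <pubDate>Thu, 10 Sep 2020 09:00:00 +0900</pubDate>
    </item>
  </channel>
</rss>"""

# JSON key -> RSS item tag (assumed standard RSS 2.0 names).
FIELDS = {
    "category": "category",
    "title": "title",
    "blog_link": "link",
    "description": "description",
    "publish_date": "pubDate",
}


def build_feed_url(account_id):
    """Fill the account ID into the feed URL pattern (https is an assumption)."""
    return f"https://rss.blog.naver.com/{account_id}.xml"


def parse_feed(xml_text):
    """Return one dict per <item>, with an empty string for any missing tag."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        post = {key: (item.findtext(tag) or "").strip()
                for key, tag in FIELDS.items()}
        posts.append(post)
    return posts


posts = parse_feed(SAMPLE_FEED)
print(build_feed_url("sample_id"))
print(json.dumps(posts, ensure_ascii=False, indent=2))
```

Keeping the publish date as the raw string from the feed avoids guessing at its format; it can be parsed later once real feed data has been inspected.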

Q3. How am I going to crawl the main contents from the blog link?

→ When I visited a blog link from the RSS feed, I found some repetitive patterns in the HTML elements.

Since the only contents I need from a blog post are the texts and images in the right sequence, it seems that I can get them from the div element with the “se-main-container” class.

All the contents inside the se-main-container div tag have the “se-component” class. Image content is marked with se-image, text content with se-text, and quoted text with se-quotation.

In a nutshell, using those patterns, I will extract the necessary contents in the right order and the right format.
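Here is a rough sketch of that extraction with Python's standard-library HTML parser. It assumes each component is a div carrying both “se-component” and its type class (se-text, se-image, or se-quotation) on the same element, and that an image component holds an img tag with a src attribute. The sample HTML and the names `PostBodyParser` and `extract_blocks` are hypothetical, and the real post markup may be nested differently.

```python
from html.parser import HTMLParser

KINDS = {"se-image": "image", "se-text": "text", "se-quotation": "quotation"}

# Made-up markup for illustration; only the class names come from the post.
SAMPLE_POST = """
<div class="se-main-container">
  <div class="se-component se-text"><p>First paragraph.</p></div>
  <div class="se-component se-image"><img src="https://example.com/a.png"></div>
  <div class="se-component se-quotation"><blockquote>A quoted line.</blockquote></div>
</div>
<div class="se-component se-text"><p>Outside the container.</p></div>
"""


class PostBodyParser(HTMLParser):
    """Collect se-component blocks, in document order, from se-main-container."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0           # number of currently open <div> tags
        self.container_depth = None  # depth at which se-main-container opened
        self.component_depth = None  # depth at which the current component opened
        self.current = None          # component being filled, if any
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self.div_depth += 1
            if self.container_depth is None and "se-main-container" in classes:
                self.container_depth = self.div_depth
            elif (self.container_depth is not None and self.current is None
                  and "se-component" in classes):
                kind = next((KINDS[c] for c in classes if c in KINDS), "other")
                self.current = {"kind": kind, "text": [], "src": None}
                self.component_depth = self.div_depth
        elif tag == "img" and self.current is not None and self.current["src"] is None:
            self.current["src"] = attrs.get("src")

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.current["text"].append(data.strip())

    def handle_endtag(self, tag):
        if tag != "div" or self.div_depth == 0:
            return
        if self.current is not None and self.div_depth == self.component_depth:
            done = self.current
            self.blocks.append({"kind": done["kind"],
                                "text": " ".join(done["text"]),
                                "src": done["src"]})
            self.current = None
        elif self.div_depth == self.container_depth:
            self.container_depth = None  # left the main container
        self.div_depth -= 1


def extract_blocks(html):
    parser = PostBodyParser()
    parser.feed(html)
    return parser.blocks


blocks = extract_blocks(SAMPLE_POST)
for block in blocks:
    print(block)
```

Because the parser walks the document from top to bottom, the blocks come out in the same sequence as they appear in the post, which is the ordering requirement mentioned above.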

Then, what’s next?

I am going to figure out how to make a POST API call for blog posting on Medium.

TO BE CONTINUED…

Written by 김영석

I love problem solving and hate repetition of tedious tasks. I like automating, streamlining, optimizing, things.
