“A motivational idea that makes me move forward at least a little everyday.”
Continuing from the requirement analysis, I am going to talk about the Proof of Concept. I have several items to verify before I know whether I can actually implement the derived requirements.
These are the items:
- Is it possible to crawl Naver Blog using Scrapy, based on the URL patterns in someone’s account? If not, what other options are there for accessing specific content?
- Is it possible to process the crawled data into something organized that I can save and reuse to make Medium API calls?
Okay then, let’s handle them one by one.
How do I access the Naver Blog content that I want to crawl?
To achieve this objective, I came up with more items to check and tried to answer them below.
Q1. Can I get a list of the blog posts in someone’s account?
→ Yes.
Q2. How can I get a list of blog posts?
→ Option 1) I can get a partial list of posts by loading the HTML page at a URL with this pattern: https://blog.naver.com/PostList.nhn?blogId=elle81054. Then I can scrape the list of posts from the table element with the “blog2_list blog2_categorylist” class. The problem is that I can’t get the full list this way, because the list is paginated by default. The URL doesn’t even change when I move between pages, so I can’t reach the other pages through a URL pattern. That leaves me no option other than a different approach.
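For reference, scraping the first page of Option 1 could look roughly like the sketch below. It uses only Python’s standard library instead of Scrapy so it runs anywhere, and the sample HTML, the link paths, and the PostListParser name are all made up for illustration. Only the table class comes from the page I inspected, and the real markup inside the table may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the table class is taken from the real page.
SAMPLE_HTML = """
<table class="blog2_list blog2_categorylist">
  <tr><td><a href="/example/1">First post</a></td></tr>
  <tr><td><a href="/example/2">Second post</a></td></tr>
</table>
<a href="/other">Not in the table</a>
"""

TABLE_CLASSES = {"blog2_list", "blog2_categorylist"}


class PostListParser(HTMLParser):
    """Collects (title, href) pairs from links inside the post-list table."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self.in_table = False   # True while inside the target table
        self.link = None        # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = set((attrs.get("class") or "").split())
            if TABLE_CLASSES <= classes:
                self.in_table = True
        elif tag == "a" and self.in_table and attrs.get("href"):
            self.link = {"title": "", "href": attrs["href"]}

    def handle_data(self, data):
        if self.link is not None:
            self.link["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == "a" and self.link is not None:
            self.posts.append(self.link)
            self.link = None
        elif tag == "table":
            # Assumes the target table contains no nested tables.
            self.in_table = False


parser = PostListParser()
parser.feed(SAMPLE_HTML)
for post in parser.posts:
    print(post["title"], post["href"])
```

Even if this works on the first page, it still only sees the posts that page renders, which is exactly the pagination problem described above.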

→ Option 2) Using the RSS feed is an alternative to Option 1.
Fortunately, the content is available through RSS.
By accessing “rss.blog.naver.com/${accountId}.xml”,
I can get the metadata for the posts.

The RSS feed XML file also gives me some hints for composing a JSON schema to save the blog posts. The fields are as follows:
- Category
- Title
- Blog Link
- Description
- Publish Date
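To make the mapping from feed to JSON concrete, here is a minimal sketch using only Python’s standard library. The sample XML is hypothetical and simply follows the usual RSS 2.0 tag names (category, title, link, description, pubDate); I am assuming the Naver feed uses the same names, and the JSON keys and the parse_feed function are my own choices, not a fixed schema.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a plain RSS 2.0 feed.
# The real Naver feed may contain extra tags or different values.
SAMPLE_RSS = """
<rss version="2.0">
  <channel>
    <title>Sample blog</title>
    <item>
      <category>Travel</category>
      <title>First post</title>
      <link>https://blog.naver.com/example/1</link>
      <description>Short summary</description>
      <pubDate>Wed, 01 Jan 2020 09:00:00 +0900</pubDate>
    </item>
  </channel>
</rss>
""".strip()

# JSON key -> RSS tag name (the keys are an assumed naming, not a standard).
FIELDS = {
    "category": "category",
    "title": "title",
    "link": "link",
    "description": "description",
    "publish_date": "pubDate",
}


def parse_feed(xml_text):
    """Turn every <item> in the feed into a flat dict of the fields above."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        # findtext returns None for a missing tag, so fall back to "".
        post = {key: (item.findtext(tag) or "").strip()
                for key, tag in FIELDS.items()}
        posts.append(post)
    return posts


posts = parse_feed(SAMPLE_RSS)
print(json.dumps(posts, indent=2, ensure_ascii=False))
```

The publish date is kept as the raw string from the feed here; converting it to a proper date can wait until I know what format the Medium side needs.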
Q3. How am I going to crawl the main content from the blog link?
→ When I visited a blog link that I got from the RSS feed, I found some repetitive patterns among the HTML elements.

Since the only content I need from a blog post is the text and images in the right sequence, it seems that I can get those from the div element with the “se-main-container” class.
Every piece of content inside the se-main-container div is classed as “se-component”. On top of that, image content carries se-image, text content carries se-text, and quoted text carries se-quotation.
In a nutshell, using those patterns, I will extract the necessary contents in the right order and right format.
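A rough sketch of that extraction is below. It uses Python’s built-in html.parser rather than Scrapy so it stays self-contained, and the sample HTML is a simplified guess at the structure: only the class names come from what I saw on the page. The SeComponentParser name and the output shape are hypothetical, and for brevity the sketch picks up every se-component div without checking that it sits inside se-main-container.

```python
from html.parser import HTMLParser

# Simplified, hypothetical markup; only the class names are from the real page.
SAMPLE_HTML = """
<div class="se-main-container">
  <div class="se-component se-text"><p>Hello from Naver.</p></div>
  <div class="se-component se-image"><div><img src="https://example.com/a.jpg"></div></div>
  <div class="se-component se-quotation"><blockquote>A quoted line.</blockquote></div>
</div>
"""

# Component class -> block type used in the output (assumed naming).
KINDS = {"se-text": "text", "se-image": "image", "se-quotation": "quote"}


class SeComponentParser(HTMLParser):
    """Collects text, image, and quote blocks in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.div_depth = 0      # how many <div> tags are currently open
        self.current = None     # component being collected, if any
        self.current_depth = 0  # div depth at which that component opened

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.div_depth += 1
            classes = (attrs.get("class") or "").split()
            if self.current is None and "se-component" in classes:
                kind = next((KINDS[c] for c in classes if c in KINDS), None)
                if kind is not None:
                    self.current = {"type": kind, "text": [], "src": None}
                    self.current_depth = self.div_depth
        elif tag == "img" and self.current is not None:
            # Keep the first image source found inside an image component.
            if self.current["type"] == "image" and self.current["src"] is None:
                self.current["src"] = attrs.get("src")

    def handle_data(self, data):
        if self.current is not None and self.current["type"] != "image":
            if data.strip():
                self.current["text"].append(data.strip())

    def handle_endtag(self, tag):
        if tag != "div":
            return
        # The component ends when the div that opened it is closed.
        if self.current is not None and self.div_depth == self.current_depth:
            if self.current["type"] == "image":
                block = {"type": "image", "src": self.current["src"]}
            else:
                block = {"type": self.current["type"],
                         "text": " ".join(self.current["text"])}
            self.blocks.append(block)
            self.current = None
        self.div_depth -= 1


parser = SeComponentParser()
parser.feed(SAMPLE_HTML)
for block in parser.blocks:
    print(block)
```

Because the parser walks the page from top to bottom, the blocks come out in the same order they appear in the post, which is the part I care about most.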
Then, what’s next?
I am going to figure out how to make a POST API call to publish a blog post on Medium.
TO BE CONTINUED…